Tensors
Tensors: The Building Blocks of Machine Learning
A tensor is a generalization of vectors and matrices to higher dimensions, used by libraries like PyTorch and TensorFlow to encode language numerically.
The Hierarchy of Dimensions
- 0-D: Scalar
- 1-D: Vector (a word)
- 2-D: Matrix (a sentence)
- 3-D: Stack of matrices (several sentences)
Why Tensors?
- GPU Acceleration
- Automatic Differentiation (Autograd)
Tensors in PyTorch: Syntax & Operations
Creating Tensors
To use PyTorch:
import torch
tensor_a = torch.tensor([1, 2, 3])
random_tensor = torch.randn(2, 3)  # a 2x3 matrix (2 words, 3 features each)
The Three Most Important Attributes
- Shape (.shape): The dimensions of the tensor.
- Data Type (.dtype): The numeric type of the elements (e.g., float32, int64).
- Device (.device): Where the tensor is stored (CPU or GPU).
Common Tensor Operations in NLP
- view() / reshape(): Changing the dimensions.
- squeeze() / unsqueeze(): Removing or adding dimensions of size 1.
- cat(): Joining tensors together.
Representing Words as Numbers
Text Feature Extraction
Convert unstructured text into a numerical form to train machine learning models.
One-Hot Encodings
Each word in the vocabulary gets a vector with all zeros except for a single 1.
Bag of Words (BoW)
Represents text by the frequency of each word in the document, ignoring grammar and word order.
TF-IDF (Term Frequency – Inverse Document Frequency)
Enhances BoW by weighing words based on how unique they are across documents.
- Term Frequency (TF): How often a word appears in a document.
- Inverse Document Frequency (IDF): How rare the word is across all documents.
- Pros: Reduces the impact of common words like “the” or “is”.
- Cons: Still does not capture semantic meaning or word relationships.
- High TF-IDF: the word is important in a specific document but rare across all documents.
- Low TF-IDF: the word is common across documents (like stopwords), or appears rarely in this document.
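The TF-IDF weighting above can be sketched in plain Python. This is a minimal version (raw term frequency and an unsmoothed IDF); real libraries such as scikit-learn add smoothing and normalization:

```python
import math

def tf_idf(docs):
    """Compute a TF-IDF weight for every word in every document."""
    n = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = {}
    for doc in docs:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    scores = []
    for doc in docs:
        tf = {w: doc.count(w) / len(doc) for w in set(doc)}
        # IDF: words that are rare across the corpus get a higher weight
        scores.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return scores

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
scores = tf_idf(docs)
# "the" appears in every document, so its IDF (and hence TF-IDF) is 0
```

Note how the stopword "the" scores exactly zero, while "dog" (unique to one document) scores higher than "cat" (shared by two): this is the high/low TF-IDF behavior described above.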
Word Embeddings
Transform words into dense vector representations in a high-dimensional space where similar words are close together.
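"Close together" is usually measured with cosine similarity. A minimal sketch with hypothetical 3-dimensional toy vectors (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 = similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: "cat" and "dog" point in similar directions, "car" does not
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
car = [0.1, 0.0, 0.9]
print(cosine_similarity(cat, dog))  # high (close to 1)
print(cosine_similarity(cat, car))  # low
```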
Semantic Analysis and the Nature of Meaning
The Challenge of Word Sense Disambiguation (WSD)
In English, many words are polysemous: they have multiple meanings depending on context. Word Sense Disambiguation (WSD) is the task of determining which "sense" (meaning) of a word is being used in a given context.
The Distributional Hypothesis
"You shall know a word by the company it keeps." (J.R. Firth)
Word2Vec and GloVe: Under the Hood
- Word2Vec: Uses a predictive approach (predict a word from its context, or the context from a word).
- GloVe (Global Vectors): Uses a count-based approach (built from global word co-occurrence statistics).
The "Aha!" Moment: Analogies and Understanding
The "black box" of the neural network has internally organized language into a structure that mirrors human concepts of gender, royalty, and verb tense.
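The classic demonstration is vector arithmetic: king - man + woman lands near queen. A sketch with hypothetical 2-dimensional toy vectors chosen to make the arithmetic visible (real embeddings learn such offsets from data):

```python
# Toy 2-D embeddings: dimension 0 ≈ "royalty", dimension 1 ≈ "feminine gender"
embeddings = {
    "king":  [0.9, 0.1],
    "queen": [0.9, 0.9],
    "man":   [0.1, 0.1],
    "woman": [0.1, 0.9],
    "apple": [0.0, 0.0],
}

def analogy(a, b, c):
    """Return the vocabulary word closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    def dist(w):
        return sum((t - v) ** 2 for t, v in zip(target, embeddings[w]))
    # Exclude the query words themselves, as Word2Vec tooling does
    return min((w for w in embeddings if w not in (a, b, c)), key=dist)

print(analogy("king", "man", "woman"))  # queen
```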
Training and Using NLP Models
Linear Models
What is an AI “model”?
AI models estimate or predict outcomes: f̂(X) = ŷ
Don’t let perfect be the enemy of the good
The total error for any model is the sum of these components:
- Bias: error from a model that is too simple (a bad model; underfitting).
- Variance: error from over-sensitivity to the particular training data (overfitting).
- Irreducible Error: noise inherent in the data that no model can remove.
Model size and parameter count
Every machine learning course begins with linear models.
Every local model is linear
Gradient Descent is the fundamental optimization algorithm, based on derivatives, used to train deep learning networks. This calculus also forms the basis for XAI (Explainable AI) methods like LIME (Local Interpretable Model-agnostic Explanations).
The Local Limit
Once researchers introduced non-linear activation functions (like the sigmoid), neural networks could capture complex relationships; the Universal Approximation Theorem states that a network with a single hidden layer can approximate any continuous function.
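Gradient descent, described above, can be sketched on a one-parameter linear model: fit y = w·x by repeatedly stepping against the derivative of the squared error (a minimal example with made-up data where the true w is 3):

```python
# Data generated by y = 3x; gradient descent should recover w ≈ 3
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

w = 0.0      # initial guess
lr = 0.05    # learning rate
for _ in range(200):
    # Derivative of mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad   # step downhill

print(round(w, 3))  # 3.0
```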
Tensors: The Building Blocks of Machine Learning
A tensor is a generalization of vectors and matrices to potentially higher dimensions.
The Hierarchy of Dimensions
- 0-D Tensor (Scalar): A single number.
- 1-D Tensor (Vector): An array of numbers.
- 2-D Tensor (Matrix): An array of vectors.
- 3-D Tensor: A stack of matrices.
Why Tensors?
- GPU Acceleration
- Automatic Differentiation (Autograd) - PyTorch tensors can keep a history of the operations performed on them. This allows the framework to automatically calculate gradients (derivatives) needed for backpropagation to train neural networks.
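To illustrate what autograd does under the hood, here is a deliberately tiny tape-style sketch in plain Python (not PyTorch's actual implementation, which is written in C++ and far more general): each value remembers how it was made, so gradients can flow backwards through the history.

```python
class Value:
    """A scalar that records its history so gradients can be backpropagated."""
    def __init__(self, data, parents=(), grad_fns=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents      # values this one was computed from
        self._grad_fns = grad_fns    # local derivative w.r.t. each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def backward(self, grad=1.0):
        self.grad += grad
        for parent, fn in zip(self._parents, self._grad_fns):
            parent.backward(fn(grad))

x = Value(2.0)
y = Value(5.0)
z = x * y + x        # dz/dx = y + 1 = 6, dz/dy = x = 2
z.backward()
print(x.grad, y.grad)  # 6.0 2.0
```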
Tensors in PyTorch: Syntax & Operations
The Three Most Important Attributes
- Shape (.shape): The dimensions of the tensor.
- Data Type (.dtype): The numeric type of the elements (e.g., float32, int64).
- Device (.device): Where the tensor is stored (CPU or GPU).
Common Tensor Operations in NLP
- view() / reshape(): Changing the dimensions.
- squeeze() / unsqueeze(): Removing or adding dimensions of size 1.
- cat() (Concatenate): Joining tensors together.
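These operations have direct NumPy analogues, shown here because the shapes are the point (in PyTorch the corresponding calls are tensor.reshape, tensor.squeeze / tensor.unsqueeze, and torch.cat):

```python
import numpy as np

a = np.arange(6)                     # shape (6,)
m = a.reshape(2, 3)                  # like view()/reshape(): shape (2, 3)

b = m[np.newaxis, :, :]              # like unsqueeze(0): shape (1, 2, 3)
c = b.squeeze(0)                     # like squeeze(0): back to shape (2, 3)

d = np.concatenate([m, m], axis=0)   # like cat(): shape (4, 3)
```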
Neural Networks
A typical neural network consists of three types of layers:
- Input Layer: Receives the initial data (features) to be processed.
- Hidden Layers: One or more intermediate layers where most of the computation occurs.
- Output Layer: Produces the final prediction or classification.
How Neural Networks Learn
Neural networks learn implicitly through a process called backpropagation:
1. Forward pass: Input data travels through the network, producing a prediction.
2. Error calculation: The difference between the prediction and the actual value is calculated.
3. Backward pass: The error is propagated backwards through the network.
4. Weight adjustment: Connection weights are updated to reduce the error.
Why are Neural Networks so widely used?
Neural networks excel at:
- Finding complex, non-linear patterns in data
- Processing high-dimensional data like images, audio, and text
- Learning hierarchical features automatically
- Making predictions with high accuracy when properly trained
Neural networks represent a significant shift from our previous models:
- Unlike linear regression, neural networks can model highly non-linear relationships.
- Unlike decision trees, which create explicit decision boundaries, neural networks learn implicit boundaries that can take virtually any shape.
- Like random forests, neural networks can achieve high accuracy, but they do so through a fundamentally different approach.
Deep Learning
Traditional neural networks have just one or two hidden layers. Deep neural networks can have hundreds of layers. With each additional layer, deep neural networks can represent increasingly abstract concepts
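The forward pass / backward pass / weight adjustment loop described above can be sketched for a single sigmoid neuron (a minimal from-scratch example on a made-up one-sample task; real networks stack many layers and use frameworks like PyTorch):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy task: train one neuron to output 1.0 for input 1.0
w, b = 0.5, 0.0
lr = 1.0
x, target = 1.0, 1.0

for _ in range(1000):
    # Forward pass
    pred = sigmoid(w * x + b)
    # Error calculation (squared loss) and backward pass via the chain rule
    d_pred = 2 * (pred - target)
    d_z = d_pred * pred * (1 - pred)   # derivative through the sigmoid
    # Weight adjustment
    w -= lr * d_z * x
    b -= lr * d_z

print(sigmoid(w * x + b))  # approaches the target of 1.0
```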
Optimization Objectives vs. Evaluation Metrics
In Neural Network training, we have two different scorecards: Loss and Accuracy.
Accuracy
Did the model pick the correct class? Accuracy is not useful as a training objective for gradient-based optimization: it ignores confidence. If Model A gives the correct answer with 51% confidence and Model B gives the correct answer with 99% confidence, Accuracy views these models as identical (both 100% correct).
Loss
How confident was the model in the correct answer? Accuracy is like a True/False quiz; Loss is like the game Hot or Cold. Cross-Entropy is the gold-standard loss function. Perplexity is the exponentiation of the Cross-Entropy loss: it represents how "confused" the model is. A perplexity of 10 means the model is as unsure as if it were choosing uniformly between 10 possibilities. PyTorch combines LogSoftmax and NLLLoss into a single module: nn.CrossEntropyLoss.
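The relationship between cross-entropy and perplexity can be verified numerically (a minimal single-prediction sketch in plain Python; frameworks compute the same quantity averaged over batches):

```python
import math

def cross_entropy(probs, target_index):
    """Negative log of the probability the model assigned to the correct class."""
    return -math.log(probs[target_index])

# A model choosing uniformly among 10 options has perplexity 10
uniform = [0.1] * 10
loss = cross_entropy(uniform, 0)
perplexity = math.exp(loss)
print(perplexity)  # 10.0 (up to floating-point rounding)

# A confident correct model has lower loss than an unsure one
confident = cross_entropy([0.99, 0.01], 0)
unsure = cross_entropy([0.51, 0.49], 0)
print(confident, unsure)
```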
Recurrent Neural Networks (RNN)
Feed-Forward Networks look at the current input in isolation; they have no concept of what happened before. RNNs maintain a "Hidden State" (a short-term memory) that carries over from the previous step, so they interpret the current input in the context of the past. RNNs feed their own output back into themselves: decisions are based on the current input and the previous state, and this loop allows the network to carry information forward through time.
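The recurrence can be sketched as a loop that updates a hidden state (a minimal scalar version with made-up weights; real RNNs use learned weight matrices and operate on vectors):

```python
import math

def rnn(inputs, w_in=0.5, w_hidden=0.8):
    """Process a sequence, carrying a hidden state forward through time."""
    h = 0.0                              # hidden state starts empty
    for x in inputs:
        # New state mixes the current input with the previous state
        h = math.tanh(w_in * x + w_hidden * h)
    return h

# The same final input produces different outputs depending on the past
print(rnn([0.0, 1.0]))
print(rnn([5.0, 1.0]))
```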
Foundation Models & Generative AI
Foundation models are large-scale machine learning models trained on massive and diverse datasets that can be adapted to a wide range of downstream tasks.
GPT stands for:
- Generative: creates output text, images, audio, or video.
- Pre-Trained: Trained on a vast array of content, often obtained from various Internet sources.
- Transformer: The model transforms sequences of input tokens into sequences of output tokens.
Main characteristics of foundation models include:
- Massive scale
- Self-supervised learning
- Few-shot / zero-shot learning
Generative AI models use deep neural networks to learn patterns in training data. They include:
- Language models: predict the next word in a sentence.
- Image models: predict pixel patterns or latent representations that match a given text prompt.
- Music models
- Multimodal models: Combine text, image, video, or audio.
GPT process (the attention mechanism is trained to determine which aspects of the prompt to focus on):
1. Parse input into tokens
2. Convert to word embeddings
3. Add positional encodings
4. Multi-head attention mechanism
5. Feed-forward neural network
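The attention step at the heart of the GPT pipeline can be sketched as scaled dot-product attention (a single head in plain Python with hypothetical toy vectors; real models apply learned projection matrices to produce the queries, keys, and values):

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention: each query mixes the values,
    weighted by how well it matches each key."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        # Softmax turns scores into weights that sum to 1
        exps = [math.exp(s) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]
result = attention(q, k, v)
# The query matches the first key best, so the output leans toward the first value
print(result)
```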
Evaluation of NLP Systems
Evaluating Generative AI and Question Answering Systems
Unlike classification tasks where a single "correct" answer exists, generative tasks require assessing nuance, factual accuracy, retrieval quality, and safety.
Evaluation requires rigorous testing using a combination of deterministic metrics, LLM-as-a-judge, and human evaluation.
The Shift from Gold Labels to Open-Ended Generation
In traditional machine learning, we compared a prediction to a single "Gold Label" (e.g., Sentiment Analysis: Positive vs Negative ). In Generative NLP, there is no single correct answer.
For example, if the user asks "Explain quantum physics," a 5-year-old’s explanation and a PhD-level explanation are both "correct" but serve different intents.
There is also the hallucination problem where generative models can produce text that is fluent and grammatically perfect but factually wrong.
To evaluate generative AI systems in the context of question answering use cases, we must evaluate two distinct components independently:
- Did the model use or find the right source information?
- Did the model generate, or write, a good answer based on that information?
Evaluation Metrics
- BLEU (BiLingual Evaluation Understudy): Precision-based. Counts how many n-grams in the candidate appear in the reference. Limitation: fails to capture synonyms. "The car is fast" and "The vehicle is quick" have low BLEU scores despite identical meaning.
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Recall-based. Common in summarization. ROUGE-L measures the longest common subsequence to check for sentence-structure similarity.
- METEOR: An improvement over BLEU that matches synonyms and stemmed words (e.g., "running" matches "run").
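The precision/recall distinction between BLEU and ROUGE can be sketched with unigram overlap only (a minimal illustration; real BLEU uses multiple n-gram orders, clipping, and a brevity penalty, and real ROUGE reports several variants):

```python
def unigram_overlap(candidate, reference):
    cand = candidate.split()
    ref = reference.split()
    # BLEU-flavored precision: of what the model wrote, how much is in the reference?
    precision = sum(1 for w in cand if w in ref) / len(cand)
    # ROUGE-flavored recall: of the reference, how much did the model cover?
    recall = sum(1 for w in ref if w in cand) / len(ref)
    return precision, recall

p, r = unigram_overlap("the car is fast", "the vehicle is quick")
print(p, r)  # 0.5 0.5 — synonyms score zero despite identical meaning
```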
Semantic & Model-Based Metrics (The Modern Standard)
These metrics use embeddings or models to measure meaning rather than exact word matches.
- BERTScore: Uses contextual embeddings (like BERT) to calculate the similarity between the generated text and reference text. It is robust to paraphrasing.
- Cosine Similarity: Converts both the generated answer and the reference answer into vectors and measures the angle between them.
- MQA-Metric: A newer model-based metric specifically designed for QA that prioritizes semantic accuracy over syntactic similarity.
The RAG Evaluation Framework
- Context Relevance (Precision of Retrieval): Did the retrieval system fetch useful information?
- Groundedness / Faithfulness (Hallucination Check): Is the answer supported entirely by the retrieved context? Use an LLM-as-a-Judge to verify that every claim in the generated answer has a citation in the source text.
- Answer Relevance (User Satisfaction): Did the system actually answer the user's question?
LLM-as-a-Judge
This technique can be automated: you use a stronger model to grade the outputs of a smaller model.
- Pros: Fast, scalable, and correlates better with human judgment than BLEU/ROUGE.
- Cons: Can be biased (LLMs prefer longer answers) and "generous" compared to human experts.
Human Evaluation (The Gold Standard)
Expensive and time-consuming. Approaches include:
- Expert Review
- Crowdsourcing
- A/B Testing
- Human vs. AI Reviewers
Standard Benchmarks
You should be familiar with the major datasets used to compare models:
- MMLU (Massive Multitask Language Understanding): Tests general knowledge across 57 subjects.
- TruthfulQA: Specifically designed to test if models mimic human falsehoods or generate hallucinations.
- HumanEval: A benchmark for code-generation capabilities.
- GPQA: "Google-Proof" Question Answering – questions so hard they cannot be answered by a simple search.