Tensors
Tensors: The Building Blocks of Machine Learning
A tensor is a generalization of vectors and matrices to higher dimensions, used by libraries like PyTorch and TensorFlow to encode language numerically.
The Hierarchy of Dimensions
- 0-D: Scalar
- 1-D: Vector (a word)
- 2-D: Matrix (a sentence)
- 3-D: Stack of matrices (several sentences)
Why Tensors?
- GPU Acceleration
- Automatic Differentiation (Autograd)
Tensors in PyTorch: Syntax & Operations
Creating Tensors
To use PyTorch:
import torch
tensor_a = torch.tensor([1, 2, 3])
random_tensor = torch.randn(2, 3)  # a 2x3 matrix (2 words, 3 features each)
The Three Most Important Attributes
- Shape (.shape): The dimensions of the tensor.
- Data Type (.dtype): The numeric type of the elements (e.g., float32, int64).
- Device (.device): Where the tensor is stored (CPU or GPU).
Common Tensor Operations in NLP
- view() / reshape(): Changing the dimensions.
- squeeze() / unsqueeze(): Removing or adding dimensions of size 1.
- cat(): Joining tensors together.
Representing Words as Numbers
Text Feature Extraction
Convert unstructured text into a numerical form to train machine learning models.
One-Hot Encodings
Each word in the vocabulary gets a vector with all zeros except for a single 1.
Bag of Words (BoW)
Represents text by the frequency of each word in the document, ignoring grammar and word order.
TF-IDF (Term Frequency – Inverse Document Frequency)
Enhances BoW by weighing words based on how unique they are across documents.
- Term Frequency (TF): How often a word appears in a document.
- Inverse Document Frequency (IDF): How rare the word is across all documents.
- Pros: Reduces the impact of common words like “the” or “is”.
- Cons: Still does not capture semantic meaning or word relationships.
- High TF-IDF: the word is important in a specific document but rare across all documents.
- Low TF-IDF: the word is common across documents (like stopwords), or appears rarely in this document.
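The TF-IDF weighting above can be sketched in plain Python. This is a minimal version (raw term frequency and an unsmoothed IDF); real libraries such as scikit-learn add smoothing and normalization:

```python
import math

def tf_idf(docs):
    """Compute a TF-IDF weight for every word in every document."""
    n = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = {}
    for doc in docs:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    scores = []
    for doc in docs:
        tf = {w: doc.count(w) / len(doc) for w in set(doc)}
        # IDF: words that are rare across the corpus get a higher weight
        scores.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return scores

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
scores = tf_idf(docs)
# "the" appears in every document, so its IDF (and hence TF-IDF) is 0
```

Note how the stopword "the" scores exactly zero, while "dog" (unique to one document) scores higher than "cat" (shared by two): this is the high/low TF-IDF behavior described above.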
Word Embeddings
Transform words into dense vector representations in a high-dimensional space where similar words are close together.
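"Close together" is usually measured with cosine similarity. A minimal sketch with hypothetical 3-dimensional toy vectors (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 = similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: "cat" and "dog" point in similar directions, "car" does not
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
car = [0.1, 0.0, 0.9]
print(cosine_similarity(cat, dog))  # high (close to 1)
print(cosine_similarity(cat, car))  # low
```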
Semantic Analysis and the Nature of Meaning
The Challenge of Word Sense Disambiguation (WSD)
In English, many words are polysemous: they have multiple meanings depending on context. Word Sense Disambiguation (WSD) is the task of determining which "sense" (meaning) of a word is being used in a given context.
The Distributional Hypothesis
"You shall know a word by the company it keeps." (J.R. Firth)
Word2Vec and GloVe: Under the Hood
- Word2Vec: Uses a predictive approach (predict a word from its context, or the context from a word).
- GloVe (Global Vectors): Uses a count-based approach (built from global word co-occurrence statistics).
The "Aha!" Moment: Analogies and Understanding
The "black box" of the neural network has internally organized language into a structure that mirrors human concepts of gender, royalty, and verb tense.
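The classic demonstration is vector arithmetic: king - man + woman lands near queen. A sketch with hypothetical 2-dimensional toy vectors chosen to make the arithmetic visible (real embeddings learn such offsets from data):

```python
# Toy 2-D embeddings: dimension 0 ≈ "royalty", dimension 1 ≈ "feminine gender"
embeddings = {
    "king":  [0.9, 0.1],
    "queen": [0.9, 0.9],
    "man":   [0.1, 0.1],
    "woman": [0.1, 0.9],
    "apple": [0.0, 0.0],
}

def analogy(a, b, c):
    """Return the vocabulary word closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    def dist(w):
        return sum((t - v) ** 2 for t, v in zip(target, embeddings[w]))
    # Exclude the query words themselves, as Word2Vec tooling does
    return min((w for w in embeddings if w not in (a, b, c)), key=dist)

print(analogy("king", "man", "woman"))  # queen
```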
Training and Using NLP Models
Linear Models
What is an AI “model”?
AI models estimate or predict outcomes: f̂(X) = ŷ
Don’t let perfect be the enemy of the good
The total error for any model is the sum of these components:
- Bias: error from a model that is too simple (a bad model; underfitting).
- Variance: error from over-sensitivity to the particular training data (overfitting).
- Irreducible Error: noise inherent in the data that no model can remove.
Model size and parameter count
Every machine learning course begins with linear models.
Every local model is linear
Gradient Descent is the fundamental optimization algorithm, based on derivatives, used to train deep learning networks. This calculus also forms the basis for XAI (Explainable AI) methods like LIME (Local Interpretable Model-agnostic Explanations).
The Local Limit
Once researchers introduced non-linear activation functions (like the sigmoid), neural networks could capture complex relationships; the Universal Approximation Theorem states that a network with a single hidden layer can approximate any continuous function.
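Gradient descent, described above, can be sketched on a one-parameter linear model: fit y = w·x by repeatedly stepping against the derivative of the squared error (a minimal example with made-up data where the true w is 3):

```python
# Data generated by y = 3x; gradient descent should recover w ≈ 3
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

w = 0.0      # initial guess
lr = 0.05    # learning rate
for _ in range(200):
    # Derivative of mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad   # step downhill

print(round(w, 3))  # 3.0
```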
Tensors: The Building Blocks of Machine Learning
A tensor is a generalization of vectors and matrices to potentially higher dimensions.
The Hierarchy of Dimensions
- 0-D Tensor (Scalar): A single number.
- 1-D Tensor (Vector): An array of numbers.
- 2-D Tensor (Matrix): An array of vectors.
- 3-D Tensor: A stack of matrices.
Why Tensors?
- GPU Acceleration
- Automatic Differentiation (Autograd) - PyTorch tensors can keep a history of the operations performed on them. This allows the framework to automatically calculate gradients (derivatives) needed for backpropagation to train neural networks.
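To illustrate what autograd does under the hood, here is a deliberately tiny tape-style sketch in plain Python (not PyTorch's actual implementation, which is written in C++ and far more general): each value remembers how it was made, so gradients can flow backwards through the history.

```python
class Value:
    """A scalar that records its history so gradients can be backpropagated."""
    def __init__(self, data, parents=(), grad_fns=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents      # values this one was computed from
        self._grad_fns = grad_fns    # local derivative w.r.t. each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def backward(self, grad=1.0):
        self.grad += grad
        for parent, fn in zip(self._parents, self._grad_fns):
            parent.backward(fn(grad))

x = Value(2.0)
y = Value(5.0)
z = x * y + x        # dz/dx = y + 1 = 6, dz/dy = x = 2
z.backward()
print(x.grad, y.grad)  # 6.0 2.0
```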
Tensors in PyTorch: Syntax & Operations
The Three Most Important Attributes
- Shape (.shape): The dimensions of the tensor.
- Data Type (.dtype): The numeric type of the elements (e.g., float32, int64).
- Device (.device): Where the tensor is stored (CPU or GPU).
Common Tensor Operations in NLP
- view() / reshape(): Changing the dimensions.
- squeeze() / unsqueeze(): Removing or adding dimensions of size 1.
- cat() (Concatenate): Joining tensors together.
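These operations have direct NumPy analogues, shown here because the shapes are the point (in PyTorch the corresponding calls are tensor.reshape, tensor.squeeze / tensor.unsqueeze, and torch.cat):

```python
import numpy as np

a = np.arange(6)                     # shape (6,)
m = a.reshape(2, 3)                  # like view()/reshape(): shape (2, 3)

b = m[np.newaxis, :, :]              # like unsqueeze(0): shape (1, 2, 3)
c = b.squeeze(0)                     # like squeeze(0): back to shape (2, 3)

d = np.concatenate([m, m], axis=0)   # like cat(): shape (4, 3)
```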
Neural Networks
A typical neural network consists of three types of layers:
- Input Layer: Receives the initial data (features) to be processed.
- Hidden Layers: One or more intermediate layers where most of the computation occurs.
- Output Layer: Produces the final prediction or classification.
How Neural Networks Learn
Neural networks learn implicitly through a process called backpropagation:
1. Forward pass: Input data travels through the network, producing a prediction.
2. Error calculation: The difference between the prediction and the actual value is calculated.
3. Backward pass: The error is propagated backwards through the network.
4. Weight adjustment: Connection weights are updated to reduce the error.
Why are Neural Networks so widely used?
Neural networks excel at:
- Finding complex, non-linear patterns in data
- Processing high-dimensional data like images, audio, and text
- Learning hierarchical features automatically
- Making predictions with high accuracy when properly trained
Neural networks represent a significant shift from our previous models:
- Unlike linear regression, neural networks can model highly non-linear relationships.
- Unlike decision trees, which create explicit decision boundaries, neural networks learn implicit boundaries that can take virtually any shape.
- Like random forests, neural networks can achieve high accuracy, but they do so through a fundamentally different approach.
Deep Learning
Traditional neural networks have just one or two hidden layers. Deep neural networks can have hundreds of layers. With each additional layer, deep neural networks can represent increasingly abstract concepts
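The forward pass / backward pass / weight adjustment loop described above can be sketched for a single sigmoid neuron (a minimal from-scratch example on a made-up one-sample task; real networks stack many layers and use frameworks like PyTorch):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy task: train one neuron to output 1.0 for input 1.0
w, b = 0.5, 0.0
lr = 1.0
x, target = 1.0, 1.0

for _ in range(1000):
    # Forward pass
    pred = sigmoid(w * x + b)
    # Error calculation (squared loss) and backward pass via the chain rule
    d_pred = 2 * (pred - target)
    d_z = d_pred * pred * (1 - pred)   # derivative through the sigmoid
    # Weight adjustment
    w -= lr * d_z * x
    b -= lr * d_z

print(sigmoid(w * x + b))  # approaches the target of 1.0
```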
Optimization Objectives vs. Evaluation Metrics
In Neural Network training, we have two different scorecards: Loss and Accuracy.
Accuracy
Did the model pick the correct class? Accuracy is not useful as a training objective for gradient-based optimization: it ignores confidence. If Model A gives the correct answer with 51% confidence and Model B gives the correct answer with 99% confidence, Accuracy views these models as identical (both 100% correct).
Loss
How confident was the model in the correct answer? Accuracy is like a True/False quiz; Loss is like the game Hot or Cold. Cross-Entropy is the gold-standard loss function. Perplexity is the exponentiation of the Cross-Entropy loss: it represents how "confused" the model is. A perplexity of 10 means the model is as unsure as if it were choosing uniformly between 10 possibilities. PyTorch combines LogSoftmax and NLLLoss into a single module: nn.CrossEntropyLoss.
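The relationship between cross-entropy and perplexity can be verified numerically (a minimal single-prediction sketch in plain Python; frameworks compute the same quantity averaged over batches):

```python
import math

def cross_entropy(probs, target_index):
    """Negative log of the probability the model assigned to the correct class."""
    return -math.log(probs[target_index])

# A model choosing uniformly among 10 options has perplexity 10
uniform = [0.1] * 10
loss = cross_entropy(uniform, 0)
perplexity = math.exp(loss)
print(perplexity)  # 10.0 (up to floating-point rounding)

# A confident correct model has lower loss than an unsure one
confident = cross_entropy([0.99, 0.01], 0)
unsure = cross_entropy([0.51, 0.49], 0)
print(confident, unsure)
```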
Recurrent Neural Networks (RNN)
Feed-Forward Networks look at the current input in isolation; they have no concept of what happened before. RNNs maintain a "Hidden State" (a short-term memory) that carries over from the previous step, so they interpret the current input in the context of the past. RNNs feed their own output back into themselves: decisions are based on the current input and the previous state, and this loop allows the network to carry information forward through time.
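The recurrence can be sketched as a loop that updates a hidden state (a minimal scalar version with made-up weights; real RNNs use learned weight matrices and operate on vectors):

```python
import math

def rnn(inputs, w_in=0.5, w_hidden=0.8):
    """Process a sequence, carrying a hidden state forward through time."""
    h = 0.0                              # hidden state starts empty
    for x in inputs:
        # New state mixes the current input with the previous state
        h = math.tanh(w_in * x + w_hidden * h)
    return h

# The same final input produces different outputs depending on the past
print(rnn([0.0, 1.0]))
print(rnn([5.0, 1.0]))
```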
Foundation Models & Generative AI
Foundation models are large-scale machine learning models trained on massive and diverse datasets that can be adapted to a wide range of downstream tasks.
GPT stands for:
- Generative: creates output text, images, audio, or video.
- Pre-Trained: Trained on a vast array of content, often obtained from various Internet sources.
- Transformer: The model transforms sequences of input tokens into sequences of output tokens.
Main characteristics of foundation models include:
- Massive scale
- Self-supervised learning
- Few-shot / zero-shot learning
Generative AI models use deep neural networks to learn patterns in training data. They include:
- Language models: predict the next word in a sentence.
- Image models: predict pixel patterns or latent representations that match a given text prompt.
- Music models
- Multimodal models: Combine text, image, video, or audio.
GPT process (the attention mechanism is trained to determine which aspects of the prompt to focus on):
1. Parse input into tokens
2. Convert to word embeddings
3. Add positional encodings
4. Multi-head attention mechanism
5. Feed-forward neural network
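The attention step at the heart of the GPT pipeline can be sketched as scaled dot-product attention (a single head in plain Python with hypothetical toy vectors; real models apply learned projection matrices to produce the queries, keys, and values):

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention: each query mixes the values,
    weighted by how well it matches each key."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        # Softmax turns scores into weights that sum to 1
        exps = [math.exp(s) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]
result = attention(q, k, v)
# The query matches the first key best, so the output leans toward the first value
print(result)
```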
Evaluation of NLP Systems
Evaluating Generative AI and Question Answering Systems
Unlike classification tasks where a single "correct" answer exists, generative tasks require assessing nuance, factual accuracy, retrieval quality, and safety.
Evaluation requires rigorous testing using a combination of deterministic metrics, LLM-as-a-judge, and human evaluation.
The Shift from Gold Labels to Open-Ended Generation
In traditional machine learning, we compared a prediction to a single "Gold Label" (e.g., Sentiment Analysis: Positive vs Negative ). In Generative NLP, there is no single correct answer.
For example, if the user asks "Explain quantum physics," a 5-year-old’s explanation and a PhD-level explanation are both "correct" but serve different intents.
There is also the hallucination problem where generative models can produce text that is fluent and grammatically perfect but factually wrong.
To evaluate generative AI systems in the context of question answering use cases, we must evaluate two distinct components independently:
- Did the model use or find the right source information?
- Did the model generate, or write, a good answer based on that information?
Evaluation Metrics
- BLEU (BiLingual Evaluation Understudy): Precision-based. Counts how many n-grams in the candidate appear in the reference. Limitation: fails to capture synonyms. "The car is fast" and "The vehicle is quick" have low BLEU scores despite identical meaning.
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Recall-based. Common in summarization. ROUGE-L measures the longest common subsequence to check for sentence-structure similarity.
- METEOR: An improvement over BLEU that matches synonyms and stemmed words (e.g., "running" matches "run").
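The precision/recall distinction between BLEU and ROUGE can be sketched with unigram overlap only (a minimal illustration; real BLEU uses multiple n-gram orders, clipping, and a brevity penalty, and real ROUGE reports several variants):

```python
def unigram_overlap(candidate, reference):
    cand = candidate.split()
    ref = reference.split()
    # BLEU-flavored precision: of what the model wrote, how much is in the reference?
    precision = sum(1 for w in cand if w in ref) / len(cand)
    # ROUGE-flavored recall: of the reference, how much did the model cover?
    recall = sum(1 for w in ref if w in cand) / len(ref)
    return precision, recall

p, r = unigram_overlap("the car is fast", "the vehicle is quick")
print(p, r)  # 0.5 0.5 — synonyms score zero despite identical meaning
```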
Semantic & Model-Based Metrics (The Modern Standard)
These metrics use embeddings or models to measure meaning rather than exact word matches.
- BERTScore: Uses contextual embeddings (like BERT) to calculate the similarity between the generated text and reference text. It is robust to paraphrasing.
- Cosine Similarity: Converts both the generated answer and the reference answer into vectors and measures the angle between them.
- MQA-Metric: A newer model-based metric specifically designed for QA that prioritizes semantic accuracy over syntactic similarity.
The RAG Evaluation Framework
- Context Relevance (Precision of Retrieval): Did the retrieval system fetch useful information?
- Groundedness / Faithfulness (Hallucination Check): Is the answer supported entirely by the retrieved context? Use an LLM-as-a-Judge to verify that every claim in the generated answer has a citation in the source text.
- Answer Relevance (User Satisfaction): Did the system actually answer the user's question?
LLM-as-a-Judge
This technique can be automated: you use a stronger model to grade the outputs of a smaller model.
- Pros: Fast, scalable, and correlates better with human judgment than BLEU/ROUGE.
- Cons: Can be biased (LLMs prefer longer answers) and "generous" compared to human experts.
Human Evaluation (The Gold Standard)
Expensive and time-consuming. Approaches include:
- Expert Review
- Crowdsourcing
- A/B Testing
- Human vs. AI Reviewers
Standard Benchmarks
You should be familiar with the major datasets used to compare models:
- MMLU (Massive Multitask Language Understanding): Tests general knowledge across 57 subjects.
- TruthfulQA: Specifically designed to test if models mimic human falsehoods or generate hallucinations.
- HumanEval: A benchmark for code-generation capabilities.
- GPQA: "Google-Proof" Question Answering – questions so hard they cannot be answered by a simple search.