These are notes from Natural Language Processing: A Textbook with Python Implementation by Raymond Lee.

Definitions
Bigram - 2 words. N=2. I am. He is. We are. Used frequently.
Bigram normalization - division of each bigram count by the appropriate unigram count for wn-1
Corpus - a collection of written texts
Language model - traditional word counting model to count and calculate conditional probability to predict the probability based on a word sequence
Lemma - Basic word shared by word forms. In the dictionary.
Lemmatization - Get the lemma. Slow. Am, is, are, was, were → Be
Morpheme - Smallest meaningful component of a word.
Morphology - Study of words.
N-gram - statistical model consisting of a sequence of N words.
Quadrigram - 4 words. N=4.
Sentence - unit of written language; domain specific
Stem - Root of a word. Might not be in the dictionary.
Stemming - Get the word's root. Fast. Student, studious, study → Stud
Tokens - can be meaningful words or symbols, punctuation, or distinct characters
Tuple - Ordered, immutable collection of elements.
Trigram - 3 words. N=3. I am good. See spot run. We go there. Used rarely.
Types/Word Types - distinct words in a corpus
Unigram - 1 word. N=1. Seldom used.
Utterance - unit of spoken language; domain and culture specific
Word Form - another basic entity in a corpus
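
The unigram/bigram/trigram definitions above can be sketched with a small helper (pure Python; the function name is illustrative, not from the book):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I am very good".split()
bigrams = ngrams(tokens, 2)   # [('I', 'am'), ('am', 'very'), ('very', 'good')]
trigrams = ngrams(tokens, 3)  # [('I', 'am', 'very'), ('am', 'very', 'good')]
```

NLTK provides the same idea as `nltk.util.ngrams`.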
Abbreviations
ASR - Automatic Speech Recognition
ATN - Augmented Transition Network
BERT - Bidirectional Encoder Representations from Transformers
CDD - Conceptual Dependency Diagram
CFG - Context-Free Grammar
CFL - Context-Free Language
CNF - Chomsky Normal Form
CNN - Convolutional Neural Network
CR - Coreference Resolution
DT - Determiner
FOPC - First-Order Predicate Calculus
GPT - Generative Pre-trained Transformer
GRU - Gated Recurrent Unit
HMM - Hidden Markov Model
IE - Information Extraction
IR - Information Retrieval
LLM - Large Language Model
LME - Language Model Evaluation
LSTM - Long Short-Term Memory
MEMM - Maximum Entropy Markov Model
MLE - Maximum Likelihood Estimation
MeSH - Medical Subject Headings
NER - Named Entity Recognition
NLTK - Natural Language Toolkit
NLU - Natural Language Understanding
NN - Noun
NNP - Proper Noun
Nom - Nominal
NP - Noun Phrase
PCFG - Probabilistic Context-Free Grammar
PMI - Pointwise Mutual Information
POS - Part of Speech
POST - Part-of-Speech Tagging
PPMI - Positive Pointwise Mutual Information
RLHF - Reinforcement Learning from Human Feedback
RNN - Recurrent Neural Network
TBL - Transformation-Based Learning
VB - Verb
VP - Verb Phrase
WSD - Word Sense Disambiguation
1. Natural Language Processing

1.1 Introduction

Modern chatbots represent human-computer interaction and require world knowledge and domain knowledge.

Knowledge is organized into knowledge trees or ontology graphs.

NLP is cross-disciplinary integration of philosophy, psychology, linguistics, and computational linguistics.

computational linguistics - multidisciplinary study of epistemology, philosophy, psychology, cognitive science, and agent ontology.

1.2 Human Language and Intelligence

Core technologies and methodologies arose from the Turing Test.

Human language is categorized as written and oral.

1.3 Linguistic Levels of Human Language

6 (7?) levels of linguistic analysis: phonetics, phonology, morphology, lexicology, syntax, semantics, and pragmatics.

1.4 Human Language Ambiguity

  • Lexical ambiguity- words can have different meanings
  • Syntactic- who is holding the bag?
  • Semantic- what does "it" mean in a sentence?
  • Pragmatic- what does the sentence mean?
1.5 A Brief History of NLP

1.5.1 First Stage: Machine Translation (Before the 1960s)

Leibniz and Descartes codified relationships between words and sentences.

Alan Turing wrote Computing Machinery and Intelligence in 1950 which proposed the Turing test.

The Georgetown-IBM experiment automatically translated more than sixty Russian sentences into English in 1954.

Chomsky wrote Syntactic Structures in 1957.

The ALPAC report threw cold water on NLP research in 1966.

1.5.2 Second Stage: Early AI on NLP (1960s-1970s)

The Baseball Q&A program was developed in 1961.

Marvin Minsky wrote Semantic Information Processing in 1968.

William Woods developed augmented transition networks (ATN) in 1970.

1.5.3 Third Stage: Grammatical Logic on NLP (1970s-1980s)

1.5.4 Fourth Stage: AI and Machine Learning (1980s-2000s)

IBM began developing Watson, a DeepQA program which could compete on Jeopardy.

1.5.5 Fifth Stage: Rise of BERT, Transformer, ChatGPT, and LLMs (2000s-Present)

Long short-term memory (LSTM) recurrent neural networks (RNN) became dominant.

Google Brain published "Attention Is All You Need" in 2017, which introduced the transformer architecture.

Google introduced BERT in 2018.

OpenAI developed ChatGPT which utilizes generative pre-trained transformers.

1.6 NLP and AI

Not informative.

1.7 Main Components of NLP

NLP consists of: Natural Language Understanding (NLU), Knowledge Acquisition and Inferencing (KAI), and Natural Language Generation (NLG).

1.8 Natural Language Understanding (NLU)

NLU is a process of understanding spoken language in four stages: speech recognition, and syntactic, semantic, and pragmatic analysis.

1.8.1 Speech Recognition

Speech recognition is the first stage in NLU that performs phonetic, phonological, and morphological processing to analyze spoken language.

Breaks spoken words (utterances) down into distinct tokens representing paragraphs, sentences, and words.

Current speech recognition models use spectrogram analysis.

1.8.2 Syntax Analysis

Reject phrases like, "I you love."

1.8.3 Semantic Analysis

Reject nonsense like, "hot snowflakes".

1.8.4 Pragmatic Analysis

Requires expert knowledge or just common sense.

1.9 Potential Applications of NLP

NLP is used for translation, information extraction (IE), information retrieval (IR), sentiment analysis, and chatbots.

1.9.1 Machine Translation (MT)

Earliest NLP application.

Although it is not difficult to translate one language to another...(?!)

1.9.2 Information Extraction (IE)

Extract key language information from texts or utterances automatically.

1.9.3 Information Retrieval (IR)

Organize, retrieve, store, and evaluate information from documents and multimedia.

1.9.4 Sentiment Analysis

Analyze user sentiment toward products, people, and ideas from social media, forums, and online platforms.

1.9.5 Question-Answering (Q&A) Chatbots

Not informative

Errata

2. N-Gram Language Model

2.1 Introduction

2.2 N-Gram Language Model

The N-gram language model predicts words using probabilities. An N-gram is a statistical model consisting of a word sequence in N number (unigram, bigram, trigram...).

2.2.1 Basic NLP Terminology

Sentence - unit of written language; domain specific
Utterance - unit of spoken language; domain and culture specific
Word Form - another basic entity in a corpus
Types/Word Types - distinct words in a corpus
Tokens - can be meaningful words or symbols, punctuation, or distinct characters
Stem - Root of a word. Might not be in the dictionary.
Stemming - Get the word's root. Fast. Student, studious, study → Stud
Lemma - Basic word shared by word forms. In the dictionary.
Lemmatization - Get the lemma. Slow. Am, is, are, was, were → Be
Corpora
Google - Over a trillion English tokens with over a million meaningful wordform types, sufficient to generate sentences/utterances for daily use
Brown Corpus - First well-organized corpus. Brown University. 1961. Over 583 million tokens, 293,181 wordform types and foreign words.
Wall Street Journal - Financial domain
Associated Press - International news
Hansard - British parliamentary speeches
BU Broadcast News Corpus
NLTK Corpus

A language model is a traditional word counting model to count and calculate conditional probability to predict the probability based on a word sequence.

2.2.2 Language Modeling and Chain Rule

Conditional probability
P(A|B) - probability of A given B.
P(A∩B) - probability of both A and B.

Chain rule
P(A|B) = P(A∩B) / P(B)
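
The chain rule above reduces to a ratio of counts in a word-counting language model. A small numeric sketch (the counts are made up for illustration):

```python
# Hypothetical counts from a tiny corpus: how often "am" follows "I".
count_I = 10      # occurrences of the unigram "I"
count_I_am = 7    # occurrences of the bigram "I am"

# P(am | I) = P(I ∩ am) / P(I), which reduces to a ratio of counts.
p_am_given_I = count_I_am / count_I
print(p_am_given_I)  # 0.7
```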

2.3 Markov Chain in N-Gram Model

A Markov chain is a process that describes a sequence of possible events where the probability of each event depends only on the previous event.

Entire sentences can be modelled as Markov chains.

2.4 Example: The Adventures of Sherlock Holmes

Maximum Likelihood Estimation (MLE) is another method to calculate the N-gram model.

Bigram normalization is the division of each bigram count by the unigram count of the preceding word wn-1.
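
Bigram normalization (the MLE estimate) can be sketched in a few lines; the toy corpus here is made up for illustration:

```python
from collections import Counter

tokens = "i am sam sam i am i do not like green eggs".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_mle(w_prev, w):
    """MLE estimate: count(w_prev, w) / count(w_prev)."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

# "i am" occurs 2 times and "i" occurs 3 times, so P(am | i) = 2/3.
print(bigram_mle("i", "am"))
```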

2.5 Shannon’s Method in N-Gram Model

  1. Choose a random N-gram (<s>, w) according to its probability
  2. Now choose a random N-gram (w, x) according to its probability
  3. And so on until we choose </s>
  4. String the words together into a sentence

Constructing sentences from trigrams seems to be most suitable.
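
Steps 1-4 of Shannon's method can be sketched with a bigram successor table (the toy corpus and function name are illustrative):

```python
import random
from collections import defaultdict

# Toy corpus with sentence markers <s> and </s>.
corpus = [["<s>", "i", "am", "good", "</s>"],
          ["<s>", "i", "am", "here", "</s>"],
          ["<s>", "we", "are", "good", "</s>"]]

# Duplicates in each successor list make random.choice frequency-weighted.
successors = defaultdict(list)
for sent in corpus:
    for w1, w2 in zip(sent, sent[1:]):
        successors[w1].append(w2)

def shannon_sentence(seed=0):
    """Start at <s>, sample successors by relative frequency,
    stop at </s>, and string the words into a sentence."""
    rng = random.Random(seed)
    word, out = "<s>", []
    while True:
        word = rng.choice(successors[word])
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

print(shannon_sentence())
```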

2.6 Language Model Evaluation and Smoothing Techniques

Language Model Evaluation (LME) is the standard method of training parameters on a training set and constantly reviewing model performance on new data.

In practice, the model is fitted to data it has seen, called training data (the training set), and then checked against unseen information, called test data (the test set), to see whether it still works.

Can deal with unknown words.

2.6.1 Perplexity

Perplexity (PP) is the inverse probability of the test set assigned by the language model, normalized by the number of words.

minimizing perplexity is the same as maximizing probability for model performance
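
The definition above, PP(W) = P(w1...wN)^(-1/N), can be computed directly; log space avoids underflow on long sequences:

```python
import math

def perplexity(probs):
    """PP = (product of 1/p_i)^(1/N), computed in log space for stability."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)

# A model that assigns every word probability 1/4 has perplexity 4:
# it is as confused as a uniform choice among 4 words.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```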

2.6.2 Extrinsic Evaluation Scheme

Simply compare two models’ application performance regarding N-gram evaluation.

Time consuming

2.6.3 Zero Counts Problems

Many possible bigrams never show up in training data.

Zipf's Law- in a large corpus, the frequency of any word is inversely proportional to its rank, where the most frequent word occurs twice as often as the second, and thrice as often as the third.

2.6.4 Smoothing Techniques

Smoothing techniques compensate for Zipf's Law.

2.6.5 Laplace (Add-One) Smoothing

Add 1 to all possible unigrams and bigrams.
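
Add-one smoothing changes the bigram estimate to (count + 1) / (count(w_prev) + V), where V is the vocabulary size. A minimal sketch with a made-up corpus:

```python
from collections import Counter

tokens = "i am sam sam i am".split()
V = len(set(tokens))                      # vocabulary size (3 here)
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_laplace(w_prev, w):
    """Add-one: (count(w_prev, w) + 1) / (count(w_prev) + V)."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

print(p_laplace("i", "am"))   # seen bigram: (2+1)/(2+3) = 0.6
print(p_laplace("am", "i"))   # unseen bigram gets a small non-zero mass: 0.2
```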

2.6.6 Add-k Smoothing

Add a fractional count k (0 < k < 1) to all possible N-grams.

2.6.7 Backoff and Interpolation Smoothing

2.6.8 Good Turing Smoothing

Errata

3. Part-of-Speech (POS) Tagging

3.1 What Is Part of Speech (POS)?

  • Category of words that have similar grammatical behaviors or properties.
  • Inflection- items are added to the base form of a word to convey grammatical meanings (e.g., cat and cats)

3.1.1 Nine Major POS in the English Language

  1. adjectives
  2. verbs
  3. pronouns
  4. conjunctions
  5. prepositions
  6. articles
  7. adverbs
  8. nouns
  9. interjections

3.2 POS Tagging

3.2.1 What Is POS Tagging in Linguistics?

Labelling a word according to a particular POS based on definition and contexts in linguistics.

3.2.2 What Is POS Tagging in NLP?

Automatic description assignment to words or tokens.

The operation of converting a sentence into a list of words and then a list of tuples, where each tuple is a (word, tag) pair signifying the POS.

3.2.3 POS Tags Used in the PENN Treebank Project

POS tag databank provided by the PENN Treebank corpus

classifies nine major POS into subclasses that have a total of 45 POS tags

Penn Treebank (PTB) corpus has a comprehensive section of WSJ articles

3.2.4 Why Do We Care About POS in NLP?

3.3 Major Components in NLU

  1. Morphology- understanding the shapes and patterns of every word in a sentence.
  2. POS tagging
  3. Syntax
  4. Semantics
  5. Discourse integration- relationships between different sentences and their contents.

3.3.1 Computational Linguistics and POS

POS tagging can be considered as the fundamental process in computational linguistics

3.3.2 POS and Semantic Meaning

3.3.3 Morphological and Syntactic Definition of POS

3.4 Nine Key POS in English

  1. pronoun
  2. verb
  3. adjective (quick)
  4. adverb (quickly)
  5. interjection (Hey!)
  6. noun
  7. conjunction (and, or, but)
  8. preposition (in, on, at)
  9. article (a, the)

3.4.1 English Word Classes

Two types of English word classes: closed and open.

Closed-class words are also known as functional/grammar words. They are closed since new words are seldom created in the class. (conjunctions, determiners, pronouns, and prepositions)

Open-class words are also known as lexical/content words. They are open because new words are frequently added to the class; their meanings can be found in dictionaries and interpreted individually. (nouns, verbs, adjectives, and adverbs)

3.4.2 What Is a Preposition?

used before nouns

approximately 80-100 prepositions in English

3.4.3 What Is a Conjunction?

Coordinating conjunctions join words, clauses, or phrases of equal grammatical rank. (and, but, for, nor, or, yet)

Subordinating conjunctions join independent and dependent clauses to present a relationship (as, although, because, since, though, while, whereas)

3.4.4 What Is a Pronoun?

3.4.5 What Is a Verb?

3.5 Different Types of POS Tagset

3.5.1 What Is Tagset?

A tagset is a collection of POS tags used to indicate the part of speech and sometimes other grammatical categories such as case and tense.

3.5.2 Ambiguity in POS Tags

  • Noun-verb ambiguity. (record)
  • Adjective-verb ambiguity. (perfect)
  • Adjective-noun ambiguity. (complex)

3.5.3 POS Tagging Using Knowledge

four knowledge sources for POS tagging
  1. dictionary
  2. morphological rules (capitalization, suffixes...)
  3. N-gram frequencies (next word prediction)
  4. combinations of structural relationships (figure it out)

3.6 Approaches for POS Tagging

3 basic approaches to POS Tagging: Rule-based, Stochastic-based, and Hybrid Tagging

3.6.1 Rule-Based Approach POS Tagging

two-stage process: (1) a dictionary lists all possible POS tags for each word; (2) words left ambiguous by more than one possible tag are resolved with handwritten grammatical rules that assign the correct tag according to surrounding words

rule generation can be achieved by (1) hand creation and (2) training from a corpus with machine learning.
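The two-stage process can be sketched in pure Python (the mini-lexicon and the single disambiguation rule are hypothetical, chosen only to illustrate the idea):

```python
# Stage 1: a dictionary listing every possible POS tag per word.
LEXICON = {
    "the": ["DT"],
    "record": ["NN", "VB"],   # noun-verb ambiguous
    "play": ["VB", "NN"],
}

def tag(words):
    """Stage 2: resolve ambiguity with a handwritten rule —
    after a determiner (DT), prefer the noun (NN) reading."""
    tags = []
    for i, w in enumerate(words):
        candidates = LEXICON.get(w, ["NN"])   # unknown words default to NN
        if len(candidates) > 1 and i > 0 and tags[-1] == "DT" and "NN" in candidates:
            tags.append("NN")
        else:
            tags.append(candidates[0])
    return list(zip(words, tags))

print(tag(["the", "record"]))  # [('the', 'DT'), ('record', 'NN')]
```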

3.6.2 Example of Rule-Based POS Tagging

3.6.3 Example of Stochastic-Based POS Tagging

3.6.4 Hybrid Approach for POS Tagging Using Brill’s Taggers

3.6.5 What Is Transformation-Based Learning?

3.6.6 Hybrid POS Tagging: Brill’s Tagger

3.6.7 Learning Brill’s Tagger Transformations

3.7 Taggers Evaluations

3.7.1 How Good Is a POS Tagging Algorithm?

Errata

4. Syntax and Parsing

4.​1 Introduction and Motivation

Grammar rules are used to create sentences with correct syntax.

Syntax analysis is used to analyze the structure and the relationship between tokens to create a parse tree.

4.​2 Syntax Analysis

4.​2.​1 What Is Syntax

Syntax is the set of rules that govern how groups of words are combined to form phrases, clauses, and sentences.

Syntax can be defined as the correct arrangement of word tokens in written or spoken sentences

4.​2.​2 Syntactic Rules

Syntactic rules describe how the parts of a language combine so that sentences make sense.

4.​2.​3 Common Syntactic Patterns

There are seven common syntactic patterns:
  1. Subject → Verb
  2. Subject → Verb → Direct Object
  3. Subject → Verb → Subject Complement
  4. Subject → Verb → Adverbial Complement
  5. Subject → Verb → Indirect Object → Direct Object
  6. Subject → Verb → Direct Object → Direct Complement
  7. Subject → Verb → Direct Object → Adverbial Complement

4.​2.​4 Importance of Syntax and Parsing in NLP

5 major components in Natural Language Understanding (NLU):
  1. Morphology
  2. POS Tagging
  3. Syntax
  4. Semantics
  5. Discourse Integration

4.​3 Types of Constituents in Sentences

4.​3.​1 What Is Constituent?​

A constituent is considered as the linguistic component of a language.

Words or phrases that combine into a sentence are constituents.

4.​3.​2 Kinds of Constituents

  1. noun-phrase
  2. verb-phrase
  3. preposition-phrase

4.3.2.1 Noun Phrase (NP)

consists of a noun and its modifiers

4.3.2.2 Verb Phrase (VP)

consists of a main verb accompanied by linking verbs or modifiers

4.​3.​3 Complexity on Simple Constituents

"Big red handbag" makes sense; "red big handbag" does not.

4.3.4 Verb Phrase Subcategorization

Traditional English grammar classifies verbs into transitive (object) and intransitive subcategories; modern English grammars identify more than 100 subcategories.

4.​3.​5 The Role of Lexicon in Parsing

A lexicon is the vocabulary of a language or a specific field of knowledge.

Linguists believe that all languages are composed of two major components: (1) lexicon and (2) grammar.

Items within a lexicon are called lexemes, and groups of lexemes are called lemmas, often used to describe the size of a lexicon.

Lexical analysis- understand what words mean and intuit contexts.

A program that performs such lexical analysis is called a tokenizer, lexer, or scanner.

A lexer is generally combined with a parser to analyze the syntax of text.

4.​3.​6 Recursion in Grammar Rules

4.​4 Context-Free Grammar (CFG)

4.​4.​1 What Is Context Free Language (CFL)?​

Context-free language (CFL) is a superset of regular language (RL), generated by context-free grammar (CFG); every RL is a CFL, but not every CFL is an RL.

Where CFL sits in the hierarchy:
  1. Recursively enumerable languages form the largest class.
  2. Context-sensitive languages are a subset of recursively enumerable languages.
  3. CFLs are subsets of context-sensitive languages.
4 levels of human language:
  1. regular
  2. context-free
  3. context-sensitive
  4. recursively enumerable

Most arithmetic expressions generated by a Context-Free Grammar (CFG) are CFLs.

4.​4.​2 What Is Context Free Grammar (CFG)?​

A CFG describes a CFL as a set of recursive rules for generating string patterns.

CFG is commonly applied in linguistics and compiler design to describe programming languages, and parsers can be created from it automatically.

4.​4.​3 Major Components of CFG

4 major components-
  1. A set of non-terminal symbols N
  2. A set of terminal symbols Σ
  3. A set of production rules P
  4. A designated start symbol S
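The four components can be sketched as a toy grammar plus a recursive generator (the productions below are made up for illustration, not the book's fragment of English):

```python
import random

# N = {S, NP, VP, DT, NN, VB}, Σ = the terminal words,
# P = the productions below, start symbol S.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["DT", "NN"]],
    "VP": [["VB", "NP"], ["VB"]],
    "DT": [["the"]],
    "NN": [["dog"], ["piano"]],
    "VB": [["plays"]],
}

def generate(symbol, rng):
    """Expand non-terminals recursively until only terminals remain."""
    if symbol not in GRAMMAR:          # terminal symbol
        return [symbol]
    production = rng.choice(GRAMMAR[symbol])
    return [word for part in production for word in generate(part, rng)]

print(" ".join(generate("S", random.Random(0))))
```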

4.​4.​4 Derivations Using CFG

4.​5 CFG Parsing

3 CFG parsing levels: morphological, phonological, and syntactic

4.​5.​1 Morphological Parsing

Determines the morphemes from which a word is constructed (cats is the plural of cat).

4.​5.​2 Phonological Parsing

Phonological parsing interprets sounds as words and phrases.

4.​5.​3 Syntactic Parsing

identify relevant components and correct grammar of a sentence.

4.​5.​4 Parsing as a Kind of Tree Searching

4.​5.​5 CFG for Fragment of English

4.​5.​6 Parse Tree for “Play the Piano” for Prior CFG

4.​5.​7 Top-Down Parser

4.​5.​8 Bottom-Up Parser

4.​5.​9 Control of Parsing

4.​5.​10 Pros and Cons of Top-Down vs.​ Bottom-Up Parsing

4.​6 Lexical and Probabilistic Parsing

4.​6.​1 Why Using Probabilities in Parsing?​

4.​6.​2 Semantics with Parsing

4.​6.​3 What Is PCFG?​
A probabilistic context-free grammar (PCFG) is a context-free grammar that associates each of its production rules with a probability.
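Because each rule carries a probability, the probability of a whole parse tree is the product of the probabilities of the rules used in its derivation. A sketch with made-up rule probabilities:

```python
# Hypothetical PCFG rule probabilities for deriving "the dog sleeps".
rule_probs = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("DT", "NN")): 0.5,
    ("DT", ("the",)): 1.0,
    ("NN", ("dog",)): 0.2,
    ("VP", ("VB",)): 0.3,
    ("VB", ("sleeps",)): 0.1,
}

# Each rule used in the derivation contributes one factor.
p_tree = 1.0
for rule in rule_probs:
    p_tree *= rule_probs[rule]

print(p_tree)  # 1.0 * 0.5 * 1.0 * 0.2 * 0.3 * 0.1 = 0.003
```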

4.​6.​4 A Simple Example of PCFG

4.​6.​5 Using Probabilities for Language Modelling

4.​6.​6 Limitations for PCFG

4.​6.​7 The Fix–Lexicalized Parsing

Errata

5 Meaning Representation

5.​1 Introduction

This chapter will focus on how to interpret meaning, introducing scientific and logical methods for processing meaning, known as meaning representation.

5.​2 What Is Meaning?​

Meaning refers to the message conveyed by words, phrases, and sentences or utterances within a given context.

Often referred to as lexical or semantic meaning.

Meanings of sentences are not simply the combination of words' meanings, but usually the phrasal words with specific meanings at the pragmatic level (off the wagon).

Semantic meaning is the study of meaning assignment to minimal meaning-bearing elements.

5.​3 Meaning Representations

Unlike parse trees, meaning representations are not primarily a description of the input's structure but a representation of how humans understand and represent things (such as actions, events, objects, etc.).

5 types of meaning representation:

  1. categories- specific entities
  2. events
  3. time- specific moment
  4. aspects
    1. Stative- facts
    2. Activity
    3. Accomplishment?
    4. Achievement?
  5. beliefs, desires, and intentions

5.​4 Semantic Processing

Semantic processing encodes and interprets meaning.
  1. Reason relations with the environment.
  2. Answer questions based on contents.
  3. Perform inference based on knowledge and determine the verity of unknown facts.

5.​5 Common Meaning Representation

4 common methods of meaning representation.

5.5.1 First-Order Predicate Calculus (FOPC)

Also known as predicate logic.

Expresses the relationship between information objects as predicates.

5.​5.​2 Semantic Networks

Knowledge representation techniques used for propositional information. A semantic net can be represented as a labeled directed graph.

5.​5.​3 Conceptual Dependency Diagram (CDD)

Theory that describes how sentence/utterance meaning is represented for reasoning.

5.​5.​4 Frame-Based Representation

Frames and notions as basic components to characterize domain knowledge.

A frame is a knowledge configuration to characterize a concept such as a car or driving a car. (like an object with properties)

5.​6 Requirements for Meaning Representation

5.​6.​1 Verifiability

determining whether a sentence/utterance has a literal meaning

5.​6.​2 Ambiguity

word, statement, or phrase that consists of more than one meaning

5.​6.​3 Vagueness

borderline cases (short, tall,...)

5.​6.​4 Canonical Forms

5.6.4.1 What Is Canonical Form?

Simplest expression. Instead of 0.000, just say 0.

5.6.4.2 Canonical Form in Meaning Representation

Instead of "Jack likes to eat candy all day.", just say "Jack eats candy."

5.6.4.3 Canonical Forms: Pros and Cons

Advantages- (1) simplify reasoning and storage operations; (2) no need to generate inference rules for all the different variations.

May complicate semantic analysis.

5.​7 Inference

5.​7.​1 What Is Inference?​

deduction and induction

Induction goes from specific to general; Deduction goes from general to specific.

Induction builds theories. Deduction tests them.

5.​7.​2 Example of Inferencing with FOPC

5.​8 Fillmore’s Theory of Universal Cases

Case grammar is a linguistic system that focuses on the connection between the valence of a verb, the quantities such as subject and object it takes, and the grammatical context.

Only a limited number of semantic roles (case roles) occur in every sentence constructed with verbs.

5.​8.​1 What Is Fillmore’s Theory of Universal Cases?​

Each verb needs a certain number of case roles to form a case-frame

Thematic role (semantic role) refers to the case role that a noun phrase (NP) may play with respect to the action or state described by the main verb.

5.​8.​2 Major Case Roles in Fillmore’s Theory

  1. Agent—doer of the action, with attributed intention.
  2. Experiencer—doer of the action, without intention.
  3. Theme—thing that undergoes change or is acted upon.
  4. Instrument—tool used to perform the action.
  5. Beneficiary—person or thing for which the action is performed.
  6. To/At/From Loc/Poss/Time—place, time, or possession.

5.​8.​3 Complications in Case Roles

4 types of complications in case role analysis:
  1. Syntactic constituents’ ability to indicate semantic roles in several cases
  2. Syntactic expression option availability
  3. Prepositional ambiguity not always introduces the same role
  4. Role options in a sentence

5.8.3.1 Selectional Restrictions

Selectional restrictions are methods to restrict types of certain roles to be used for semantic consideration.

5.​9 First-Order Predicate Calculus

5.​9.​1 FOPC Representation Scheme

FOPC can be used as a framework to derive semantic representation of a sentence.

FOPC supports:
  1. Reasoning via truth-condition analysis to answer yes/no questions.
  2. Variables for general cases through variable binding in responses and storage.
  3. Inference to answer with new knowledge beyond what is stored in the KB.

5.​9.​2 Major Elements of FOPC

  1. terms
    1. constants- specific objects described in the sentence
    2. functions- concepts expressed as genitives (e.g., the owner), such as a brand name or location
    3. variables- placeholders that leave open which object is referred to
  2. predicates- property of a subject; usually verbs
  3. connectives- conjunctions, implications (if...then), negations
  4. quantifiers

5.​9.​3 Predicate-Argument Structure of FOPC

5.​9.​4 Meaning Representation Problems in FOPC

5.​9.​5 Inferencing Using FOPC

Inference- validate or prove whether a proposition is true or false from a KB.

Modus Ponens (MP)- given P and P→Q, conclude Q.
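
Repeated Modus Ponens over a knowledge base is forward chaining. A tiny sketch (the propositions are made up for illustration):

```python
# Facts and implications (P → Q) in a toy knowledge base.
facts = {"is_raining"}
rules = [("is_raining", "ground_is_wet"),
         ("ground_is_wet", "shoes_get_muddy")]

changed = True
while changed:                 # apply Modus Ponens until nothing new follows
    changed = False
    for p, q in rules:
        if p in facts and q not in facts:
            facts.add(q)       # P and P→Q, therefore Q
            changed = True

print(facts)
```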

Errata

8. Transfer Learning and Transformer Technology

8.​1 What Is Transfer Learning?​

Transfer learning (TL) involves solving a problem by leveraging acquired knowledge and applying that knowledge to address another related problem

8.​2 Motivation of TL

Traditional ML datasets and trained model parameters cannot be reused.

8.​2.​1 Categories of TL

Heterogeneous and Homogeneous

8.​3 Solutions of TL

8.​3.​1 Instance-Based Method

reweights samples from source domains

8.​3.​2 Feature-Based Method

  • works for both heterogeneous and homogeneous TL problems
  • Asymmetric feature transformation aims to modify the source domain and reduce the gap between source and target instances by transforming one of the source and target domains to the other
  • Symmetric feature transformation aims to transform source and target domains into their shared feature space
8.3.3 Parameter-Based Method

  • transfers learned knowledge by sharing parameters common to the models of source and target learners
  • trained model is transferred from source domain to target domain with parameters
  • This approach can train more than one model on the source data and combine parameters learned from all models to improve results of the target learner.

8.3.4 Relational-Based Method

  • transfers learned knowledge by sharing its learned relations between different sample parts of source and target domains

8.4 Recurrent Neural Network (RNN)

8.4.1 What Is RNN?

An RNN has memory: its output is influenced by prior elements of the sequence, unlike a traditional feedforward neural network (FNN).

8.4.2 Motivation of the RNN

To "feel under the weather" means to be unwell. This phrase makes sense only when it is expressed in that specific order.

There are five major categories of RNN architecture corresponding to different tasks:
  1. simple one-to-one model for image classification tasks
  2. one-to-many model for image captioning tasks
  3. many-to-one model for sentiment analysis tasks
  4. many-to-many models for machine translation
  5. complex many-to-many models for video classification tasks

8.4.3 RNN Architecture

A significant difference between an RNN and a traditional neural network is that the weights and biases U, W, and V are shared across layers (time steps).

A partly recurrent network is a layered network with distinct input and output layers, where recurrence is limited to the hidden layer. A fully connected recurrent neural network (FRNN) connects all neurons' outputs to inputs.

8.4.4 Long Short-Term Memory (LSTM) Network

  • LSTM is a type of RNN with special hidden layers that deal with gradient explosion and vanishing-gradient problems during long-sequence training
  • better performance than naïve RNNs when training on longer sequences
  • same two hidden layers as the RNN, but a memory cell in the layer replaces the hidden node
  • LSTM has forget, memory-select, and output stages

8.4.5 Gate Recurrent Unit (GRU)

A GRU can be considered a kind of RNN, like the LSTM, designed to manage the backpropagation gradient problem.

8.4.6 Bidirectional Recurrent Neural Networks (BRNNs)

A BRNN runs RNN layers in two directions, so inference can use both previous and subsequent context, whereas a plain RNN or LSTM only carries information forward. A BRNN consists of two RNNs superimposed on each other, and the output is generated jointly from the two RNNs' states.

8.5 Transformer Technology

8.5.1 What Is Transformer?

The Transformer is a network architecture based on the attention mechanism, without relying on recurrent or convolutional units.

8.5.2 Transformer Architecture

A transformer model has two parts: (1) encoder and (2) decoder. The encoder takes the language sequence as input and maps it into a hidden representation; the decoder maps the hidden representation back into an output sequence.

8.5.3 Deep into Encoder

8.6 BERT

8.6.1 What Is BERT?

BERT (Bidirectional Encoder Representations from Transformers) is a pretrained model of language representation.

8.6.2 Architecture of BERT

Previous models were pretrained with either left-to-right or right-to-left language models; BERT is pretrained on both directions jointly.

8.6.3 Training of BERT

BERT has two training process steps: (1) pretraining and (2) fine-tuning.

8.7 Other Related Transformer Technology

8.7.1 Transformer-XL

8.7.2 ALBERT

References

10. Large Language Models (LLMs) and Generative Artificial Intelligence (GenAI)

10.1 Introduction to LLM and GenAI

10.1.1 What Is a Large Language Model (LLM)?

  • innovative machine learning models designed to learn from textual data; understand language patterns such as grammar, syntax, context, and semantics; and use sophisticated architectures to generate relevant, coherent, contextual text, translate languages, summarize content, and answer questions in NLP.
  • Transformers use self-attention mechanisms to weigh the importance of different words in a sentence regardless of their positions, overcoming RNNs' and LSTMs' limitations.

10.1.2 Understanding Generative Artificial Intelligence (GenAI)

  • GenAI- AI systems that can generate new content, whether text, images, music, or other forms of media, based on patterns learnt from vast datasets.
  • A Generative Adversarial Network (GAN) consists of two neural networks: (1) a generator to create new data instances and (2) a discriminator to evaluate them against real-world data.

10.1.3 The Intersection of LLM and GenAI

10.1.4 The Importance of LLMs in Modern AI

  1. Applications Across Industries: LLMs can automate complex tasks that previously required human expertise.
  2. Conversational AI and Customer Service
  3. Enhancing Creativity and Content Generation
  4. Multilingual and Cross-Cultural Communication
  5. The Future of Human-Machine Interaction

10.2 Foundations of LLMs

10.2.1 Neural Network Architectures

  • Multi-Layer Perceptron (MLP) is one of the earliest neural networks
  • RNNs
  • LSTMs introduced memory cells to store information over longer time spans
  • GRUs (Gated Recurrent Units) simplified the LSTM by merging certain gates
  • Convolutional Neural Networks (CNNs) are primarily used in computer vision
10.2.2 Attention Mechanisms

  • Attention mechanisms have revolutionized NLP by allowing models to selectively focus on relevant parts of the input sequence when making predictions.
  • In traditional sequence-to-sequence models, the encoder compresses the entire input sequence into a fixed-size vector, which struggles with long sentences; attention instead lets the decoder "attend" to different parts of the input sequence at each decoding step, assigning a different weight to each input token.
  • Self-attention lets each token in the sequence attend to every other token in the same sequence.
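
The attention idea can be sketched in pure Python: scores between a query and each key become softmax weights over the values (a minimal single-query illustration, not the full multi-head machinery):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector:
    weights = softmax(q·k / sqrt(d)); output = weighted sum of values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    output = [sum(wgt * v[i] for wgt, v in zip(weights, values))
              for i in range(len(values[0]))]
    return output, weights

out, w = attention([1.0, 0.0],
                   [[1.0, 0.0], [0.0, 1.0]],     # keys
                   [[1.0, 2.0], [3.0, 4.0]])     # values
print(w)  # the first key matches the query better, so it gets more weight
```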
10.2.3 The Transformer Architecture

  • The Transformer architecture signified a fundamental shift in LLM design by building on self-attention and discarding RNNs' sequential processing.
  • Its core component is the multi-head self-attention mechanism, which focuses on different parts of the input sequence simultaneously.
  • The outputs of all attention heads are concatenated and passed through a feed-forward network to generate the final output.
    10.2.4 Scaling Up: From BERT to GPT

  • BERT is a Transformer-based model to understand the contextual relationships between words in a sentence.
  • BERT proposed bidirectional training to learn from both directions simultaneously.
  • This approach allowed BERT to capture abundant contextual information and improve performance on a wide range of NLP tasks
  • training process consists of two major steps: pretraining and fine-tuning.
  • In pretraining, the model is trained on a large corpus using two unsupervised tasks—masked language modeling (MLM) and next sentence prediction (NSP)
  • In MLM, random words in a sentence are masked, and the model is tasked with predicting the missing words.
  • NSP trains the model to understand relationships between sentences.
  • GPT is a unidirectional model that processes text from left to right
  • GPT is trained with an auto-regressive approach: predict the next word in a sequence based on the previous words
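The auto-regressive objective (predict the next word from the previous words) is the same idea as the book's count-based bigram language model; GPT just replaces the counts with a Transformer. A toy count-based sketch, with an assumed corpus for illustration:

```python
from collections import Counter, defaultdict

# Toy corpus (assumed for illustration)
corpus = "I am good . he is good . we are here .".split()

# MLE bigram counts: P(w_n | w_{n-1}) = count(w_{n-1}, w_n) / count(w_{n-1})
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def predict_next(word):
    """Most likely next word after `word`, with its MLE probability."""
    counts = bigram_counts[word]
    best = max(counts, key=counts.get)
    return best, counts[best] / sum(counts.values())
```

Here `predict_next("am")` returns `("good", 1.0)`, since "good" is the only word that follows "am" in the toy corpus.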
  • Larger models not only improve performance on language tasks but also exhibit emergent capabilities that are absent in smaller models.
    10.3 Key Players in the LLM Landscape

    10.​3.​1 ChatGPT by OpenAI (Current Version:​ GPT-4)

  • A breakthrough in LLMs and NLP. These models are designed to generate human-like text by leveraging a Transformer-based architecture and vast datasets. Each generation from GPT-1 to the recent GPT-4 has demonstrated increasing complexity, performance, and applicability.
  • Pretrained on a diverse corpus of internet data including articles, books, websites, and other publicly available content.

    10.3.2 Pathways Language Model (PaLM) by Google DeepMind (Current Version: PaLM 2)

    superseded by Gemini

    10.​3.​3 Large Language Model Meta AI (LLaMA) by Meta (Current Version:​ LLaMA 2)

  • One of its motivations is to develop a model that can match or exceed GPT-3's performance at a lower computational cost.
  • With larger datasets and well-tuned training strategies, comparable performance can be achieved with fewer parameters.

    10.3.4 Claude by Anthropic (Current Version: Claude 2)

  • LLM designed for safety, alignment, and usability
  • prioritizes ethical considerations
  • Constitutional AI Framework: This approach embeds ethical constraints into the model’s learning process. Claude is trained using a set of guiding principles (a “constitution”) to evaluate and correct responses without extensive post-deployment modification.
    10.3.5 ERNIE 3.0 Titan by Baidu

  • trained in Chinese and English
  • combines auto-regressive (like GPT) and auto-encoding (like BERT) architectures

    10.4 Applications of LLMs in GenAI

    10.​4.​1 Creative Writing and Content Generation

    10.​4.​2 Language Translation

    10.​4.​3 Conversational AI and Chatbots

    10.​4.​4 Text Summarization and Content Curation

    10.​5 Ethical Considerations and Challenges

    10.​5.​1 Detecting and Mitigating Bias

  • begins with conscientious curation of training data.
  • algorithmic fairness techniques such as adversarial training, where a secondary model is used to detect and correct biased outputs from the primary model
  • Human-in-the-loop systems
    10.5.2 Privacy and Data Security

    10.​5.​3 The Spread of Misinformation

    Verifying information generated by black-box LLMs is particularly challenging because they do not provide sources for their outputs.

    10.​5.​4 Ethical Guidelines for LLM Deployment

    10.​6 Future Outlook and Research

    10.​6.​1 Current Trends in LLMs and GenAI

  • Multimodal models are designed to integrate text, images, audio, and video, progressing beyond purely textual data.
  • LLMs have advanced significantly from GPT-2 with 1.5 billion parameters to GPT-4 with hundreds of billions of parameters.
  • Specialized LLMs are domain-specific models that can be fine-tuned for industries such as medicine, law, or finance.
  • Few-shot and zero-shot learning refer to LLMs’ capabilities to perform tasks with minimal or no task-specific data.
    10.6.2 The Future of Creativity in AI

    10.​6.​3 The Role of LLMs in AI Ethics

    10.​6.​4 The Path Forward:​ Research and Development

    Errata

  • 10.1.1- It generates human-like text ranging from essays composition to code snippets creation with over 175 billion parameters to capture intricate linguistic nuances than lesser models’ endeavors.
  • Fig 10.1- What does the vertical axis mean?
  • 10.2.1-
    • Neural networks are the foundation of machine learning and NLP in LLMs development to mimic the cognitive functions of the human brain.
    • Multi-Layer Perceptron (MLP) is one of the earliest neural networks restrained by capturing temporal dependencies in data for language tasks.
    • Convolutional Neural Networks (CNNs) primarily used in computer vision tasks have accessed to NLP through sentence classification and character-level modeling applications
    14 Workshop #4: Semantic Analysis and Word Vectors Using spaCy

    14.​1 Introduction

    14.​2 What Are Word Vectors?​

    A word vector is a dense representation of a word.

    14.​3 Understanding Word Vectors

    Also called word embeddings; word2vec is a popular algorithm for learning them.

    14.​3.​1 Example:​ A Simple Word Vector

    14.4 A Taste of Word Vectors

    14.5 Analogies and Vector Operations

    14.6 How to Create Word Vectors?​

    14.7 spaCy Pretrained Word Vectors

    14.8 Similarity Method in Semantic Analysis

    14.9 Advanced Semantic Similarity Methods with spaCy

    14.9.1 Understanding Semantic Similarity

    14.9.2 Euclidean Distance

  • Distance between two points in space.
  • d = √((x₁ − x₂)² + (y₁ − y₂)²)
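The formula above generalizes to vectors of any dimension; a minimal sketch (function name is mine, not spaCy's):

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# 3-4-5 right triangle: the points (0, 0) and (3, 4) are 5 apart
d = euclidean_distance((0, 0), (3, 4))
```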
    14.9.3 Cosine Distance and Cosine Similarity

  • Cosine distance is more concerned with the orientation (angles) of two vectors in the space; cosine distance = 1 − cosine similarity.
  • Cosine similarity in general ranges from −1 to 1; for the non-negative vectors common in NLP, the score ranges from 0 to 1.
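Cosine similarity is the dot product of the two vectors divided by the product of their magnitudes; a minimal sketch (function names are mine):

```python
import math

def cosine_similarity(a, b):
    """cos(angle) between vectors a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)
```

Parallel vectors score 1 regardless of length (e.g., `(1, 0)` vs. `(2, 0)`), while orthogonal vectors score 0, which is why cosine similarity captures orientation rather than magnitude.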
    14.9.4 Categorizing Text with Semantic Similarity

    14.9.5 Extracting Key Phrases

    A noun phrase (NP) is a group of words consisting of a noun and its modifiers. Modifiers are usually pronouns, adjectives, and determiners.
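In spaCy, NPs come from iterating over a parsed `Doc`'s `noun_chunks`. Since that requires a downloaded model, here is a toy rule-based stand-in over hand-tagged `(word, POS)` pairs that follows the definition above (the chunking rule and tag set are simplified assumptions, not spaCy's actual algorithm):

```python
def noun_phrases(tagged):
    """Collect maximal runs of modifiers (DET/ADJ/PRON) ending in a NOUN."""
    phrases, current = [], []
    for word, pos in tagged:
        if pos in {"DET", "ADJ", "PRON"}:
            current.append(word)        # accumulate modifiers
        elif pos == "NOUN":
            current.append(word)        # noun closes the phrase
            phrases.append(" ".join(current))
            current = []
        else:
            current = []                # any other tag breaks the run
    return phrases

tagged = [("The", "DET"), ("quick", "ADJ"), ("fox", "NOUN"),
          ("jumps", "VERB"), ("over", "ADP"),
          ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN")]
nps = noun_phrases(tagged)   # ["The quick fox", "the lazy dog"]
```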

    14.9.6 Extracting and Comparing Named Entities

    Errata

    15 Workshop #5:​ Sentiment Analysis and Text Classification (Hour 9–10)

    15.​1 Introduction

    NLTK and spaCy are two major NLP Python implementation tools for basic text processing, N-gram modeling, POS tagging, and semantic analysis.

    1. Study text classification concepts in NLP and how spaCy NLP pipeline works on text classifier training.
    2. Use movie reviews as a problem domain to demonstrate how to implement sentiment analysis with spaCy.
    3. Introduce Artificial Neural Network (ANN) concepts and the TensorFlow and Keras technologies.
    4. Introduce a sequential modeling scheme with LSTM technology, using the movie reviews domain as an example that integrates these technologies for text classification and movie sentiment analysis.

    15.​2 Text Classification with spaCy and LSTM Technology

    TextCategorizer is spaCy's text classifier component. It is applied to a dataset for sentiment analysis, performing text classification with two vital Python frameworks: (1) the TensorFlow Keras API and (2) spaCy.

    15.​3 Technical Requirements

    15.​4 Text Classification in a Nutshell

    15.​4.​1 What Is Text Classification?​

    Text Classification is the task of assigning a set of predefined labels to text.

    • Language detection is the first step in many NLP systems, e.g., machine translation.
    • Topic generation and detection is the process of summarizing, or classifying, a batch of sentences into topics of interest (TOI) or topic titles.
    • Sentiment analysis classifies or analyzes users' responses, comments, and messages on a particular topic as positive, neutral, or negative.

    15.​4.​2 Text Classification as AI Applications

    Text classification is considered a supervised-learning (SL) task

    1. Binary text classification refers to categorizing text into two classes.
    2. Multi-class text classification refers to categorizing texts with more than two classes.
    3. Multi-label text classification generalizes its multi-class counterpart: more than one label can be assigned to each example text (e.g., toxic, severe toxic, threat)

    15.​5 Text Classifier with spaCy NLP Pipeline

    TextCategorizer (tCategorizer) is spaCy's text classifier component. It requires class labels and examples in the NLP pipeline to perform the training procedure.

    15.​5.​1 TextCategorizer Class

    TextCategorizer consists of (1) single-label and (2) multi-label classifiers.

    15.​5.​2 Formatting Training Data for the TextCategorizer
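spaCy's TextCategorizer expects each training text paired with a dict of category scores under the `"cats"` key (1.0 means the label applies, 0.0 means it does not). A minimal binary-sentiment example (the label names `POS`/`NEG` and the texts are illustrative; the exact call for feeding these into the pipeline differs between spaCy v2 and v3):

```python
# Training examples for a binary sentiment TextCategorizer: each text is
# paired with a dict whose "cats" entry scores every label.
train_data = [
    ("This movie was wonderful.", {"cats": {"POS": 1.0, "NEG": 0.0}}),
    ("A dull, predictable film.", {"cats": {"POS": 0.0, "NEG": 1.0}}),
]

# Every example must score all labels, even the ones that do not apply.
labels = sorted(train_data[0][1]["cats"])
```

For multi-label classification the format is the same, except several labels may be 1.0 at once.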

    15.​5.​3 System Training

    15.​5.​4 System Testing

    15.​5.​5 Training TextCategorizer for Multi-Label Classification

    15.​6 Sentiment Analysis with spaCy

    15.​6.​1 IMDB Large Movie Review Dataset

    15.​6.​2 Explore the Dataset

    15.6.3 Training the TextClassifier

    15.​7 Artificial Neural Network in a Nutshell

    In this workshop section, we will learn how to incorporate spaCy technology with ANN technology using TensorFlow and its Keras package.

    A typical ANN has
    1. Input layer consists of input neurons, or nodes
    2. Hidden layer consists of hidden neurons, or nodes
    3. Output layer consists of output neurons, or nodes

    15.​8 An Overview of TensorFlow and Keras

  • TensorFlow is a popular Python tool widely used for machine learning.
  • Keras is a Python-based deep learning tool that can be integrated with platforms such as TensorFlow, Theano, and CNTK.

    15.9 Sequential Modeling with LSTM Technology

    Keras has extensive support for RNN variants such as GRU and LSTM, and a simple API for training RNNs.

    15.​10 Keras Tokenizer in NLP

    Tokens are vectorized by the following steps:
    1. Tokenize each utterance, turning the utterances into sequences of tokens.
    2. Build a vocabulary from the set of tokens produced in Step 1. These are the tokens to be recognized by the neural network.
    3. Assign an ID to each token in the vocabulary.
    4. Map token vectors to their corresponding token-IDs.
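In Keras these steps are handled by the `Tokenizer` class (`fit_on_texts` builds the vocabulary, `texts_to_sequences` maps utterances to token-IDs). A minimal pure-Python stand-in that mirrors that behavior, so the steps can be run without TensorFlow installed:

```python
class ToyTokenizer:
    """Minimal stand-in for Keras's Tokenizer: builds a vocabulary and maps
    each utterance to a sequence of token-IDs (IDs start at 1, so that 0
    can be reserved for padding)."""
    def __init__(self):
        self.word_index = {}          # token -> token-ID

    def fit_on_texts(self, texts):
        for text in texts:
            for token in text.lower().split():
                if token not in self.word_index:
                    self.word_index[token] = len(self.word_index) + 1

    def texts_to_sequences(self, texts):
        return [[self.word_index[t] for t in text.lower().split()
                 if t in self.word_index] for text in texts]

tok = ToyTokenizer()
tok.fit_on_texts(["great movie", "terrible movie"])
seqs = tok.texts_to_sequences(["great movie"])   # [[1, 2]]
```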

    15.​10.​1 Embedding Words

    Tokens can be transformed into token vectors. Embedding tokens into vectors occurs via an embedding lookup table.
    1. Each utterance is divided into tokens, and a vocabulary is built with Keras' Tokenizer.
    2. The Tokenizer object holds a token index with a token → token-ID mapping.
    3. Given a token-ID, look up the embedding-table row for that ID to acquire the token vector.
    4. This token vector is fed to the neural network.
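The lookup in Step 3 is just row indexing into a matrix with one row per token-ID; a NumPy sketch (vocabulary size, embedding dimension, and the random table are illustrative — in a real model the table is a trained Embedding layer):

```python
import numpy as np

vocab_size, embed_dim = 5, 3
rng = np.random.default_rng(2)
embedding_table = rng.normal(size=(vocab_size, embed_dim))  # one row per token-ID

token_ids = np.array([1, 4, 2])             # an utterance as token-IDs
token_vectors = embedding_table[token_ids]  # row lookup = the embedding step
```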

    15.11 Movie Sentiment Analysis with LSTM Using Keras and spaCy

    This section demonstrates the design of an LSTM-based RNN text classifier for sentiment analysis, with the steps below:
    1. Data retrieval and preprocessing.
    2. Tokenize review utterances, with padding.
    3. Create the padded sequences of utterances and feed them into the input layer.
    4. Vectorize each token via its token-ID in the embedding layer.
    5. Input the token vectors into the LSTM.
    6. Train the LSTM network.
  • Step 1: Dataset
  • Step 2: Data and vocabulary preparation
  • Step 3: Implement the Input Layer
  • Step 4: Implement the Embedding Layer
  • Step 5:​ Implement the LSTM Layer
  • Step 6:​ Implement the Output Layer
  • Step 7:​ System Compilation
  • Step 8: Model Fitting and Experiment Evaluation
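The padding in Steps 2–3 makes every review the same length so a batch can enter the input layer as one rectangular array. Keras provides `pad_sequences` for this; a minimal pure-Python stand-in that mirrors its default pre-padding/pre-truncating behavior (an assumption about the defaults worth checking against the Keras docs):

```python
def pad_sequences(seqs, maxlen, value=0):
    """Left-pad (or left-truncate) each token-ID sequence to exactly maxlen,
    mirroring the default 'pre' behavior of Keras pad_sequences."""
    padded = []
    for seq in seqs:
        seq = list(seq)[-maxlen:]                    # keep the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

batch = pad_sequences([[5, 8], [3, 1, 4, 1, 5, 9]], maxlen=4)
# short review padded with 0s, long review truncated from the front
```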