These are notes from Natural Language Processing: A Textbook with Python Implementation by Raymond Lee.

Definitions
Bigram - 2 words. N=2. I am. He is. We are. Used frequently.
Bigram normalization - division of each bigram count by the appropriate unigram count for wn-1
Corpus - a collection of written texts
Language model - traditional word counting model to count and calculate conditional probability to predict the probability based on a word sequence
Lemma - Basic word shared by word forms. In the dictionary.
Lemmatization - Get the lemma. Slow. Am, is, are, was, were → Be
Morpheme - Smallest meaningful component of a word.
Morphology - Study of words.
N-gram - statistical model consisting of a sequence of N words.
Quadrigram - 4 words. N=4.
Sentence - unit of written language; domain specific
Stem - Root of a word. Might not be in the dictionary.
Stemming - Get the word's root. Fast. Student, studious, study → Stud
Tokens - can be meaningful words or symbols, punctuation, or distinct characters
Tuple - Ordered, immutable collection of elements.
Trigram - 3 words. N=3. I am good. See spot run. We go there. Used rarely.
Types/Word Types - distinct words in a corpus
Unigram - 1 word. N=1. Seldom used.
Utterance - unit of spoken language; domain and culture specific
Word Form - another basic entity in a corpus
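
The unigram/bigram/trigram definitions above can be sketched with a small helper (pure Python; the function name is illustrative, not from the book):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I am very good".split()
bigrams = ngrams(tokens, 2)   # [('I', 'am'), ('am', 'very'), ('very', 'good')]
trigrams = ngrams(tokens, 3)  # [('I', 'am', 'very'), ('am', 'very', 'good')]
```

NLTK provides the same idea as `nltk.util.ngrams`.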
Abbreviations
ASR - Automatic Speech Recognition
ATN - Augmented Transition Network
BERT - Bidirectional Encoder Representations from Transformers
CDD - Conceptual Dependency Diagram
CFG - Context-Free Grammar
CFL - Context-Free Language
CNF - Chomsky Normal Form
CNN - Convolutional Neural Network
CR - Coreference Resolution
DT - Determiner
FOPC - First-Order Predicate Calculus
GPT - Generative Pre-trained Transformer
GRU - Gated Recurrent Unit
HMM - Hidden Markov Model
IE - Information Extraction
IR - Information Retrieval
LLM - Large Language Model
LME - Language Model Evaluation
LSTM - Long Short-Term Memory
MEMM - Maximum Entropy Markov Model
MLE - Maximum Likelihood Estimation
MeSH - Medical Subject Headings
NER - Named Entity Recognition
NLTK - Natural Language Toolkit
NLU - Natural Language Understanding
NN - Noun
NNP - Proper Noun
Nom - Nominal
NP - Noun Phrase
PCFG - Probabilistic Context-Free Grammar
PMI - Pointwise Mutual Information
POS - Part of Speech
POST - Part-of-Speech Tagging
PPMI - Positive Pointwise Mutual Information
RLHF - Reinforcement Learning from Human Feedback
RNN - Recurrent Neural Network
TBL - Transformation-Based Learning
VB - Verb
VP - Verb Phrase
WSD - Word Sense Disambiguation
1. Natural Language Processing

1.1 Introduction

Modern chatbots represent human-computer interaction and require world knowledge and domain knowledge.

Knowledge is organized into knowledge trees or ontology graphs.

NLP is cross-disciplinary integration of philosophy, psychology, linguistics, and computational linguistics.

computational linguistics - multidisciplinary study of epistemology, philosophy, psychology, cognitive science, and agent ontology.

1.2 Human Language and Intelligence

Core technologies and methodologies arose from the Turing Test.

Human language is categorized as written and oral.

1.3 Linguistic Levels of Human Language

6 (7?) levels of linguistic analysis: phonetics, phonology, morphology, lexicology, syntax, semantics, and pragmatics.

1.4 Human Language Ambiguity

  • Lexical ambiguity- words can have different meanings
  • Syntactic- who is holding the bag?
  • Semantic- what does "it" mean in a sentence?
  • Pragmatic- what does the sentence mean?
1.5 A Brief History of NLP

1.5.1 First Stage: Machine Translation (Before the 1960s)

Leibniz and Descartes codified relationships between words and sentences.

Alan Turing wrote Computing Machinery and Intelligence in 1950 which proposed the Turing test.

The Georgetown-IBM experiment automatically translated more than sixty Russian sentences into English in 1954.

Chomsky wrote Syntactic Structures in 1957.

The ALPAC report threw cold water on NLP research in 1966.

1.5.2 Second Stage: Early AI on NLP (1960s-1970s)

The Baseball Q&A program was developed in 1961.

Marvin Minsky wrote Semantic Information Processing in 1968.

William Woods developed augmented transition networks (ATN) in 1970.

1.5.3 Third Stage: Grammatical Logic on NLP (1970s-1980s)

1.5.4 Fourth Stage: AI and Machine Learning (1980s-2000s)

IBM began developing Watson, a DeepQA program which could compete on Jeopardy.

1.5.5 Fifth Stage: Rise of BERT, Transformer, ChatGPT, and LLMs (2000s-Present)

Long short-term memory (LSTM) recurrent neural networks (RNN) became dominant.

Google Brain published "Attention Is All You Need" in 2017, which introduced the transformer architecture.

Google introduced BERT in 2018.

OpenAI developed ChatGPT which utilizes generative pre-trained transformers.

1.6 NLP and AI

Not informative.

1.7 Main Components of NLP

NLP consists of: Natural Language Understanding (NLU), Knowledge Acquisition and Inferencing (KAI), and Natural Language Generation (NLG).

1.8 Natural Language Understanding (NLU)

NLU is a process of understanding spoken language in four stages: speech recognition, and syntactic, semantic, and pragmatic analysis.

1.8.1 Speech Recognition

Speech recognition is the first stage in NLU that performs phonetic, phonological, and morphological processing to analyze spoken language.

Breaks spoken words (utterances) down into distinct tokens representing paragraphs, sentences, and words.

Current speech recognition models use spectrogram analysis.

1.8.2 Syntax Analysis

Reject phrases like, "I you love."

1.8.3 Semantic Analysis

Reject nonsense like, "hot snowflakes".

1.8.4 Pragmatic Analysis

Requires expert knowledge or just common sense.

1.9 Potential Applications of NLP

NLP is used for translation, information extraction (IE), information retrieval (IR), sentiment analysis, and chatbots.

1.9.1 Machine Translation (MT)

Earliest NLP application.

Although it is not difficult to translate one language to another...(?!)

1.9.2 Information Extraction (IE)

Extract key language information from texts or utterances automatically.

1.9.3 Information Retrieval (IR)

Organize, retrieve, store, and evaluate information from documents and multimedia.

1.9.4 Sentiment Analysis

Analyze user sentiment toward products, people, and ideas from social media, forums, and online platforms.

1.9.5 Question-Answering (Q&A) Chatbots

Not informative

Errata

2. N-Gram Language Model

2.1 Introduction

2.2 N-Gram Language Model

The N-gram language model predicts words using probabilities. An N-gram is a statistical model consisting of a word sequence in N number (unigram, bigram, trigram...).

2.2.1 Basic NLP Terminology

Sentence - unit of written language; domain specific
Utterance - unit of spoken language; domain and culture specific
Word Form - another basic entity in a corpus
Types/Word Types - distinct words in a corpus
Tokens - can be meaningful words or symbols, punctuation, or distinct characters
Stem - Root of a word. Might not be in the dictionary.
Stemming - Get the word's root. Fast. Student, studious, study → Stud
Lemma - Basic word shared by word forms. In the dictionary.
Lemmatization - Get the lemma. Slow. Am, is, are, was, were → Be
Corpora
Google - Over a trillion English tokens with over a million meaningful wordform types, sufficient to generate sentences/utterances for daily use
Brown Corpus - First well-organized corpus. Brown University. 1961. Over 583 million tokens, 293,181 wordform types and foreign words.
Wall Street Journal - Financial domain
Associated Press - International news
Hansard - British parliamentary speeches
BU Broadcast News Corpus
NLTK Corpus

A language model is a traditional word counting model to count and calculate conditional probability to predict the probability based on a word sequence.

2.2.2 Language Modeling and Chain Rule

Conditional probability
P(A|B) - probability of A given B.
P(A∩B) - probability of both A and B.

Chain rule
P(A|B) = P(A∩B) / P(B)
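
The chain rule above reduces to a ratio of counts in a word-counting language model. A small numeric sketch (the counts are made up for illustration):

```python
# Hypothetical counts from a tiny corpus: how often "am" follows "I".
count_I = 10      # occurrences of the unigram "I"
count_I_am = 7    # occurrences of the bigram "I am"

# P(am | I) = P(I ∩ am) / P(I), which reduces to a ratio of counts.
p_am_given_I = count_I_am / count_I
print(p_am_given_I)  # 0.7
```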

2.3 Markov Chain in N-Gram Model

A Markov chain is a process that describes a sequence of possible events where the probability of each event depends only on the previous event.

Entire sentences can be modelled as Markov chains.

2.4 Example: The Adventures of Sherlock Holmes

Maximum Likelihood Estimation (MLE) is another method to calculate the N-gram model.

Bigram normalization is the division of each bigram count by the unigram count of the preceding word wn-1.
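
Bigram normalization (the MLE estimate) can be sketched in a few lines; the toy corpus here is made up for illustration:

```python
from collections import Counter

tokens = "i am sam sam i am i do not like green eggs".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_mle(w_prev, w):
    """MLE estimate: count(w_prev, w) / count(w_prev)."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

# "i am" occurs 2 times and "i" occurs 3 times, so P(am | i) = 2/3.
print(bigram_mle("i", "am"))
```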

2.5 Shannon’s Method in N-Gram Model

  1. Choose a random N-gram (<s>, w) according to its probability
  2. Now choose a random N-gram (w, x) according to its probability
  3. And so on until we choose </s>
  4. String the words together into a sentence

Constructing sentences from trigrams seems to be most suitable.
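
Steps 1-4 of Shannon's method can be sketched with a bigram successor table (the toy corpus and function name are illustrative):

```python
import random
from collections import defaultdict

# Toy corpus with sentence markers <s> and </s>.
corpus = [["<s>", "i", "am", "good", "</s>"],
          ["<s>", "i", "am", "here", "</s>"],
          ["<s>", "we", "are", "good", "</s>"]]

# Duplicates in each successor list make random.choice frequency-weighted.
successors = defaultdict(list)
for sent in corpus:
    for w1, w2 in zip(sent, sent[1:]):
        successors[w1].append(w2)

def shannon_sentence(seed=0):
    """Start at <s>, sample successors by relative frequency,
    stop at </s>, and string the words into a sentence."""
    rng = random.Random(seed)
    word, out = "<s>", []
    while True:
        word = rng.choice(successors[word])
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

print(shannon_sentence())
```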

2.6 Language Model Evaluation and Smoothing Techniques

Language Model Evaluation (LME) is the standard method of training parameters on a training set and constantly reviewing model performance on new data.

In practice, the model is fitted to data it has seen, called training data (the training set), and then checked against unseen information, called test data (the test set), to see whether it still works.

Can deal with unknown words.

2.6.1 Perplexity

Perplexity (PP) is the inverse probability of the test set assigned by the language model, normalized by the number of words.

minimizing perplexity is the same as maximizing probability for model performance
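
The definition above, PP(W) = P(w1...wN)^(-1/N), can be computed directly; log space avoids underflow on long sequences:

```python
import math

def perplexity(probs):
    """PP = (product of 1/p_i)^(1/N), computed in log space for stability."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)

# A model that assigns every word probability 1/4 has perplexity 4:
# it is as confused as a uniform choice among 4 words.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```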

2.6.2 Extrinsic Evaluation Scheme

Simply compare two models’ application performance regarding N-gram evaluation.

Time consuming

2.6.3 Zero Counts Problems

Many possible bigrams never show up in training data.

Zipf's Law- in a large corpus, the frequency of any word is inversely proportional to its rank, where the most frequent word occurs twice as often as the second, and thrice as often as the third.

2.6.4 Smoothing Techniques

Smoothing techniques compensate for Zipf's Law.

2.6.5 Laplace (Add-One) Smoothing

Add 1 to all possible unigrams and bigrams.
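
Add-one smoothing changes the bigram estimate to (count + 1) / (count(w_prev) + V), where V is the vocabulary size. A minimal sketch with a made-up corpus:

```python
from collections import Counter

tokens = "i am sam sam i am".split()
V = len(set(tokens))                      # vocabulary size (3 here)
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_laplace(w_prev, w):
    """Add-one: (count(w_prev, w) + 1) / (count(w_prev) + V)."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

print(p_laplace("i", "am"))   # seen bigram: (2+1)/(2+3) = 0.6
print(p_laplace("am", "i"))   # unseen bigram gets a small non-zero mass: 0.2
```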

2.6.6 Add-k Smoothing

Add a fractional count k (0 < k < 1) to all possible N-grams.

2.6.7 Backoff and Interpolation Smoothing

2.6.8 Good Turing Smoothing

Errata

3. Part-of-Speech (POS) Tagging

3.1 What Is Part of Speech (POS)?

  • Category of words that have similar grammatical behaviors or properties.
  • Inflection- items are added to the base form of a word to convey grammatical meanings (e.g., cat and cats)

3.1.1 Nine Major POS in the English Language

  1. adjectives
  2. verbs
  3. pronouns
  4. conjunctions
  5. prepositions
  6. articles
  7. adverbs
  8. nouns
  9. interjections

3.2 POS Tagging

3.2.1 What Is POS Tagging in Linguistics?

Labelling a word according to a particular POS based on definition and contexts in linguistics.

3.2.2 What Is POS Tagging in NLP?

Automatic description assignment to words or tokens.

The operation of converting a sentence into a list of words and then a list of tuples, where each tuple is a (word, tag) pair signifying the POS.

3.2.3 POS Tags Used in the PENN Treebank Project

POS tag databank provided by the PENN Treebank corpus

classifies nine major POS into subclasses that have a total of 45 POS tags

Penn Treebank (PTB) corpus has a comprehensive section of WSJ articles

3.2.4 Why Do We Care About POS in NLP?

3.3 Major Components in NLU

  1. Morphology- understanding the shapes and patterns of every word in a sentence.
  2. POS tagging
  3. Syntax
  4. Semantics
  5. Discourse integration- relationships between different sentences and their contents.

3.3.1 Computational Linguistics and POS

POS tagging can be considered as the fundamental process in computational linguistics

3.3.2 POS and Semantic Meaning

3.3.3 Morphological and Syntactic Definition of POS

3.4 Nine Key POS in English

  1. pronoun
  2. verb
  3. adjective (quick)
  4. adverb (quickly)
  5. interjection (Hey!)
  6. noun
  7. conjunction (and, or, but)
  8. preposition (in, on, at)
  9. article (a, the)

3.4.1 English Word Classes

Two types of English word classes: closed and open.

Closed-class words are also known as functional/grammar words. They are closed since new words are seldom created in the class. (conjunctions, determiners, pronouns, and prepositions)

Open-class words are also known as lexical/content words. They are open because new words are frequently added to the class; their meanings can be found in dictionaries and interpreted individually. (nouns, verbs, adjectives, and adverbs)

3.4.2 What Is a Preposition?

used before nouns

approximately 80-100 prepositions in English

3.4.3 What Is a Conjunction?

Coordinating conjunctions join words, clauses, or phrases of equal grammatical rank. (and, but, for, nor, or, yet)

Subordinating conjunctions join independent and dependent clauses to present a relationship (as, although, because, since, though, while, whereas)

3.4.4 What Is a Pronoun?

3.4.5 What Is a Verb?

3.5 Different Types of POS Tagset

3.5.1 What Is Tagset?

A tagset is a collection of POS tags used to indicate the part of speech and sometimes other grammatical categories such as case and tense.

3.5.2 Ambiguity in POS Tags

  • Noun-verb ambiguity. (record)
  • Adjective-verb ambiguity. (perfect)
  • Adjective-noun ambiguity. (complex)

3.5.3 POS Tagging Using Knowledge

four knowledge sources for POS tagging
  1. dictionary
  2. morphological rules (capitalization, suffixes...)
  3. N-gram frequencies (next word prediction)
  4. combinations of structural relationships (figure it out)

3.6 Approaches for POS Tagging

3 basic approaches to POS Tagging: Rule-based, Stochastic-based, and Hybrid Tagging

3.6.1 Rule-Based Approach POS Tagging

two-stage process: (1) a dictionary lists all possible POS tags for each word; (2) words left ambiguous by more than one possible tag are resolved with handwritten grammatical rules that assign the correct tag according to surrounding words

rule generation can be achieved by (1) hand creation and (2) training from a corpus with machine learning.
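The two-stage process can be sketched in pure Python (the mini-lexicon and the single disambiguation rule are hypothetical, chosen only to illustrate the idea):

```python
# Stage 1: a dictionary listing every possible POS tag per word.
LEXICON = {
    "the": ["DT"],
    "record": ["NN", "VB"],   # noun-verb ambiguous
    "play": ["VB", "NN"],
}

def tag(words):
    """Stage 2: resolve ambiguity with a handwritten rule —
    after a determiner (DT), prefer the noun (NN) reading."""
    tags = []
    for i, w in enumerate(words):
        candidates = LEXICON.get(w, ["NN"])   # unknown words default to NN
        if len(candidates) > 1 and i > 0 and tags[-1] == "DT" and "NN" in candidates:
            tags.append("NN")
        else:
            tags.append(candidates[0])
    return list(zip(words, tags))

print(tag(["the", "record"]))  # [('the', 'DT'), ('record', 'NN')]
```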

3.6.2 Example of Rule-Based POS Tagging

3.6.3 Example of Stochastic-Based POS Tagging

3.6.4 Hybrid Approach for POS Tagging Using Brill’s Taggers

3.6.5 What Is Transformation-Based Learning?

3.6.6 Hybrid POS Tagging: Brill’s Tagger

3.6.7 Learning Brill’s Tagger Transformations

3.7 Taggers Evaluations

3.7.1 How Good Is a POS Tagging Algorithm?

Errata

4. Syntax and Parsing

4.​1 Introduction and Motivation

Grammar rules are used to create sentences with correct syntax.

Syntax analysis is used to analyze the structure and the relationship between tokens to create a parse tree.

4.​2 Syntax Analysis

4.​2.​1 What Is Syntax

Syntax is the set of rules that govern how groups of words are combined to form phrases, clauses, and sentences.

Syntax can be defined as the correct arrangement of word tokens in written or spoken sentences

4.​2.​2 Syntactic Rules

Syntactic rules describe how the parts of a language combine so that sentences make sense.

4.​2.​3 Common Syntactic Patterns

There are seven common syntactic patterns:
  1. Subject → Verb
  2. Subject → Verb → Direct Object
  3. Subject → Verb → Subject Complement
  4. Subject → Verb → Adverbial Complement
  5. Subject → Verb → Indirect Object → Direct Object
  6. Subject → Verb → Direct Object → Direct Complement
  7. Subject → Verb → Direct Object → Adverbial Complement

4.​2.​4 Importance of Syntax and Parsing in NLP

5 major components in Natural Language Understanding (NLU):
  1. Morphology
  2. POS Tagging
  3. Syntax
  4. Semantics
  5. Discourse Integration

4.​3 Types of Constituents in Sentences

4.​3.​1 What Is Constituent?​

A constituent is considered as the linguistic component of a language.

Words or phrases that combine into a sentence are constituents.

4.​3.​2 Kinds of Constituents

  1. noun-phrase
  2. verb-phrase
  3. preposition-phrase

4.3.2.1 Noun Phrase (NP)

consists of a noun and its modifiers

4.3.2.2 Verb Phrase (VP)

consists of a main verb accompanied by linking verbs or modifiers

4.​3.​3 Complexity on Simple Constituents

"Big red handbag" makes sense; "red big handbag" does not.

4.3.4 Verb Phrase Subcategorization

Traditional English grammar classifies verbs into transitive (object) and intransitive subcategories; modern English grammars identify more than 100 subcategories.

4.​3.​5 The Role of Lexicon in Parsing

A lexicon is the vocabulary of a language or a specific field of knowledge.

Linguists believe that all languages are composed of two major components: (1) lexicon and (2) grammar.

Items within a lexicon are called lexemes, and groups of lexemes are called lemmas, often used to describe the size of a lexicon.

Lexical analysis- understand what words mean and intuit contexts.

A program that performs such lexical analysis is called a tokenizer, lexer, or scanner.

A lexer is generally combined with a parser to analyze the syntax of text.

4.​3.​6 Recursion in Grammar Rules

4.​4 Context-Free Grammar (CFG)

4.​4.​1 What Is Context Free Language (CFL)?​

Context-free language (CFL) is a superset of regular language (RL), generated by context-free grammar (CFG); every RL is a CFL, but not every CFL is an RL.

Where CFL sits in the hierarchy:
  1. Recursively enumerable languages form the largest class.
  2. Context-sensitive languages are a subset of recursively enumerable languages.
  3. CFLs are subsets of context-sensitive languages.
4 levels of human language:
  1. regular
  2. context-free
  3. context-sensitive
  4. recursively enumerable

Most arithmetic expressions generated by a Context-Free Grammar (CFG) are CFLs.

4.​4.​2 What Is Context Free Grammar (CFG)?​

A CFG describes a CFL as a set of recursive rules for generating string patterns.

CFG is commonly applied in linguistics and compiler design to describe programming languages, and parsers can be created from it automatically.

4.​4.​3 Major Components of CFG

4 major components-
  1. A set of non-terminal symbols N
  2. A set of terminal symbols Σ
  3. A set of production rules P
  4. A designated start symbol S
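The four components can be sketched as a toy grammar plus a recursive generator (the productions below are made up for illustration, not the book's fragment of English):

```python
import random

# N = {S, NP, VP, DT, NN, VB}, Σ = the terminal words,
# P = the productions below, start symbol S.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["DT", "NN"]],
    "VP": [["VB", "NP"], ["VB"]],
    "DT": [["the"]],
    "NN": [["dog"], ["piano"]],
    "VB": [["plays"]],
}

def generate(symbol, rng):
    """Expand non-terminals recursively until only terminals remain."""
    if symbol not in GRAMMAR:          # terminal symbol
        return [symbol]
    production = rng.choice(GRAMMAR[symbol])
    return [word for part in production for word in generate(part, rng)]

print(" ".join(generate("S", random.Random(0))))
```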

4.​4.​4 Derivations Using CFG

4.​5 CFG Parsing

3 CFG parsing levels: morphological, phonological, and syntactic

4.​5.​1 Morphological Parsing

Determines the morphemes from which a word is constructed (cats is the plural of cat).

4.​5.​2 Phonological Parsing

Phonological parsing interprets sounds as words and phrases.

4.​5.​3 Syntactic Parsing

identify relevant components and correct grammar of a sentence.

4.​5.​4 Parsing as a Kind of Tree Searching

4.​5.​5 CFG for Fragment of English

4.​5.​6 Parse Tree for “Play the Piano” for Prior CFG

4.​5.​7 Top-Down Parser

4.​5.​8 Bottom-Up Parser

4.​5.​9 Control of Parsing

4.​5.​10 Pros and Cons of Top-Down vs.​ Bottom-Up Parsing

4.​6 Lexical and Probabilistic Parsing

4.​6.​1 Why Using Probabilities in Parsing?​

4.​6.​2 Semantics with Parsing

4.​6.​3 What Is PCFG?​
A probabilistic context-free grammar (PCFG) is a context-free grammar that associates each of its production rules with a probability.
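Because each rule carries a probability, the probability of a whole parse tree is the product of the probabilities of the rules used in its derivation. A sketch with made-up rule probabilities:

```python
# Hypothetical PCFG rule probabilities for deriving "the dog sleeps".
rule_probs = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("DT", "NN")): 0.5,
    ("DT", ("the",)): 1.0,
    ("NN", ("dog",)): 0.2,
    ("VP", ("VB",)): 0.3,
    ("VB", ("sleeps",)): 0.1,
}

# Each rule used in the derivation contributes one factor.
p_tree = 1.0
for rule in rule_probs:
    p_tree *= rule_probs[rule]

print(p_tree)  # 1.0 * 0.5 * 1.0 * 0.2 * 0.3 * 0.1 = 0.003
```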

4.​6.​4 A Simple Example of PCFG

4.​6.​5 Using Probabilities for Language Modelling

4.​6.​6 Limitations for PCFG

4.​6.​7 The Fix–Lexicalized Parsing

Errata

5 Meaning Representation

5.​1 Introduction

This chapter will focus on how to interpret meaning, introducing scientific and logical methods for processing meaning, known as meaning representation.

5.​2 What Is Meaning?​

Meaning refers to the message conveyed by words, phrases, and sentences or utterances within a given context.

Often referred to as lexical or semantic meaning.

Meanings of sentences are not simply the combination of words' meanings, but usually the phrasal words with specific meanings at the pragmatic level (off the wagon).

Semantic meaning is the study of meaning assignment to minimal meaning-bearing elements.

5.​3 Meaning Representations

Unlike parse trees, meaning representations are not primarily a description of the input's structure but a representation of how humans understand and represent things (such as actions, events, objects, etc.).

5 types of meaning representation:

  1. categories- specific entities
  2. events
  3. time- specific moment
  4. aspects
    1. Stative- facts
    2. Activity
    3. Accomplishment?
    4. Achievement?
  5. beliefs, desires, and intentions

5.​4 Semantic Processing

Semantic processing encodes and interprets meaning.
  1. Reason relations with the environment.
  2. Answer questions based on contents.
  3. Perform inference based on knowledge and determine the verity of unknown facts.

5.​5 Common Meaning Representation

4 common methods of meaning representation.

5.5.1 First-Order Predicate Calculus (FOPC)

Also known as predicate logic.

Expresses the relationship between information objects as predicates.

5.​5.​2 Semantic Networks

Knowledge representation techniques used for propositional information. A semantic net can be represented as a labeled directed graph.

5.​5.​3 Conceptual Dependency Diagram (CDD)

Theory that describes how sentence/utterance meaning is represented for reasoning.

5.​5.​4 Frame-Based Representation

Frames and notions as basic components to characterize domain knowledge.

A frame is a knowledge configuration to characterize a concept such as a car or driving a car. (like an object with properties)

5.​6 Requirements for Meaning Representation

5.​6.​1 Verifiability

determining whether a sentence/utterance has a literal meaning

5.​6.​2 Ambiguity

word, statement, or phrase that consists of more than one meaning

5.​6.​3 Vagueness

borderline cases (short, tall,...)

5.​6.​4 Canonical Forms

5.6.4.1 What Is Canonical Form?

Simplest expression. Instead of 0.000, just say 0.

5.6.4.2 Canonical Form in Meaning Representation

Instead of "Jack likes to eat candy all day.", just say "Jack eats candy."

5.6.4.3 Canonical Forms: Pros and Cons

Advantages- (1) simplify reasoning and storage operations; (2) no need to generate inference rules for all the different variations.

May complicate semantic analysis.

5.​7 Inference

5.​7.​1 What Is Inference?​

deduction and induction

Induction goes from specific to general; Deduction goes from general to specific.

Induction builds theories. Deduction tests them.

5.​7.​2 Example of Inferencing with FOPC

5.​8 Fillmore’s Theory of Universal Cases

Case grammar is a linguistic system that focuses on the connection between the valence of a verb, the quantities such as subject and object it takes, and the grammatical context.

Only a limited number of semantic roles (case roles) occur in every sentence constructed with verbs.

5.​8.​1 What Is Fillmore’s Theory of Universal Cases?​

Each verb needs a certain number of case roles to form a case-frame

Thematic role (semantic role) refers to the case role that a noun phrase (NP) may play with respect to the action or state described by the main verb.

5.​8.​2 Major Case Roles in Fillmore’s Theory

  1. Agent—doer of the action, with attributed intention.
  2. Experiencer—doer of the action, without intention.
  3. Theme—thing that undergoes change or is acted upon.
  4. Instrument—tool used to perform the action.
  5. Beneficiary—person or thing for which the action is performed.
  6. To/At/From Loc/Poss/Time—place, time, or possession.

5.​8.​3 Complications in Case Roles

4 types of complications in case role analysis:
  1. Syntactic constituents’ ability to indicate semantic roles in several cases
  2. Syntactic expression option availability
  3. Prepositional ambiguity not always introduces the same role
  4. Role options in a sentence

5.8.3.1 Selectional Restrictions

Selectional restrictions are methods to restrict types of certain roles to be used for semantic consideration.

5.​9 First-Order Predicate Calculus

5.​9.​1 FOPC Representation Scheme

FOPC can be used as a framework to derive semantic representation of a sentence.

FOPC supports:
  1. Reasoning via truth-condition analysis to answer yes/no questions.
  2. Variables for general cases through variable binding in responses and storage.
  3. Inference to answer with new knowledge beyond what is stored in the KB.

5.​9.​2 Major Elements of FOPC

  1. terms
    1. constants- specific objects described in the sentence
    2. functions- concepts expressed as genitives (e.g., the owner), such as a brand name or location
    3. variables- placeholders that leave open which object is referred to
  2. predicates- property of a subject; usually verbs
  3. connectives- conjunctions, implications (if...then), negations
  4. quantifiers

5.​9.​3 Predicate-Argument Structure of FOPC

5.​9.​4 Meaning Representation Problems in FOPC

5.​9.​5 Inferencing Using FOPC

Inference- validate or prove whether a proposition is true or false from a KB.

Modus Ponens (MP)- given P and P→Q, conclude Q.
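
Repeated Modus Ponens over a knowledge base is forward chaining. A tiny sketch (the propositions are made up for illustration):

```python
# Facts and implications (P → Q) in a toy knowledge base.
facts = {"is_raining"}
rules = [("is_raining", "ground_is_wet"),
         ("ground_is_wet", "shoes_get_muddy")]

changed = True
while changed:                 # apply Modus Ponens until nothing new follows
    changed = False
    for p, q in rules:
        if p in facts and q not in facts:
            facts.add(q)       # P and P→Q, therefore Q
            changed = True

print(facts)
```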

Errata

8. Transfer Learning and Transformer Technology

8.​1 What Is Transfer Learning?​

Transfer learning (TL) involves solving a problem by leveraging acquired knowledge and applying that knowledge to address another related problem

8.​2 Motivation of TL

Traditional ML datasets and trained model parameters cannot be reused.

8.​2.​1 Categories of TL

Heterogeneous and Homogeneous

8.​3 Solutions of TL

8.​3.​1 Instance-Based Method

reweights samples from source domains

8.​3.​2 Feature-Based Method

  • works for both heterogeneous and homogeneous TL problems
  • Asymmetric feature transformation aims to modify the source domain and reduce the gap between source and target instances by transforming one of the source and target domains to the other
  • Symmetric feature transformation aims to transform source and target domains into their shared feature space
8.3.3 Parameter-Based Method

  • transfers learned knowledge by sharing parameters common to the models of source and target learners
  • trained model is transferred from source domain to target domain with parameters
  • This approach can train more than one model on the source data and combine parameters learned from all models to improve results of the target learner.

8.3.4 Relational-Based Method

  • transfers learned knowledge by sharing its learned relations between different sample parts of source and target domains

8.4 Recurrent Neural Network (RNN)

8.4.1 What Is RNN?

An RNN has memory: its output is influenced by prior elements of the sequence, unlike a traditional feedforward neural network (FNN).

8.4.2 Motivation of the RNN

To "feel under the weather" means to be unwell. This phrase makes sense only when it is expressed in that specific order.

There are five major categories of RNN architecture corresponding to different tasks:
  1. simple one-to-one model for image classification tasks
  2. one-to-many model for image captioning tasks
  3. many-to-one model for sentiment analysis tasks
  4. many-to-many models for machine translation
  5. complex many-to-many models for video classification tasks

8.4.3 RNN Architecture

A significant difference between an RNN and a traditional neural network is that the weights and biases U, W, and V are shared across layers (time steps).

A partly recurrent network is a layered network with distinct input and output layers, where recurrence is limited to the hidden layer. A fully connected recurrent neural network (FRNN) connects all neurons' outputs to inputs.

8.4.4 Long Short-Term Memory (LSTM) Network

  • LSTM is a type of RNN with special hidden layers that deal with gradient explosion and vanishing-gradient problems during long-sequence training
  • better performance than naïve RNNs when training on longer sequences
  • same two hidden layers as the RNN, but a memory cell in the layer replaces the hidden node
  • LSTM has forget, memory-select, and output stages

8.4.5 Gate Recurrent Unit (GRU)

A GRU can be considered a kind of RNN, like the LSTM, designed to manage the backpropagation gradient problem.

8.4.6 Bidirectional Recurrent Neural Networks (BRNNs)

A BRNN runs RNN layers in two directions, so inference can use both previous and subsequent context, whereas a plain RNN or LSTM only carries information forward. A BRNN consists of two RNNs superimposed on each other, and the output is generated jointly from the two RNNs' states.

8.5 Transformer Technology

8.5.1 What Is Transformer?

The Transformer is a network architecture based on the attention mechanism, without relying on recurrent or convolutional units.

8.5.2 Transformer Architecture

A transformer model has two parts: (1) encoder and (2) decoder. The encoder takes the language sequence as input and maps it into a hidden representation; the decoder maps the hidden representation back into an output sequence.

8.5.3 Deep into Encoder

8.6 BERT

8.6.1 What Is BERT?

BERT (Bidirectional Encoder Representations from Transformers) is a pretrained model of language representation.

8.6.2 Architecture of BERT

Previous models were pretrained with either left-to-right or right-to-left language models; BERT is pretrained on both directions jointly.

8.6.3 Training of BERT

BERT has two training process steps: (1) pretraining and (2) fine-tuning.

8.7 Other Related Transformer Technology

8.7.1 Transformer-XL

8.7.2 ALBERT

References

10. Large Language Models (LLMs) and Generative Artificial Intelligence (GenAI)

10.1 Introduction to LLM and GenAI

10.1.1 What Is a Large Language Model (LLM)?

  • innovative machine learning models designed to learn from textual data; understand language patterns such as grammar, syntax, context, and semantics; and use sophisticated architectures to generate relevant, coherent, contextual text, translate languages, summarize content, and answer questions in NLP.
  • Transformers use self-attention mechanisms to weigh the importance of different words in a sentence regardless of their positions, overcoming RNNs' and LSTMs' limitations.

10.1.2 Understanding Generative Artificial Intelligence (GenAI)

  • GenAI- AI systems that can generate new content, whether text, images, music, or other forms of media, based on patterns learnt from vast datasets.
  • A Generative Adversarial Network (GAN) consists of two neural networks: (1) a generator to create new data instances and (2) a discriminator to evaluate them against real-world data.

10.1.3 The Intersection of LLM and GenAI

10.1.4 The Importance of LLMs in Modern AI

  1. Applications Across Industries: LLMs can automate complex tasks that previously required human expertise.
  2. Conversational AI and Customer Service
  3. Enhancing Creativity and Content Generation
  4. Multilingual and Cross-Cultural Communication
  5. The Future of Human-Machine Interaction

10.2 Foundations of LLMs

10.2.1 Neural Network Architectures

  • Multi-Layer Perceptron (MLP) is one of the earliest neural networks
  • RNNs
  • LSTMs introduced memory cells to store information over longer time spans
  • GRUs (Gated Recurrent Units) simplified the LSTM by merging certain gates
  • Convolutional Neural Networks (CNNs) are primarily used in computer vision
10.2.2 Attention Mechanisms

  • Attention mechanisms have revolutionized NLP by allowing models to selectively focus on relevant parts of the input sequence when making predictions.
  • In traditional sequence-to-sequence models, the encoder compresses the entire input sequence into a fixed-size vector, which struggles with long sentences; attention instead lets the decoder "attend" to different parts of the input sequence at each decoding step, assigning a different weight to each input token.
  • Self-attention lets each token in the sequence attend to every other token in the same sequence.
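
The attention idea can be sketched in pure Python: scores between a query and each key become softmax weights over the values (a minimal single-query illustration, not the full multi-head machinery):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector:
    weights = softmax(q·k / sqrt(d)); output = weighted sum of values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    output = [sum(wgt * v[i] for wgt, v in zip(weights, values))
              for i in range(len(values[0]))]
    return output, weights

out, w = attention([1.0, 0.0],
                   [[1.0, 0.0], [0.0, 1.0]],     # keys
                   [[1.0, 2.0], [3.0, 4.0]])     # values
print(w)  # the first key matches the query better, so it gets more weight
```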
10.2.3 The Transformer Architecture

  • The Transformer architecture signified a fundamental shift in LLM design by building on self-attention and discarding RNNs' sequential processing.
  • Its core component is the multi-head self-attention mechanism, which focuses on different parts of the input sequence simultaneously.
  • The outputs of all attention heads are concatenated and passed through a feed-forward network to generate the final output.
    10.2.4 Scaling Up: From BERT to GPT

  • BERT is a Transformer-based model to understand the contextual relationships between words in a sentence.
  • BERT proposed bidirectional training to learn from both directions simultaneously.
  • This approach allowed BERT to capture abundant contextual information and improve performance on a wide range of NLP tasks
  • training process consists of two major steps: pretraining and fine-tuning.
  • In pretraining, the model is trained on a large corpus using two unsupervised tasks—masked language modeling (MLM) and next sentence prediction (NSP)
  • In MLM, random words in a sentence are masked, and the model is tasked with predicting the missing words.
  • NSP trains the model to understand relationships between sentences.
  • GPT is a unidirectional model that processes text from left to right
  • GPT is trained with an auto-regressive approach: predict the next word in a sequence based on the previous words
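The auto-regressive objective (predict the next word from the previous words) is the same idea as the book's count-based bigram language model; GPT just replaces the counts with a Transformer. A toy count-based sketch, with an assumed corpus for illustration:

```python
from collections import Counter, defaultdict

# Toy corpus (assumed for illustration)
corpus = "I am good . he is good . we are here .".split()

# MLE bigram counts: P(w_n | w_{n-1}) = count(w_{n-1}, w_n) / count(w_{n-1})
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def predict_next(word):
    """Most likely next word after `word`, with its MLE probability."""
    counts = bigram_counts[word]
    best = max(counts, key=counts.get)
    return best, counts[best] / sum(counts.values())
```

Here `predict_next("am")` returns `("good", 1.0)`, since "good" is the only word that follows "am" in the toy corpus.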
  • Larger models not only improve performance on language tasks but also exhibit emergent capabilities that are absent in smaller models.
    10.3 Key Players in the LLM Landscape

    10.​3.​1 ChatGPT by OpenAI (Current Version:​ GPT-4)

  • A breakthrough in LLMs and NLP. These models are designed to generate human-like text by leveraging a Transformer-based architecture and vast datasets. Each generation from GPT-1 to the recent GPT-4 has demonstrated increasing complexity, performance, and applicability.
  • Pretrained on a diverse corpus of internet data including articles, books, websites, and other publicly available content.

    10.3.2 Pathways Language Model (PaLM) by Google DeepMind (Current Version: PaLM 2)

    superseded by Gemini

    10.​3.​3 Large Language Model Meta AI (LLaMA) by Meta (Current Version:​ LLaMA 2)

  • One of its motivations is to develop a model that can match or exceed GPT-3's performance at a lower computational cost.
  • With larger datasets and well-tuned training strategies, comparable performance can be achieved with fewer parameters.

    10.3.4 Claude by Anthropic (Current Version: Claude 2)

  • LLM designed for safety, alignment, and usability
  • prioritizes ethical considerations
  • Constitutional AI Framework: This approach embeds ethical constraints into the model’s learning process. Claude is trained using a set of guiding principles (a “constitution”) to evaluate and correct responses without extensive post-deployment modification.
    10.3.5 ERNIE 3.0 Titan by Baidu

  • trained in Chinese and English
  • combines auto-regressive (like GPT) and auto-encoding (like BERT) architectures

    10.4 Applications of LLMs in GenAI

    10.​4.​1 Creative Writing and Content Generation

    10.​4.​2 Language Translation

    10.​4.​3 Conversational AI and Chatbots

    10.​4.​4 Text Summarization and Content Curation

    10.​5 Ethical Considerations and Challenges

    10.​5.​1 Detecting and Mitigating Bias

  • begins with conscientious curation of training data.
  • algorithmic fairness techniques such as adversarial training, where a secondary model is used to detect and correct biased outputs from the primary model
  • Human-in-the-loop systems
    10.5.2 Privacy and Data Security

    10.​5.​3 The Spread of Misinformation

    Verifying information generated by black-box LLMs is particularly challenging because they do not provide sources for their outputs.

    10.​5.​4 Ethical Guidelines for LLM Deployment

    10.​6 Future Outlook and Research

    10.​6.​1 Current Trends in LLMs and GenAI

  • Multimodal models are designed to integrate text, images, audio, and video, progressing beyond purely textual data.
  • LLMs have advanced significantly from GPT-2 with 1.5 billion parameters to GPT-4 with hundreds of billions of parameters.
  • Specialized LLMs are domain-specific models that can be fine-tuned for industries such as medicine, law, or finance.
  • Few-shot and zero-shot learning refer to LLMs’ capabilities to perform tasks with minimal or no task-specific data.
    10.6.2 The Future of Creativity in AI

    10.​6.​3 The Role of LLMs in AI Ethics

    10.​6.​4 The Path Forward:​ Research and Development

    Errata

  • 10.1.1- It generates human-like text ranging from essays composition to code snippets creation with over 175 billion parameters to capture intricate linguistic nuances than lesser models’ endeavors.
  • Fig 10.1- What does the vertical axis mean?
  • 10.2.1-
    • Neural networks are the foundation of machine learning and NLP in LLMs development to mimic the cognitive functions of the human brain.
    • Multi-Layer Perceptron (MLP) is one of the earliest neural networks restrained by capturing temporal dependencies in data for language tasks.
    • Convolutional Neural Networks (CNNs) primarily used in computer vision tasks have accessed to NLP through sentence classification and character-level modeling applications
    14 Workshop #4: Semantic Analysis and Word Vectors Using spaCy

    14.​1 Introduction

    14.​2 What Are Word Vectors?​

    A word vector is a dense representation of a word.

    14.​3 Understanding Word Vectors

    Also called word embeddings; word2vec is a popular algorithm for learning them.

    14.​3.​1 Example:​ A Simple Word Vector

    14.4 A Taste of Word Vectors

    14.5 Analogies and Vector Operations

    14.6 How to Create Word Vectors?​

    14.7 spaCy Pretrained Word Vectors

    14.8 Similarity Method in Semantic Analysis

    14.9 Advanced Semantic Similarity Methods with spaCy

    14.9.1 Understanding Semantic Similarity

    14.9.2 Euclidean Distance

  • Distance between two points in space.
  • d = √((x₁ − x₂)² + (y₁ − y₂)²)
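The formula above generalizes to vectors of any dimension; a minimal sketch (function name is mine, not spaCy's):

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# 3-4-5 right triangle: the points (0, 0) and (3, 4) are 5 apart
d = euclidean_distance((0, 0), (3, 4))
```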
    14.9.3 Cosine Distance and Cosine Similarity

  • Cosine distance is more concerned with the orientation (angles) of two vectors in the space; cosine distance = 1 − cosine similarity.
  • Cosine similarity in general ranges from −1 to 1; for the non-negative vectors common in NLP, the score ranges from 0 to 1.
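Cosine similarity is the dot product of the two vectors divided by the product of their magnitudes; a minimal sketch (function names are mine):

```python
import math

def cosine_similarity(a, b):
    """cos(angle) between vectors a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)
```

Parallel vectors score 1 regardless of length (e.g., `(1, 0)` vs. `(2, 0)`), while orthogonal vectors score 0, which is why cosine similarity captures orientation rather than magnitude.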
    14.9.4 Categorizing Text with Semantic Similarity

    14.9.5 Extracting Key Phrases

    A noun phrase (NP) is a group of words consisting of a noun and its modifiers. Modifiers are usually pronouns, adjectives, and determiners.
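In spaCy, NPs come from iterating over a parsed `Doc`'s `noun_chunks`. Since that requires a downloaded model, here is a toy rule-based stand-in over hand-tagged `(word, POS)` pairs that follows the definition above (the chunking rule and tag set are simplified assumptions, not spaCy's actual algorithm):

```python
def noun_phrases(tagged):
    """Collect maximal runs of modifiers (DET/ADJ/PRON) ending in a NOUN."""
    phrases, current = [], []
    for word, pos in tagged:
        if pos in {"DET", "ADJ", "PRON"}:
            current.append(word)        # accumulate modifiers
        elif pos == "NOUN":
            current.append(word)        # noun closes the phrase
            phrases.append(" ".join(current))
            current = []
        else:
            current = []                # any other tag breaks the run
    return phrases

tagged = [("The", "DET"), ("quick", "ADJ"), ("fox", "NOUN"),
          ("jumps", "VERB"), ("over", "ADP"),
          ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN")]
nps = noun_phrases(tagged)   # ["The quick fox", "the lazy dog"]
```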

    14.9.6 Extracting and Comparing Named Entities

    Errata

    15 Workshop #5:​ Sentiment Analysis and Text Classification (Hour 9–10)

    15.​1 Introduction

    NLTK and spaCy are two major NLP Python implementation tools for basic text processing, N-gram modeling, POS tagging, and semantic analysis.

    1. Study text classification concepts in NLP and how spaCy NLP pipeline works on text classifier training.
    2. Use movie reviews as a problem domain to demonstrate how to implement sentiment analysis with spaCy.
    3. Introduce Artificial Neural Network (ANN) concepts and the TensorFlow and Keras technologies.
    4. Introduce a sequential modeling scheme with LSTM technology, using the movie reviews domain as an example that integrates these technologies for text classification and movie sentiment analysis.

    15.​2 Text Classification with spaCy and LSTM Technology

    TextCategorizer is spaCy's text classifier component. It is applied to a dataset for sentiment analysis, performing text classification with two vital Python frameworks: (1) the TensorFlow Keras API and (2) spaCy.

    15.​3 Technical Requirements

    15.​4 Text Classification in a Nutshell

    15.​4.​1 What Is Text Classification?​

    Text Classification is the task of assigning a set of predefined labels to text.

    • Language detection is the first step in many NLP systems, e.g., machine translation.
    • Topic generation and detection is the process of summarizing, or classifying, a batch of sentences into topics of interest (TOI) or topic titles.
    • Sentiment analysis classifies or analyzes users' responses, comments, and messages on a particular topic as positive, neutral, or negative.

    15.​4.​2 Text Classification as AI Applications

    Text classification is considered a supervised-learning (SL) task

    1. Binary text classification refers to categorizing text into two classes.
    2. Multi-class text classification refers to categorizing texts with more than two classes.
    3. Multi-label text classification generalizes its multi-class counterpart: more than one label can be assigned to each example text (e.g., toxic, severe toxic, threat)

    15.​5 Text Classifier with spaCy NLP Pipeline

    TextCategorizer (tCategorizer) is spaCy's text classifier component. It requires class labels and examples in the NLP pipeline to perform the training procedure.

    15.​5.​1 TextCategorizer Class

    TextCategorizer consists of (1) single-label and (2) multi-label classifiers.

    15.​5.​2 Formatting Training Data for the TextCategorizer
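spaCy's TextCategorizer expects each training text paired with a dict of category scores under the `"cats"` key (1.0 means the label applies, 0.0 means it does not). A minimal binary-sentiment example (the label names `POS`/`NEG` and the texts are illustrative; the exact call for feeding these into the pipeline differs between spaCy v2 and v3):

```python
# Training examples for a binary sentiment TextCategorizer: each text is
# paired with a dict whose "cats" entry scores every label.
train_data = [
    ("This movie was wonderful.", {"cats": {"POS": 1.0, "NEG": 0.0}}),
    ("A dull, predictable film.", {"cats": {"POS": 0.0, "NEG": 1.0}}),
]

# Every example must score all labels, even the ones that do not apply.
labels = sorted(train_data[0][1]["cats"])
```

For multi-label classification the format is the same, except several labels may be 1.0 at once.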

    15.​5.​3 System Training

    15.​5.​4 System Testing

    15.​5.​5 Training TextCategorizer for Multi-Label Classification

    15.​6 Sentiment Analysis with spaCy

    15.​6.​1 IMDB Large Movie Review Dataset

    15.​6.​2 Explore the Dataset

    15.6.3 Training the TextClassifier

    15.​7 Artificial Neural Network in a Nutshell

    In this workshop section, we will learn how to incorporate spaCy technology with ANN technology using TensorFlow and its Keras package.

    A typical ANN has
    1. Input layer consists of input neurons, or nodes
    2. Hidden layer consists of hidden neurons, or nodes
    3. Output layer consists of output neurons, or nodes

    15.​8 An Overview of TensorFlow and Keras

  • TensorFlow is a popular Python tool widely used for machine learning.
  • Keras is a Python-based deep learning tool that can be integrated with platforms such as TensorFlow, Theano, and CNTK.

    15.9 Sequential Modeling with LSTM Technology

    Keras has extensive support for RNN variants such as GRU and LSTM, and a simple API for training RNNs.

    15.​10 Keras Tokenizer in NLP

    Tokens are vectorized by the following steps:
    1. Tokenize each utterance, turning the utterances into sequences of tokens.
    2. Build a vocabulary from the set of tokens produced in Step 1. These are the tokens to be recognized by the neural network.
    3. Assign an ID to each token in the vocabulary.
    4. Map token vectors to their corresponding token-IDs.
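In Keras these steps are handled by the `Tokenizer` class (`fit_on_texts` builds the vocabulary, `texts_to_sequences` maps utterances to token-IDs). A minimal pure-Python stand-in that mirrors that behavior, so the steps can be run without TensorFlow installed:

```python
class ToyTokenizer:
    """Minimal stand-in for Keras's Tokenizer: builds a vocabulary and maps
    each utterance to a sequence of token-IDs (IDs start at 1, so that 0
    can be reserved for padding)."""
    def __init__(self):
        self.word_index = {}          # token -> token-ID

    def fit_on_texts(self, texts):
        for text in texts:
            for token in text.lower().split():
                if token not in self.word_index:
                    self.word_index[token] = len(self.word_index) + 1

    def texts_to_sequences(self, texts):
        return [[self.word_index[t] for t in text.lower().split()
                 if t in self.word_index] for text in texts]

tok = ToyTokenizer()
tok.fit_on_texts(["great movie", "terrible movie"])
seqs = tok.texts_to_sequences(["great movie"])   # [[1, 2]]
```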

    15.​10.​1 Embedding Words

    Tokens can be transformed into token vectors. Embedding tokens into vectors occurs via an embedding lookup table.
    1. Each utterance is divided into tokens, and a vocabulary is built with Keras' Tokenizer.
    2. The Tokenizer object holds a token index with a token → token-ID mapping.
    3. Given a token-ID, look up the embedding-table row for that ID to acquire the token vector.
    4. This token vector is fed to the neural network.
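The lookup in Step 3 is just row indexing into a matrix with one row per token-ID; a NumPy sketch (vocabulary size, embedding dimension, and the random table are illustrative — in a real model the table is a trained Embedding layer):

```python
import numpy as np

vocab_size, embed_dim = 5, 3
rng = np.random.default_rng(2)
embedding_table = rng.normal(size=(vocab_size, embed_dim))  # one row per token-ID

token_ids = np.array([1, 4, 2])             # an utterance as token-IDs
token_vectors = embedding_table[token_ids]  # row lookup = the embedding step
```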

    15.11 Movie Sentiment Analysis with LSTM Using Keras and spaCy

    This section demonstrates the design of an LSTM-based RNN text classifier for sentiment analysis, with the steps below:
    1. Data retrieval and preprocessing.
    2. Tokenize review utterances, with padding.
    3. Create the padded sequences of utterances and feed them into the input layer.
    4. Vectorize each token via its token-ID in the embedding layer.
    5. Input the token vectors into the LSTM.
    6. Train the LSTM network.
  • Step 1: Dataset
  • Step 2: Data and vocabulary preparation
  • Step 3: Implement the Input Layer
  • Step 4: Implement the Embedding Layer
  • Step 5:​ Implement the LSTM Layer
  • Step 6:​ Implement the Output Layer
  • Step 7:​ System Compilation
  • Step 8: Model Fitting and Experiment Evaluation
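The padding in Steps 2–3 makes every review the same length so a batch can enter the input layer as one rectangular array. Keras provides `pad_sequences` for this; a minimal pure-Python stand-in that mirrors its default pre-padding/pre-truncating behavior (an assumption about the defaults worth checking against the Keras docs):

```python
def pad_sequences(seqs, maxlen, value=0):
    """Left-pad (or left-truncate) each token-ID sequence to exactly maxlen,
    mirroring the default 'pre' behavior of Keras pad_sequences."""
    padded = []
    for seq in seqs:
        seq = list(seq)[-maxlen:]                    # keep the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

batch = pad_sequences([[5, 8], [3, 1, 4, 1, 5, 9]], maxlen=4)
# short review padded with 0s, long review truncated from the front
```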