Definitions
| Bigram | 2 words. N=2. I am. He is. We are. Used frequently. |
|---|---|
| Bigram normalization | division of each bigram count by the corresponding unigram count for wn-1 |
| Corpus | a collection of written texts |
| Language model | traditional word-counting model that counts word sequences and uses conditional probability to predict the next word |
| Lemma | Basic word shared by word forms. In the dictionary. |
| Lemmatization | Get the lemma. Slow. Am, is, are, was, were → Be |
| Morpheme | Smallest meaningful component of a word. |
| Morphology | Study of words. |
| N-gram | statistical model consisting of a sequence of N words. |
| Quadrigram | 4 words. N=4. |
| Sentence | unit of written language; domain specific |
| Stem | Root of a word. Might not be in the dictionary. |
| Stemming | Get the word's root. Fast. Student, studious, study → Stud |
| Tokens | can be meaningful words or symbols, punctuations, or distinct characters |
| Tuple | Ordered, immutable collection of elements. |
| Trigram | 3 words. N=3. I am good. See spot run. We go there. Used rarely. |
| Types/Word Types | distinct words in a corpus |
| Unigram | 1 word. N=1. Seldom used. |
| Utterance | unit of spoken language; domain and culture specific |
| Word Form | another basic entity in a corpus |
Abbreviations
| ASR | Automatic Speech Recognition |
|---|---|
| ATN | Augmented Transition Network |
| BERT | Bidirectional Encoder Representations from Transformers |
| CDD | Conceptual Dependency Diagram |
| CFG | Context Free Grammar |
| CFL | Context Free Language |
| CNF | Chomsky Normal Form |
| CNN | Convolutional Neural Networks |
| CR | Coreference Resolution |
| DT | Determiner |
| FOPC | First Order Predicate Calculus |
| GPT | Generative Pretrained Transformer |
| GRU | Gated Recurrent Unit |
| HMM | Hidden Markov Model |
| IE | Information Extraction |
| IR | Information Retrieval |
| LLM | Large Language Model |
| LME | Language Model Evaluation |
| LSTM | Long Short Term Memory |
| MEMM | Maximum Entropy Markov Model |
| MLE | Maximum Likelihood Estimation |
| MeSH | Medical Subject Headings |
| NER | Named Entity Recognition |
| NLTK | Natural Language Toolkit |
| NLU | Natural Language Understanding |
| NN | Noun |
| NNP | Proper Noun |
| Nom | Nominal |
| NP | Noun Phrase |
| PCFG | Probabilistic Context Free Grammar |
| PMI | Point-Wise Mutual Information |
| POS | Part Of Speech |
| POST | Part Of Speech Tagging |
| PPMI | Positive Point-wise Mutual Information |
| RLHF | Reinforcement Learning from Human Feedback |
| RNN | Recurrent Neural Network |
| TBL | Transformation Based Learning |
| VB | Verb |
| VP | Verb Phrase |
| WSD | Word Sense Disambiguation |
1. Natural Language Processing
1.1 Introduction
Modern chatbots represent human-computer interaction and require world knowledge and domain knowledge.
Knowledge is organized into knowledge trees or ontology graphs.
NLP is cross-disciplinary integration of philosophy, psychology, linguistics, and computational linguistics.
computational linguistics - multidisciplinary study of epistemology, philosophy, psychology, cognitive science, and agent ontology.
1.2 Human Language and Intelligence
Core technologies and methodologies arose from the Turing Test.
Human language is categorized as written and oral.
1.3 Linguistic Levels of Human Language
Seven levels of linguistic analysis: phonetics, phonology, morphology, lexicology, syntax, semantics, and pragmatics.
1.4 Human Language Ambiguity
- Lexical ambiguity- words can have different meanings
- Syntactic- who is holding the bag?
- Semantic- what does "it" mean in a sentence?
- Pragmatic- what does the sentence mean?
1.5 A Brief History of NLP
1.5.1 First Stage: Machine Translation (Before the 1960s)
Leibniz and Descartes codified relationships between words and sentences.
Alan Turing wrote Computing Machinery and Intelligence in 1950 which proposed the Turing test.
The Georgetown-IBM experiment translated over 60 Russian sentences in 1954.
Chomsky wrote Syntactic Structures in 1957.
The ALPAC report threw cold water on NLP research in 1966.
1.5.2 Second Stage: Early AI on NLP (1960s-1970s)
The Baseball Q&A program was developed in 1961.
Marvin Minsky wrote Semantic Information Processing in 1968.
William Woods developed augmented transition networks (ATN) in 1970.
1.5.3 Third Stage: Grammatical Logic on NLP (1970s-1980s)
1.5.4 Fourth Stage: AI and Machine Learning (1980s-2000s)
IBM began developing Watson, a DeepQA program which could compete on Jeopardy.
1.5.5 Fifth Stage: Rise of BERT, Transformer, ChatGPT, and LLMs (2000s-Present)
Long short-term memory (LSTM) recurrent neural networks (RNN) became dominant.
Google Brain published Attention Is All You Need which introduced the transformer architecture.
Google introduced BERT.
OpenAI developed ChatGPT which utilizes generative pre-trained transformers.
1.6 NLP and AI
1.7 Main Components of NLP
NLP consists of: Natural Language Understanding (NLU), Knowledge Acquisition and Inferencing (KAI), and Natural Language Generation (NLG).
1.8 Natural Language Understanding (NLU)
NLU is a process of understanding spoken language in four stages: speech recognition, and syntactic, semantic, and pragmatic analysis.
1.8.1 Speech Recognition
Speech recognition is the first stage in NLU that performs phonetic, phonological, and morphological processing to analyze spoken language.
Breaks spoken utterances down into distinct tokens representing paragraphs, sentences, and words.
Current speech recognition models use spectrogram analysis.
1.8.2 Syntax Analysis
Reject phrases like, "I you love."
1.8.3 Semantic Analysis
Reject nonsense like, "hot snowflakes".
1.8.4 Pragmatic Analysis
Requires expert knowledge or just common sense.
1.9 Potential Applications of NLP
NLP is used for translation, information extraction (IE), information retrieval (IR), sentiment analysis, and chatbots.
1.9.1 Machine Translation (MT)
Earliest NLP application.
Although it is not difficult to translate one language to another...(?!)
1.9.2 Information Extraction (IE)
1.9.3 Information Retrieval (IR)
1.9.4 Sentiment Analysis
1.9.5 Question-Answering (Q&A) Chatbots
Errata
- 1.5- No mention of Eliza.
- 1.5.2- Further, Professor William A. Woods proposed an augmented translation network (ATN) to represent natural language input in 1970. (Woods proposed augmented transition networks.)
- 1.5.2- Minsky didn't develop an NLP system in 1968. He wrote a book (important).
- 1.5.4- Hopfield networks did not have a significant impact on NLP.
2. N-Gram Language Model
2.1 Introduction
- NLP entities like word-to-word tokenization using NLTK and spaCy technologies in Workshop 1 (Chap. 11) analyzed words in isolation
- In many NLP applications, noise and disruptions can lead to incorrect word pronunciation
- Humans can be confused by spelling errors
- Word prediction can provide spell checking and model word relationships
- Probability or word counting method can work on a large databank called a corpus
2.2 N-Gram Language Model
2.2.1 Basic NLP Terminology
| Sentence | unit of written language; domain specific |
|---|---|
| Utterance | unit of spoken language; domain and culture specific |
| Word Form | another basic entity in a corpus |
| Types/Word Types | distinct words in a corpus |
| Tokens | can be meaningful words or symbols, punctuations, or distinct characters |
| Stem | Root of a word. Might not be in the dictionary. |
| Stemming | Get the words root. Fast. Student, studious, study → Stud |
| Lemma | Basic word shared by word forms. In the dictionary. |
| Lemmatization | Get the lemma. Slow. Am, is, are, was, were → Be |
| Corpus scale | Over a trillion English tokens; over a million meaningful wordform types are sufficient to generate sentences/utterances for daily use |
| Brown Corpus | First well-organized corpus. Brown University. 1961. About 1 million tokens across 500 samples, including foreign words. |
| Wall Street Journal | Financial domain |
| Associated Press | International news |
| Hansard | British parliamentary speeches |
| BU Broadcast News Corpus | |
| NLTK Corpus | |
A language model is a traditional word-counting model that counts word sequences and uses conditional probability to predict the next word.
2.2.2 Language Modeling and Chain Rule
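The chain rule this section builds on can be written out explicitly; a sketch of the standard decomposition and the bigram (Markov) approximation that makes it tractable:

```latex
P(w_1 w_2 \ldots w_n) = \prod_{k=1}^{n} P(w_k \mid w_1 \ldots w_{k-1})
\approx \prod_{k=1}^{n} P(w_k \mid w_{k-1}) \qquad \text{(bigram approximation)}
```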
2.3 Markov Chain in N-Gram Model
A Markov chain is a process that describes a sequence of possible events where the probability of each event depends only on the previous event.
Entire sentences can be modelled as Markov chains.
2.4 Example: The Adventures of Sherlock Holmes
Maximum Likelihood Estimates (MLE) is another method to calculate the N-gram model.
Bigram normalization is the division of bigram counts by unigram counts for wn-1.
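The MLE and normalization steps above can be sketched in a few lines of Python; the toy corpus and function name are illustrative, not from the book:

```python
from collections import Counter

def bigram_mle(sentences):
    """Estimate P(wn | wn-1) by dividing each bigram count by the unigram count of wn-1."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])             # denominators: counts of wn-1 positions
        bigrams.update(zip(tokens, tokens[1:]))
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

probs = bigram_mle(["I am Sam", "Sam I am", "I do not like green eggs and ham"])
print(probs[("i", "am")])  # C(i, am) / C(i) = 2/3
```

Note that the probabilities of all bigrams sharing the same first word sum to 1, which is exactly what the normalization guarantees.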
2.5 Shannon’s Method in N-Gram Model
- Choose a random N-gram (<s>, w) according to its probability
- Now choose a random N-gram (w, x) according to its probability
- And so on until we choose </s>
- String the words together into a sentence
Constructing sentences from trigrams seems to be most suitable.
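The four steps above can be sketched directly; sampling from a list of observed successors (with duplicates kept) is equivalent to choosing each bigram according to its probability. The corpus and names are made up for illustration:

```python
import random
from collections import defaultdict

def shannon_generate(sentences, seed=0):
    """Build a sentence by repeatedly sampling the next word, starting from <s>."""
    followers = defaultdict(list)               # w -> observed next words (with repeats)
    for s in sentences:
        tokens = ["<s>"] + s.lower().split() + ["</s>"]
        for w, x in zip(tokens, tokens[1:]):
            followers[w].append(x)
    rng = random.Random(seed)
    word, out = "<s>", []
    while True:
        word = rng.choice(followers[word])      # choose a random bigram (w, x) by its probability
        if word == "</s>":
            return " ".join(out)                # string the words together into a sentence
        out.append(word)

print(shannon_generate(["I am Sam", "Sam I am", "I like ham"]))
```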
2.6 Language Model Evaluation and Smoothing Techniques
Language Model Evaluation (LME) is the standard method of training parameters on a training set and constantly reviewing model performance on new data.
In practice, the model learns from training data (the training set) and is then evaluated on unseen data (the test set) to see whether it generalizes.
Can deal with unknown words.
2.6.1 Perplexity
Perplexity (PP) is the probability of the test set assigned by the language model, normalized by the number of words.
Minimizing perplexity is equivalent to maximizing the probability of the test set.
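A sketch of that definition in Python, computed in log space for numerical stability; the toy model below is invented so the arithmetic is checkable by hand:

```python
import math

def perplexity(tokens, prob):
    """PP(W) = P(w1..wN)^(-1/N): inverse test-set probability, normalized by word count."""
    n = len(tokens) - 1                                # number of bigram transitions
    log_p = sum(math.log(prob[bg]) for bg in zip(tokens, tokens[1:]))
    return math.exp(-log_p / n)

# If every transition has probability 1/4, the model is "4 ways perplexed" at each step.
tokens = ["<s>", "a", "b", "</s>"]
prob = {bg: 0.25 for bg in zip(tokens, tokens[1:])}
print(perplexity(tokens, prob))  # ≈ 4.0
```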
2.6.2 Extrinsic Evaluation Scheme
Simply compare two models’ application performance regarding N-gram evaluation.
Time consuming
2.6.3 Zero Counts Problems
Many possible bigrams never show up in training data.
Zipf's Law- in a large corpus, the frequency of any word is inversely proportional to its rank, where the most frequent word occurs twice as often as the second, and thrice as often as the third.
2.6.4 Smoothing Techniques
2.6.5 Laplace (Add-One) Smoothing
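A minimal sketch of add-one smoothing, assuming the standard formula P(wn|wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V), where V is the vocabulary size; the counts below are invented:

```python
from collections import Counter

def laplace_bigram(unigrams, bigrams, prev, word):
    """Add-one smoothing: every bigram, seen or unseen, gets a nonzero probability."""
    V = len(unigrams)                                   # vocabulary size
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

unigrams = Counter({"i": 3, "am": 2, "sam": 2})         # V = 3
bigrams = Counter({("i", "am"): 2, ("am", "sam"): 1})
print(laplace_bigram(unigrams, bigrams, "i", "am"))     # (2+1)/(3+3) = 0.5
print(laplace_bigram(unigrams, bigrams, "sam", "i"))    # unseen bigram: (0+1)/(2+3) = 0.2
```

This directly addresses the zero-count problem above: the unseen bigram ("sam", "i") gets probability 0.2 instead of 0.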
2.6.6 Add-k Smoothing
2.6.7 Backoff and Interpolation Smoothing
2.6.8 Good Turing Smoothing
Errata
- 2.1- NLP entities like word-to-word tokenization using NTLK and spaCy technologies in Workshop 1 (Chap. 11) analyzed words in insolation, but the relationship between words is important in NLP.
- 2.1- When applying probability to word prediction in an utterance, there are words often proposed by rank and frequency to provide a sequential optimum estimation.
- 2.2- It was learned that the motivations for word prediction can apply to voice recognition, text generation, and Q&A chatbot.
- 2.2.1- A language model is a traditional word counting model to count and calculate conditional probability to predict the probability based on a word sequence...
- 2.6.3- Further, Zipf's law states that a long tail phenomenon is a rare event that occurs in a very high frequency, and large events numbers occur in a low frequency constantly.
- Author doesn't explain the difference between the conditional probability formula and the chain rule.
3. Part-of-Speech (POS) Tagging
3.1 What Is Part of Speech (POS)?
- Category of words that have similar grammatical behaviors or properties.
- Inflection- items are added to the base form of a word to convey grammatical meanings (e.g., cat and cats)
3.1.1 Nine Major POS in the English Language
- adjectives
- verbs
- pronouns
- conjunctions
- prepositions
- articles
- adverbs
- nouns
- interjections
3.2 POS Tagging
3.2.1 What Is POS Tagging in Linguistics?
Labelling a word according to a particular POS based on definition and contexts in linguistics.
3.2.2 What Is POS Tagging in NLP?
Automatic description assignment to words or tokens.
The operation of converting a sentence into a list of words and then a list of tuples, where each tuple is a (word, tag) pair signifying the POS.
3.2.3 POS Tags Used in the PENN Treebank Project
POS tag databank provided by the PENN Treebank corpus
classifies nine major POS into subclasses that have a total of 45 POS tags
Penn Treebank (PTB) corpus has a comprehensive section of WSJ articles
3.2.4 Why Do We Care About POS in NLP?
3.3 Major Components in NLU
- Morphology- understandings of shapes and patterns for every word of a sentence.
- POS tagging
- Syntax
- Semantics
- Discourse integration- relationship between different sentences and their contents.
3.3.1 Computational Linguistics and POS
POS tagging can be considered as the fundamental process in computational linguistics
3.3.2 POS and Semantic Meaning
3.3.3 Morphological and Syntactic Definition of POS
3.4 Nine Key POS in English
- pronoun
- verb
- adjective (quick)
- adverb (quickly)
- interjection (Hey!)
- noun
- conjunction (and, or, but)
- preposition (in, on, at)
- article (a, the)
3.4.1 English Word Classes
Two types of English word classes: closed and open.
Closed-class words are also known as functional/grammar words. They are closed since new words are seldom created in the class. (conjunctions, determiners, pronouns, and prepositions)
Open-class words are also known as lexical/content words. They are open because new words are frequently added to the class. (nouns, verbs, adjectives, and adverbs)
3.4.2 What Is a Preposition?
used before nouns
approximately 80-100 prepositions in English
3.4.3 What Is a Conjunction?
Coordinating conjunctions join words, clauses, or phrases of equal grammatical rank. (for, and, nor, but, or, yet, so)
Subordinating conjunctions join independent and dependent clauses to present a relationship (as, although, because, since, though, while, whereas)
3.4.4 What Is a Pronoun?
3.4.5 What Is a Verb?
3.5 Different Types of POS Tagset
3.5.1 What Is Tagset?
A Tagset is a batch of POS tags (POST) to indicate the part of speech and sometimes other grammatical categories such as case and tense
3.5.2 Ambiguous in POS Tags
- Noun-verb ambiguity. (record)
- Adjective-verb ambiguity. (perfect)
- Adjective-noun ambiguity. (complex)
3.5.3 POS Tagging Using Knowledge
- dictionary
- morphological rules (capitalization, suffixes...)
- N-gram frequencies (next word prediction)
- structural relationships combination. (figure it out)
3.6 Approaches for POS Tagging
3 basic approaches to POS Tagging: Rule-based, Stochastic-based, and Hybrid Tagging
3.6.1 Rule-Based Approach POS Tagging
Two-stage process: (1) a dictionary lists all possible POS tags for each word; (2) words with more than one possible tag are disambiguated by applying handwritten grammatical rules based on the surrounding words.
rule generation can be achieved by (1) hand creation and (2) training from a corpus with machine learning.
3.6.2 Example of Rule-Based POS Tagging
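As a concrete illustration of the two-stage idea, here is a toy tagger; the lexicon, rules, and tag choices are invented for this sketch and are far simpler than a real rule-based tagger:

```python
# Stage 1: dictionary of possible tags. Stage 2: hand-written rules
# (left context, capitalization, suffix) resolve ambiguous or unknown words.
LEXICON = {"the": "DT", "a": "DT", "dog": "NN", "runs": "VBZ", "can": ["MD", "NN"]}

def rule_based_tag(tokens):
    tags = []
    for i, tok in enumerate(tokens):
        entry = LEXICON.get(tok.lower())
        if isinstance(entry, str):                 # unambiguous dictionary hit
            tags.append(entry)
        elif isinstance(entry, list):              # ambiguous: rule uses left context
            tags.append("NN" if i > 0 and tags[-1] == "DT" else entry[0])
        elif tok[0].isupper() and i > 0:           # sentence-internal capitalization rule
            tags.append("NNP")
        elif tok.endswith("ly"):                   # suffix rule
            tags.append("RB")
        else:
            tags.append("NN")                      # default
    return list(zip(tokens, tags))

print(rule_based_tag("The can runs quickly".split()))
```

Here "can" is tagged NN rather than MD because it follows a determiner, which is exactly the kind of surrounding-word rule the two-stage process describes.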
3.6.3 Example of Stochastic-Based POS Tagging
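A stochastic tagger in its simplest form picks, for each word, the tag with the highest corpus frequency (a unigram / most-frequent-tag baseline); the tiny tagged corpus here is invented:

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_corpus):
    """Pick the most frequent tag for each word: argmax over tags of C(word, tag)."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

corpus = [("the", "DT"), ("record", "NN"), ("they", "PRP"),
          ("record", "VB"), ("the", "DT"), ("record", "NN")]
tagger = train_unigram_tagger(corpus)
print(tagger["record"])  # "NN" wins 2-to-1 over "VB"
```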
3.6.4 Hybrid Approach for POS Tagging Using Brill’s Taggers
3.6.5 What Is Transformation-Based Learning?
3.6.6 Hybrid POS Tagging: Brill’s Tagger
3.6.7 Learning Brill’s Tagger Transformations
3.7 Taggers Evaluations
3.7.1 How Good Is a POS Tagging Algorithm?
Errata
- 3.2.2- Tuple is undefined.
- 3.3.1- As human language is natural and the most polytropic? means of communication...
- 3.5.2 Ambiguous? in POS Tags
- 3.6.1- It is a two-stage process: (1) dictionary consists of all possible POS tags for basic concepts of words as abovementioned and (2) words with more than single tag ambiguity applied handwritten or grammatic rules to assign the correct tag(s) according to surrounding words.
4. Syntax and Parsing
4.1 Introduction and Motivation
Grammatical (syntactic) rules are used to create well-formed sentences.
Syntax analysis is used to analyze the structure and the relationship between tokens to create a parse tree.
4.2 Syntax Analysis
4.2.1 What Is Syntax
Syntax is the set of rules that govern how groups of words are combined to form phrases, clauses, and sentences.
Syntax can be defined as the correct arrangement of word tokens in written or spoken sentences
4.2.2 Syntactic Rules
Syntactic rules describe how the parts of a language combine to make sense.
4.2.3 Common Syntactic Patterns
There are seven common syntactic patterns:
- Subject → Verb
- Subject → Verb → Direct Object
- Subject → Verb → Subject Complement
- Subject → Verb → Adverbial Complement
- Subject → Verb → Indirect Object → Direct Object
- Subject → Verb → Direct Object → Direct Complement
- Subject → Verb → Direct Object → Adverbial Complement
4.2.4 Importance of Syntax and Parsing in NLP
5 major components in Natural Language Understanding (NLU):
- Morphology
- POS Tagging
- Syntax
- Semantics
- Discourse Integration
4.3 Types of Constituents in Sentences
4.3.1 What Is Constituent?
A constituent is considered as the linguistic component of a language.
Words or phrases that combine into a sentence are constituents.
4.3.2 Kinds of Constituents
- noun-phrase
- verb-phrase
- preposition-phrase
4.3.2.1 Noun Phrase (NP)
consists of a noun and its modifiers
4.3.2.2 Verb Phrase (VP)
consists of a main verb accompanied by linking verbs or modifiers
4.3.3 Complexity on Simple Constituents
"Big red handbag" makes sense; "red big handbag" does not.
4.3.4 Verb Phrase Subcategorization
Traditional English grammar classifies verbs into transitive (object) and intransitive subcategories; modern English grammars identify more than 100 subcategories.
4.3.5 The Role of Lexicon in Parsing
A lexicon is the vocabulary of a language or a specific field of knowledge,
Linguists believe that all languages are composed of two major components: (1) lexicon and (2) grammar.
Items within a lexicon are called lexemes, and groups of lexemes are called lemmas, often used to describe the size of a lexicon.
Lexical analysis- understand what words mean and intuit contexts.
A program that performs such lexical analysis is called tokenizer, lexer, or scanner.
A lexer is generally combined with a parser to analyze the syntax of text.
4.3.6 Recursion in Grammar Rules
4.4 Context-Free Grammar (CFG)
4.4.1 What Is Context Free Language (CFL)?
Context-free language (CFL) is a superset of regular language (RL) and is generated by context-free grammar (CFG): every RL is a CFL, but not every CFL is an RL.
Four levels of human language:
- regular
- context-free
- context-sensitive
- recursively enumerable

Regular languages are a subset of context-free languages, context-free languages are a subset of context-sensitive languages, and context-sensitive languages are a subset of recursively enumerable languages.
Most arithmetic expressions generated by a Context-Free Grammar (CFG) are CFLs.
4.4.2 What Is Context Free Grammar (CFG)?
A CFG describes a CFL as a set of recursive rules for generating string patterns.
CFGs are commonly applied in linguistics and compiler design to describe programming languages and to generate parsers automatically.
4.4.3 Major Components of CFG
4 major components-
- A set of non-terminal symbols N
- A set of terminal symbols Σ
- A set of production rules P
- A designated start symbol S
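The four components can be encoded directly; this sketch (grammar encoding and helper invented for illustration) performs a leftmost derivation with a toy grammar, always choosing the first production:

```python
# Hypothetical encoding of the four CFG components: N, Sigma, P, and the start symbol.
GRAMMAR = {
    "N": {"S", "NP", "VP", "V", "DT", "NN"},      # non-terminal symbols
    "Sigma": {"play", "the", "piano"},             # terminal symbols
    "P": {                                         # production rules: LHS -> list of RHS options
        "S": [["VP"]],
        "VP": [["V", "NP"]],
        "NP": [["DT", "NN"]],
        "V": [["play"]],
        "DT": [["the"]],
        "NN": [["piano"]],
    },
    "start": "S",
}

def leftmost_derive(grammar, symbols=None):
    """Expand the leftmost non-terminal until only terminals remain."""
    if symbols is None:
        symbols = [grammar["start"]]
    for i, sym in enumerate(symbols):
        if sym in grammar["N"]:
            expansion = grammar["P"][sym][0]       # always take the first rule in this sketch
            return leftmost_derive(grammar, symbols[:i] + expansion + symbols[i + 1:])
    return symbols

print(" ".join(leftmost_derive(GRAMMAR)))  # play the piano
```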
4.4.4 Derivations Using CFG
4.5 CFG Parsing
3 CFG parsing levels: morphological, phonological, and syntactic
4.5.1 Morphological Parsing
Determines the morphemes of a word being constructed (cats is plural of cat).
4.5.2 Phonological Parsing
Phonological parsing interprets sounds as words and phrases for the parser.
4.5.3 Syntactic Parsing
identify relevant components and correct grammar of a sentence.
4.5.4 Parsing as a Kind of Tree Searching
4.5.5 CFG for Fragment of English
4.5.6 Parse Tree for “Play the Piano” for Prior CFG
4.5.7 Top-Down Parser
4.5.8 Bottom-Up Parser
4.5.9 Control of Parsing
4.5.10 Pros and Cons of Top-Down vs. Bottom-Up Parsing
4.6 Lexical and Probabilistic Parsing
4.6.1 Why Using Probabilities in Parsing?
4.6.2 Semantics with Parsing
4.6.3 What Is PCFG?
A probabilistic context-free grammar (PCFG) is a context-free grammar that associates each of its production rules with a probability.
4.6.4 A Simple Example of PCFG
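A minimal sketch of the idea (rules and probabilities invented): each production carries a probability, the probabilities for a given left-hand side sum to 1, and a parse's probability is the product of the probabilities of the rules used in its derivation:

```python
# Each rule is (LHS, RHS) -> probability; rules for the same LHS sum to 1.
PCFG = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("DT", "NN")): 0.7,
    ("NP", ("NNP",)): 0.3,
    ("VP", ("V", "NP")): 1.0,
}

def parse_probability(rules_used):
    """P(parse) = product of the probabilities of the rules in its derivation."""
    p = 1.0
    for rule in rules_used:
        p *= PCFG[rule]
    return p

# Derivation for a "Mary plays the piano"-shaped parse (structure illustrative):
derivation = [("S", ("NP", "VP")), ("NP", ("NNP",)),
              ("VP", ("V", "NP")), ("NP", ("DT", "NN"))]
print(parse_probability(derivation))  # 1.0 * 0.3 * 1.0 * 0.7 ≈ 0.21
```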
4.6.5 Using Probabilities for Language Modelling
4.6.6 Limitations for PCFG
4.6.7 The Fix–Lexicalized Parsing
Errata
- 4.2.2- Syntactic rules also described to assist language parts make sense.
- 4.3.6- The boy who come early today won the game.
- 4.5.2- Phonological parsing is to interpret sounds into words and phrases to generate parser.
5 Meaning Representation
5.1 Introduction
This chapter will focus on how to interpret meaning, introducing scientific and logical methods for processing meaning, known as meaning representation.
5.2 What Is Meaning?
Meaning refers to the message conveyed by words, phrases, and sentences or utterances within a given context.
Often referred to as lexical or semantic meaning.
Meanings of sentences are not simply the combination of words' meanings, but usually the phrasal words with specific meanings at the pragmatic level (off the wagon).
Semantic meaning is the study of meaning assignment to minimal meaning-bearing elements.
5.3 Meaning Representations
Unlike parse trees, meaning representations are not primarily a description of the input structure, but a representation of how humans understand and represent things (such as actions, events, objects, etc.).
5 types of meaning representation:
- categories- specific entities
- events
- time- specific moment
- aspects
- Stative- facts
- Activity
- Accomplishment?
- Achievement?
- beliefs, desires, and intentions
5.4 Semantic Processing
Semantic processing encodes and interprets meaning.
- Reason relations with the environment.
- Answer questions based on contents.
- Perform inference based on knowledge and determine the verity of unknown facts.
5.5 Common Meaning Representation
4 common methods of meaning representation.
5.5.1 First-Order Predicate Calculus (FOPC)
Also known as predicate logic.
Expresses the relationship between information objects as predicates.
5.5.2 Semantic Networks
Knowledge representation technique used for propositional information. A semantic net can be represented as a labeled directed graph.
5.5.3 Conceptual Dependency Diagram (CDD)
Theory that describes how sentence/utterance meaning is represented for reasoning.
5.5.4 Frame-Based Representation
Frames and notions as basic components to characterize domain knowledge.
A frame is a knowledge configuration to characterize a concept such as a car or driving a car. (like an object with properties)
5.6 Requirements for Meaning Representation
5.6.1 Verifiability
determining whether a sentence/utterance has a literal meaning
5.6.2 Ambiguity
word, statement, or phrase that has more than one meaning
5.6.3 Vagueness
borderline cases (short, tall, ...)
5.6.4 Canonical Forms
5.6.4.1 What Is Canonical Form?
Simplest expression. Instead of 0.000, just say 0.
5.6.4.2 Canonical Form in Meaning Representation
Instead of "Jack likes to eat candy all day.", just say "Jack eats candy."
5.6.4.3 Canonical Forms: Pros and Cons
Advantages- (1) simplify reasoning and storage operations; (2) no need to generate inference rules for all the different variations.
May complicate semantic analysis.
5.7 Inference
5.7.1 What Is Inference?
deduction and induction
Induction goes from specific to general; Deduction goes from general to specific.
Induction builds theories. Deduction tests them.
5.7.2 Example of Inferencing with FOPC
5.8 Fillmore’s Theory of Universal Cases
Case grammar is a linguistic system that focuses on the connection between the quantity such as the subject, object, or valence of a verb and the grammatical context.
Only a limited number of semantic roles (case roles) occur in every sentence constructed with verbs.
5.8.1 What Is Fillmore’s Theory of Universal Cases?
Each verb needs a certain number of case roles to form a case-frame
Thematic role (semantic role) refers to the case role that a noun phrase (NP) may play with respect to the action or state described by the main verb.
5.8.2 Major Case Roles in Fillmore’s Theory
- Agent—doer of action, attribute intention.
- Experiencer—doer of action without intention.
- Theme—thing that undergoes change or is being acted upon.
- Instrument—tool being used to perform the action.
- Beneficiary—person or thing for whose benefit the action is performed.
- To/At/From Loc/Poss/Time—to possess things, place, or time.
5.8.3 Complications in Case Roles
4 types of complications in case role analysis:
- Syntactic constituents’ ability to indicate semantic roles in several cases
- Syntactic expression option availability
- Prepositional ambiguity: the same preposition does not always introduce the same role
- Role options in a sentence
5.8.3.1 Selectional Restrictions
Selectional restrictions are methods to restrict types of certain roles to be used for semantic consideration.
5.9 First-Order Predicate Calculus
5.9.1 FOPC Representation Scheme
FOPC can be used as a framework to derive semantic representation of a sentence.
FOPC supports:
- Reasoning in truth condition analysis to respond yes or no questions.
- Variables in general cases through variable binding at responses and storage.
- Inference to respond beyond KB storage on new knowledge.
5.9.2 Major Elements of FOPC
- terms
- constants- specific object described in sentence
- functions- concepts expressed as genitives (e.g., owner), such as brand name or location
- variables- objects without a specific referent
- predicates- property of a subject; usually verbs
- connectives- conjunctions, implications(if...then), negations
- quantifiers
5.9.3 Predicate-Argument Structure of FOPC
5.9.4 Meaning Representation Problems in FOPC
5.9.5 Inferencing Using FOPC
Inference- validate or prove whether a proposition is true or false from a KB.
Modus Ponens (MP)- given P→Q and P, conclude that Q is true.
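A minimal forward-chaining sketch (the KB and predicate strings are invented): each rule is a (premise, conclusion) pair standing for P→Q, and Modus Ponens is applied until no new facts appear:

```python
def forward_chain(facts, rules):
    """Repeatedly apply Modus Ponens: if P is in facts and (P, Q) is a rule, add Q."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise in facts and conclusion not in facts:
                facts.add(conclusion)          # Modus Ponens step
                changed = True
    return facts

kb_facts = {"Human(Socrates)"}
kb_rules = [("Human(Socrates)", "Mortal(Socrates)")]
print("Mortal(Socrates)" in forward_chain(kb_facts, kb_rules))  # True
```

Real FOPC inference additionally needs unification over variables and quantifiers; this sketch only handles ground (variable-free) propositions.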
Errata
- 5.3- what's the difference between an accomplishment and an achievement?
- 5.8- Case grammar is a linguistic system that focuses on the connection between the quantity(?) such as the subject, object, or valence of a verb and the grammatical context.
- 5.8.1- Jack gets a prize. Statement shows Jack is the agent as he is doer to get, the prize is the object being received, so it is a patient. What is patient?
8. Transfer Learning and Transformer Technology
8.1 What Is Transfer Learning?
Transfer learning (TL) involves solving a problem by leveraging acquired knowledge and applying that knowledge to address another related problem
8.2 Motivation of TL
Traditional ML datasets and trained model parameters cannot be reused.
8.2.1 Categories of TL
Heterogeneous and Homogeneous
8.3 Solutions of TL
8.3.1 Instance-Based Method
reweights samples from source domains
8.3.2 Feature-Based Method
Works for both heterogeneous and homogeneous TL problems. Asymmetric feature transformation modifies the source domain to reduce the gap between source and target instances by transforming one of the domains into the other. Symmetric feature transformation transforms source and target domains into a shared feature space.
8.3.3 Parameter-Based Method
Transfers learned knowledge by sharing parameters common to the models of the source and target learners; the trained model is transferred from the source domain to the target domain together with its parameters. This approach can train more than one model on the source data and combine the parameters learned from all models to improve the results of the target learner.
8.3.4 Relational-Based Method
Transfers learned knowledge by sharing learned relations between different sample parts of the source and target domains.
transfers learned knowledge by sharing its learned relations between different sample parts of source and target domains
8.4 Recurrent Neural Network (RNN)
8.4.1 What Is RNN?
The RNN has memory, which means its output is influenced by prior elements of the sequence, unlike a traditional feedforward neural network (FNN).
8.4.2 Motivation of the RNN
To feel under the weather means unwell. This phrase makes sense only when it is expressed in that specific order.
There are five major categories of RNN architecture corresponding to different tasks:
- simple one-to-one model for image classification task
- one-to-many model for image captioning tasks
- many-to-one model for sentiment analysis tasks
- many-to-many models for machine translation
- complex many-to-many models for video classification tasks
8.4.3 RNN Architecture
A significant difference between the RNN and traditional neural networks is that the weights and biases U, W, and V are shared among layers.
A partly recurrent network is a layered network with distinct input and output layers where recurrence is limited to the hidden layer. A fully connected recurrent neural network (FRNN) connects all neurons' outputs to inputs.
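The shared-weight recurrence can be sketched in plain Python; the matrices below are invented toy values (a real implementation would use a tensor library), and the same U, W, b are reused at every time step:

```python
import math

def rnn_step(x, h_prev, U, W, b):
    """One recurrent step: h_t = tanh(U x_t + W h_{t-1} + b).
    U and W are weight matrices (lists of rows) shared across all time steps."""
    def matvec(M, v):
        return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]
    pre = [u + w + bi for u, w, bi in zip(matvec(U, x), matvec(W, h_prev), b)]
    return [math.tanh(p) for p in pre]

# Tiny illustrative weights: 2-dim input, 2-dim hidden state.
U = [[0.5, 0.0], [0.0, 0.5]]
W = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]

h = [0.0, 0.0]
for x in [[1.0, 0.0], [0.0, 1.0]]:       # the same U, W, b are applied at each step
    h = rnn_step(x, h, U, W, b)
print(h)
```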
8.4.4 Long Short-Term Memory (LSTM) Network
LSTM is a type of RNN with special hidden layers that handle gradient explosion and vanishing-gradient problems when training on long sequences, giving better performance on longer sequences than naïve RNNs. It has two hidden layers like the RNN, but a memory cell replaces the hidden node. LSTM has forget, memory-select, and output stages.
8.4.5 Gated Recurrent Unit (GRU)
GRU can be considered a kind of RNN, like LSTM, but manages the backpropagation gradient problem differently.
8.4.6 Bidirectional Recurrent Neural Networks (BRNNs)
A BRNN is a type of RNN with layers in two directions. It links previous and subsequent information to perform inference, whereas both the RNN and LSTM possess only information from the past. A BRNN consists of two RNNs superimposed on each other; the output is jointly generated from the two RNNs' states.
8.5 Transformer Technology
8.5.1 What Is Transformer?
The Transformer is a network architecture based on the attention mechanism, without relying on recurrent or convolutional units.
8.5.2 Transformer Architecture
A transformer model has two parts: (1) encoder and (2) decoder. The encoder maps the input language sequence into a hidden representation, and the decoder maps the hidden representation back to an output sequence.
8.5.3 Deep into Encoder
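The encoder's core operation, scaled dot-product self-attention softmax(QKᵀ/√d_k)V, can be sketched in plain Python (toy 2-dimensional embeddings with invented values; real models add learned Q/K/V projections and multiple heads):

```python
import math

def softmax(xs):
    exps = [math.exp(x - max(xs)) for x in xs]   # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention: each query mixes the value vectors,
    weighted by softmax of its (scaled) dot products with all keys."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K])
        out.append([sum(w * v[j] for w, v in zip(scores, V)) for j in range(len(V[0]))])
    return out

# Three tokens with 2-dim embeddings; Q = K = V as in basic self-attention.
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(self_attention(Q, K, V))
```

Because every query attends to every key, each output row is a convex combination of all token vectors, regardless of position, which is exactly the property that lets the Transformer drop recurrence.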
8.6 BERT
8.6.1 What Is BERT?
BERT is a pretrained model of language representation: Bidirectional Encoder Representations from Transformers.
8.6.2 Architecture of BERT
Previously, language models were pretrained either left-to-right or right-to-left; BERT is trained bidirectionally.
8.6.3 Training of BERT
BERT has two training process steps: (1) pretraining and (2) fine-tuning.
8.7 Other Related Transformer Technology
8.7.1 Transformer-XL
8.7.2 ALBERT
10 Large Language Models (LLMs) and Generative Artificial Intelligence (GenAI)
10.1 Introduction to LLM and GenAI
10.1.1 What Is a Large Language Model (LLM)?
Innovative machine learning models designed to learn from textual data; understand language patterns such as grammar, syntax, context, and semantics; and process them with sophisticated architectures to generate relevant, coherent, contextual text, translate languages, summarize content, and answer questions in NLP. Transformers use self-attention mechanisms to weigh the importance of different words in a sentence regardless of their positions, overcoming the limitations of RNNs and LSTMs.
10.1.2 Understanding Generative Artificial Intelligence (GenAI)
GenAI- AI systems that can generate new content, whether text, images, music, or other forms of media, based on patterns learnt from vast datasets. A Generative Adversarial Network (GAN) consists of two neural networks: (1) a generator to create new data instances and (2) a discriminator to evaluate them against real-world data.
10.1.3 The Intersection of LLM and GenAI
10.1.4 The Importance of LLMs in Modern AI
- Applications Across Industries: LLMs can automate complex tasks required by human expertise previously.
- Conversational AI and Customer Service
- Enhancing Creativity and Content Generation
- Multilingual and Cross-Cultural Communication
- The Future of Human-Machine Interaction
10.2 Foundations of LLMs
10.2.1 Neural Network Architectures
- Multi-Layer Perceptron (MLP): one of the earliest neural networks
- RNNs
- LSTMs: introduced memory cells to store information over longer time spans
- GRUs (Gated Recurrent Units): simplified LSTM by merging certain gates
- Convolutional Neural Networks (CNNs): primarily used in computer vision

10.2.2 Attention Mechanisms
Attention mechanisms have revolutionized NLP by allowing models to selectively focus on relevant parts of the input sequence when making predictions. In traditional sequence-to-sequence models, the encoder compresses the entire input sequence into a fixed-size vector, which causes complications with long sentences. Attention lets the decoder "attend" to different parts of the input sequence at each step of the decoding process, assigning a different weight to each input token. Self-attention extends this idea so that each token in a sequence attends to every other token in the same sequence.
10.2.3 The Transformer Architecture
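The self-attention computation can be sketched with plain NumPy; this is a toy single-head example with made-up dimensions and random weights, not the implementation from any particular model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: every token attends to every token."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) attention scores
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights               # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                   # 4 tokens, 8-dim embeddings (toy sizes)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, w = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one contextualized vector per token
```

Multi-head attention simply runs several such heads in parallel and concatenates their outputs.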
The Transformer architecture signified a fundamental shift in LLM design: it builds upon the self-attention concept and discards the sequential processing of RNNs. Its core component is the multi-head self-attention mechanism, which focuses on different parts of the input sequence simultaneously. The outputs of all attention heads are concatenated and passed through a feed-forward network to generate the final output.
10.2.4 Scaling Up: From BERT to GPT
BERT is a Transformer-based model that understands the contextual relationships between words in a sentence. BERT proposed bidirectional training, learning from both directions simultaneously; this allows it to capture richer contextual information and improves performance on a wide range of NLP tasks. Its training process consists of two major steps: pretraining and fine-tuning. In pretraining, the model is trained on a large corpus using two unsupervised tasks: masked language modeling (MLM) and next sentence prediction (NSP). In MLM, random words in a sentence are masked and the model is tasked with predicting the missing words; NSP trains the model to understand relationships between sentences. GPT, by contrast, is a unidirectional model that processes text from left to right, trained with an auto-regressive approach to predict the next word in a sequence based on the previous words. Larger models not only improve performance on language tasks but also exhibit emergent capabilities that are absent in smaller models.
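The MLM objective can be illustrated with a toy masking step in plain Python; this only shows how a training example is formed, not BERT's actual pipeline, and the sentence is invented:

```python
import random

random.seed(0)
tokens = ["the", "movie", "was", "surprisingly", "good"]

i = random.randrange(len(tokens))      # pick a random position to mask
target = tokens[i]                     # the word the model must predict
masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
print(masked, "->", target)
```

A GPT-style auto-regressive objective would instead hide everything to the right of a position and predict the next token.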
10.3 Key Players in the LLM Landscape
10.3.1 ChatGPT by OpenAI (Current Version: GPT-4)
A breakthrough in LLMs and NLP. These models are designed to generate human-like text by leveraging a Transformer-based architecture and vast datasets. Each generation, from GPT-1 to the recent GPT-4, has demonstrated increasing complexity, performance, and applicability. The models are pretrained on a diverse corpus of internet data, including articles, books, websites, and other publicly available content.
10.3.2 Pathways Language Model (PaLM) by Google DeepMind (Current Version: PaLM 2)
Superseded by Gemini.
10.3.3 Large Language Model Meta AI (LLaMA) by Meta (Current Version: LLaMA 2)
One of its motivations is to develop a model that matches or exceeds GPT-3's performance at lower computational cost: with larger datasets and well-tuned training strategies, comparable results can be achieved with fewer parameters.
10.3.4 Claude by Anthropic (Current Version: Claude 2)
An LLM designed for safety, alignment, and usability that prioritizes ethical considerations. Constitutional AI Framework: this approach embeds ethical constraints into the model's learning process. Claude is trained using a set of guiding principles (a "constitution") to evaluate and correct its responses without extensive post-deployment modification.
10.3.5 ERNIE 3.0 Titan by Baidu
Trained in Chinese and English; combines auto-regressive (like GPT) and auto-encoding (like BERT) architectures.
10.4 Applications of LLMs in GenAI
10.4.1 Creative Writing and Content Generation
10.4.2 Language Translation
10.4.3 Conversational AI and Chatbots
10.4.4 Text Summarization and Content Curation
10.5 Ethical Considerations and Challenges
10.5.1 Detecting and Mitigating Bias
Bias mitigation begins with conscientious curation of training data. Algorithmic fairness techniques can also be applied, such as adversarial training, where a secondary model detects and corrects biased outputs from the primary model. Human-in-the-loop systems add human oversight.
10.5.3 The Spread of Misinformation
Verifying information generated by black-box LLMs is particularly challenging because they do not provide sources for their outputs.
10.5.4 Ethical Guidelines for LLM Deployment
10.6 Future Outlook and Research
10.6.1 Current Trends in LLMs and GenAI
Multimodal models are designed to integrate text, images, audio, and video, progressing beyond purely textual data. LLMs have advanced significantly, from GPT-2 with 1.5 billion parameters to GPT-4 with hundreds of billions of parameters. Specialized LLMs are domain-specific models that can be fine-tuned for industries such as medicine, law, or finance. Few-shot and zero-shot learning refer to an LLM's ability to perform tasks with minimal or no task-specific data.
10.6.2 The Future of Creativity in AI
10.6.3 The Role of LLMs in AI Ethics
10.6.4 The Path Forward: Research and Development
Errata
- 10.1.1- It generates human-like text, ranging from essay composition to code-snippet creation, with over 175 billion parameters to capture more intricate linguistic nuances than smaller models.
- Fig 10.1- What does the vertical axis mean?
- 10.2.1-
- Neural networks are the foundation of machine learning and NLP in LLMs development to mimic the cognitive functions of the human brain.
- Multi-Layer Perceptron (MLP) is one of the earliest neural networks but is limited in capturing temporal dependencies in data for language tasks.
- Convolutional Neural Networks (CNNs), primarily used in computer vision tasks, have found their way into NLP through sentence classification and character-level modeling applications.
14. Workshop #4 Semantic Analysis and Word Vectors Using spaCy
14.1 Introduction
14.2 What Are Word Vectors?
A word vector is a dense representation of a word.
14.3 Understanding Word Vectors
Also called word2vec (though the errata below note that word2vec is strictly the name of an algorithm, not a synonym for word vectors).
14.3.1 Example: A Simple Word Vector
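For instance, a word can be represented as a fixed-length list of real numbers; the 4-dimensional values below are invented for illustration (real vectors come from a trained model and typically have hundreds of dimensions):

```python
# A word vector is just a dense, fixed-length array of floats.
# These numbers are made up; a trained model would learn them from a corpus.
word_vector = {"banana": [0.21, -0.37, 0.88, 0.04]}
print(len(word_vector["banana"]))  # 4 dimensions
```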
14.4 A Taste of Word Vectors
14.5 Analogies and Vector Operations
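The classic analogy king − man + woman ≈ queen can be sketched with hand-made toy vectors and cosine similarity (the numbers are chosen so the analogy works; real embeddings are learned, not hand-set):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-d vectors, purely illustrative.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.2, 0.1]),
    "woman": np.array([0.5, 0.2, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "apple": np.array([0.1, 0.9, 0.4]),
}
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w not in ("king", "man", "woman")),
           key=lambda w: cosine(target, vecs[w]))
print(best)  # queen
```

The same vector arithmetic works on pretrained vectors such as spaCy's or word2vec's.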
14.6 How to Create Word Vectors?
- word2vec (Google, 2013)
- GloVe (Stanford, 2014)
- fastText (Facebook, 2015)
14.7 spaCy Pretrained Word Vectors
14.8 Similarity Method in Semantic Analysis
14.9 Advanced Semantic Similarity Methods with spaCy
14.9.1 Understanding Semantic Similarity
14.9.2 Euclidean Distance
Euclidean distance is the distance between two points on a graph: √((x1 − x2)² + (y1 − y2)²).
14.9.3 Cosine Distance and Cosine Similarity
Cosine distance is more concerned with the orientation (angle) of two vectors in the space. The similarity score ranges from 0 to 1.
14.9.4 Categorizing Text with Semantic Similarity
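Categorizing text by semantic similarity can be sketched as: represent each category and each text by a vector, then assign the text to the category whose vector is most similar. The vectors below are made up; in practice they would come from a model such as spaCy's pretrained vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

# Hypothetical category vectors and one text vector (illustrative numbers only).
categories = {
    "sports":  np.array([0.9, 0.1, 0.0]),
    "finance": np.array([0.1, 0.9, 0.2]),
}
text_vec = np.array([0.8, 0.2, 0.1])
label = max(categories, key=lambda c: cosine_similarity(text_vec, categories[c]))
print(label)  # sports
```

Both measures agree here; cosine is preferred when only orientation, not magnitude, should matter.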
14.9.5 Extracting Key Phrases
A noun phrase (NP) is a group of words that consists of a noun and its modifiers. Modifiers are usually pronouns, adjectives, and determiners.
14.9.6 Extracting and Comparing Named Entities
Errata
- 14.3- Word vectors, or word2vec, are important quantity units in statistical methods to represent text in statistical NLP algorithms.
- According to Wikipedia, word2vec is an algorithm for creating vectors via neural networks, not statistical models. It was developed at Google in 2013. The author mentions this in 14.6.
- I don't think the author does a good job of explaining word vector matrices. He points out that each row represents a word, but doesn't explain what the columns represent. I found better explanations at dzone and acolyer.com
- Spacy is introduced in ch12.
- Similar words are not assigned similar vectors, resulting in unmeaningful vectors...
- 14.6
- word2vec is a name of statistical algorithm created by Google to produce word vectors. word2vec creates models that train neural nets.
- GloVe is short for Global Vectors. Not Glove vectors.
15 Workshop #5: Sentiment Analysis and Text Classification (Hour 9–10)
15.1 Introduction
NLTK and spaCy are two major Python NLP tools for basic text processing, N-gram modeling, POS tagging, and semantic analysis.
- Study text classification concepts in NLP and how spaCy NLP pipeline works on text classifier training.
- Use movie reviews as a problem domain to demonstrate how to implement sentiment analysis with spaCy.
- Introduce Artificial Neural Network (ANN) concepts, TensorFlow, and Keras technologies.
- Introduce the sequential modeling scheme with LSTM technology, using the movie reviews domain as an example to integrate these technologies for text classification and movie sentiment analysis.
15.2 Text Classification with spaCy and LSTM Technology
TextCategorizer is spaCy's text classifier component, applied to a dataset for sentiment analysis to perform text classification with two vital Python frameworks: (1) the TensorFlow Keras API and (2) spaCy technology.
15.3 Technical Requirements
15.4 Text Classification in a Nutshell
15.4.1 What Is Text Classification?
Text Classification is the task of assigning a set of predefined labels to text.
- Language detection is the first step of many NLP systems, e.g., machine translation.
- Topic generation and detection is the process of summarizing, or classifying, a batch of sentences into certain topics of interest (TOI) or topic titles.
- Sentiment analysis classifies or analyzes users' responses, comments, and messages on a particular topic as positive, neutral, or negative.
15.4.2 Text Classification as AI Applications
Text classification is considered a supervised-learning (SL) task.
- Binary text classification refers to categorizing text into two classes.
- Multi-class text classification refers to categorizing texts with more than two classes.
- A multi-label text classification system generalizes its multi-class counterpart: multiple labels can be assigned to each example text (e.g., toxic, severe toxic, threat).
15.5 Text Classifier with spaCy NLP Pipeline
TextCategorizer (tCategorizer) is spaCy's text classifier component. It requires class labels and examples in the NLP pipeline to perform the training procedure.
15.5.1 TextCategorizer Class
TextCategorizer consists of (1) single-label and (2) multi-label classifiers.
15.5.2 Formatting Training Data for the TextCategorizer
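A minimal sketch of the (text, annotation) tuple format spaCy's TextCategorizer trains on, where each annotation carries a "cats" dictionary of label scores; the review snippets and POS/NEG label names here are invented for illustration:

```python
# Each training example pairs a text with a "cats" dict of per-label scores
# (1.0 = the label applies, 0.0 = it does not).
train_data = [
    ("A wonderful, moving film.",  {"cats": {"POS": 1.0, "NEG": 0.0}}),
    ("Dull plot and flat acting.", {"cats": {"POS": 0.0, "NEG": 1.0}}),
]

texts = [text for text, _ in train_data]
labels = [ann["cats"] for _, ann in train_data]
print(len(texts), labels[0]["POS"])  # 2 1.0
```

For multi-label training, several entries in the same "cats" dict may be 1.0 at once.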
15.5.3 System Training
15.5.4 System Testing
15.5.5 Training TextCategorizer for Multi-Label Classification
15.6 Sentiment Analysis with spaCy
15.6.1 IMDB Large Movie Review Dataset
15.6.2 Explore the Dataset
15.6.3 Training the TextClassifier
15.7 Artificial Neural Network in a Nutshell
This workshop section shows how to incorporate spaCy with ANN technology using TensorFlow and its Keras package.
A typical ANN has
- Input layer consists of input neurons, or nodes
- Hidden layer consists of hidden neurons, or nodes
- Output layer consists of output neurons, or nodes
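A toy forward pass through such a three-layer network can be sketched in NumPy; the layer sizes and random weights are made up, and no training is shown:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)                     # input layer: 3 input nodes
W1 = rng.normal(size=(3, 4))               # weights into 4 hidden nodes
W2 = rng.normal(size=(4, 2))               # weights into 2 output nodes

hidden = np.tanh(x @ W1)                   # hidden layer with tanh activation
output = 1 / (1 + np.exp(-(hidden @ W2)))  # output layer with sigmoid activation

print(output.shape)  # (2,)
```

Training would adjust W1 and W2 by backpropagation, which Keras handles automatically.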
15.8 An Overview of TensorFlow and Keras
TensorFlow is a popular Python tool widely used for machine learning. Keras is a Python-based deep learning tool that can be integrated with platforms such as TensorFlow, Theano, and CNTK.
15.9 Sequential Modeling with LSTM Technology
Keras has extensive support for RNN variations (GRU, LSTM) and a simple API for training RNNs.
15.10 Keras Tokenizer in NLP
Tokens are vectorized by the following steps:
- Tokenize each utterance and turn these utterances into a sequence of tokens.
- Build a vocabulary from the set of tokens produced in Step 1. These are the tokens to be recognized by the neural network design.
- Create a vocabulary and assign ID to each token.
- Map token vectors with corresponding token-IDs.
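The four steps above can be sketched in plain Python; this is a toy re-implementation of what Keras' Tokenizer does internally (the real class adds filtering, OOV handling, and many options), with invented example utterances:

```python
utterances = ["I loved the movie", "I hated the movie"]

# Step 1: tokenize each utterance.
tokenized = [u.lower().split() for u in utterances]

# Steps 2-3: build a vocabulary and assign an ID to each token.
vocab = {}
for tokens in tokenized:
    for tok in tokens:
        vocab.setdefault(tok, len(vocab) + 1)  # IDs start at 1; 0 is reserved for padding

# Step 4: map each utterance to its sequence of token-IDs.
sequences = [[vocab[tok] for tok in tokens] for tokens in tokenized]
print(vocab["movie"], sequences[1])  # 4 [1, 5, 3, 4]
```

Keras' `Tokenizer.fit_on_texts` and `texts_to_sequences` perform the same vocabulary-building and mapping.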
15.10.1 Embedding Words
Tokens can be transformed into token vectors. Embedding tokens into vectors occurs via a lookup embedding table.
- Each utterance is divided into tokens and a vocabulary is built with Keras' Tokenizer.
- The Tokenizer object holds a token index with a token→token-ID mapping.
- Given a token-ID, look up the embedding table row with that ID to acquire the token vector.
- This token vector is fed to the neural network.
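The lookup itself is just row indexing into an embedding matrix; random numbers below stand in for trained embeddings, and the sizes are toy assumptions:

```python
import numpy as np

vocab_size, embed_dim = 10, 4                 # assumed toy sizes
rng = np.random.default_rng(2)
embedding_table = rng.normal(size=(vocab_size, embed_dim))  # one row per token-ID

token_ids = [3, 7, 1]                         # IDs produced by the Tokenizer
token_vectors = embedding_table[token_ids]    # row lookup: one vector per token
print(token_vectors.shape)  # (3, 4)
```

Keras' Embedding layer performs exactly this lookup, with the table learned during training.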
15.11 Movie Sentiment Analysis with LSTM Using Keras and spaCy
This section demonstrates the design of an LSTM-based RNN text classifier for sentiment analysis with the steps below:
- Data retrieval and preprocessing.
- Tokenize review utterances with padding.
- Create padded utterance sequences and feed them into the input layer.
- Vectorize each token by its token-ID in the embedding layer.
- Input token vectors into LSTM.
- Train LSTM network.
Step 1: Dataset
Step 2: Data and vocabulary preparation
Step 3: Implement the Input Layer
Step 4: Implement the Embedding Layer
Step 5: Implement the LSTM Layer
Step 6: Implement the Output Layer
Step 7: System Compilation
Step 8: Model Fitting and Experiment Evaluation
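Steps 3-7 can be sketched with the Keras API; the hyperparameters (vocabulary size 10,000, sequence length 100, embedding and LSTM width 64) are assumptions for illustration, not the workshop's exact settings:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

max_len, vocab_size = 100, 10_000              # assumed toy hyperparameters

# Input (Step 3) is a batch of padded token-ID sequences of length max_len.
model = tf.keras.Sequential([
    layers.Embedding(vocab_size, 64),          # Step 4: token-ID -> 64-d token vector
    layers.LSTM(64),                           # Step 5: LSTM layer over the sequence
    layers.Dense(1, activation="sigmoid"),     # Step 6: positive/negative score
])
model.compile(optimizer="adam",                # Step 7: compile for binary sentiment
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Step 8 would call model.fit(x_train, y_train, ...) on the padded sequences.
preds = model.predict(np.zeros((2, max_len)), verbose=0)
print(preds.shape)  # (2, 1): one sentiment score per review
```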