Definitions
| Bigram | 2 words. N=2. I am. He is. We are. Used frequently. |
|---|---|
| Bigram normalization | division of each bigram count C(wn-1, wn) by the corresponding unigram count C(wn-1) |
| Corpus | a collection of written texts |
| Language model | traditional word-counting model that calculates conditional probabilities to predict the probability of a word sequence |
| Lemma | Basic word shared by word forms. In the dictionary. |
| Lemmatization | Get the lemma. Slow. Am, is, are, was, were → Be |
| N-gram | a contiguous sequence of N words; the basis of the N-gram statistical model. |
| Quadrigram | 4 words. N=4. |
| Sentence | unit of written language; domain specific |
| Stem | Root of a word. Might not be in the dictionary. |
| Stemming | Get the word's root. Fast. Student, studious, study → Stud |
| Tokens | can be meaningful words or symbols, punctuations, or distinct characters |
| Tuple | Ordered, immutable collection of elements. |
| Trigram | 3 words. N=3. I am good. See spot run. We go there. Used rarely. |
| Types/Word Types | distinct words in a corpus |
| Unigram | 1 word. N=1. Seldom used. |
| Utterance | unit of spoken language; domain and culture specific |
| Word Form | another basic entity in a corpus |
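A quick toy sketch of the stemming vs. lemmatization contrast above. The suffix rules and the lemma table are made up for illustration; they are not a real stemmer or lemmatizer.

```python
# Stemming: fast, rule-based suffix stripping; result may not be a real word.
# Lemmatization: slower dictionary lookup; result is the dictionary form.

def toy_stem(word):
    # Strip common suffixes (illustrative rules only).
    for suffix in ("ious", "ent", "ies", "y"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# Tiny hand-built lemma dictionary (illustrative only).
LEMMAS = {"am": "be", "is": "be", "are": "be", "was": "be", "were": "be"}

def toy_lemmatize(word):
    # Look up the dictionary form (lemma); fall back to the word itself.
    return LEMMAS.get(word.lower(), word.lower())

print([toy_stem(w) for w in ["student", "studious", "study"]])   # all "stud"
print([toy_lemmatize(w) for w in ["Am", "is", "are", "was", "were"]])  # all "be"
```

Note how the stems need not be dictionary words ("stud"), while the lemma ("be") always is.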
Abbreviations
| ASR | Automatic Speech Recognition |
|---|---|
| ATN | Augmented Transition Network |
| BERT | Bidirectional Encoder Representations from Transformers |
| CDD | Conceptual Dependency Diagram |
| CFG | Context Free Grammar |
| CFL | Context Free Language |
| CNN | Convolutional Neural Networks |
| CR | Coreference Resolution |
| DT | Determiner |
| FOPC | First Order Predicate Calculus |
| GPT | Generative Pretrained Transformer |
| GRU | Gated Recurrent Unit |
| HMM | Hidden Markov Model |
| IE | Information Extraction |
| IR | Information Retrieval |
| LLM | Large Language Model |
| LME | Language Model Evaluation |
| LSTM | Long Short Term Memory |
| MEMM | Maximum Entropy Markov Model |
| MLE | Maximum Likelihood Estimation |
| MeSH | Medical Subject Headings |
| NER | Named Entity Recognition |
| NLTK | Natural Language Toolkit |
| NLU | Natural Language Understanding |
| NN | Noun |
| NNP | Proper Noun |
| Nom | Nominal |
| NP | Noun Phrase |
| PCFG | Probabilistic Context Free Grammar |
| PMI | Point-Wise Mutual Information |
| POS | Part Of Speech |
| POST | Part Of Speech Tagging |
| PPMI | Positive Point-wise Mutual Information |
| RNN | Recurrent Neural Network |
| TBL | Transformation Based Learning |
| VB | Verb |
| VP | Verb Phrase |
| WSD | Word Sense Disambiguation |
1. Natural Language Processing
1.1 Introduction
Modern chatbots exemplify human-computer interaction and require both world knowledge and domain knowledge.
Knowledge is organized into knowledge trees or ontology graphs.
NLP is cross-disciplinary integration of philosophy, psychology, linguistics, and computational linguistics.
computational linguistics - multidisciplinary study of epistemology, philosophy, psychology, cognitive science, and agent ontology.
1.2 Human Language and Intelligence
Core technologies and methodologies arose from the Turing Test.
Human language is categorized as written and oral.
1.3 Linguistic Levels of Human Language
Seven levels of linguistic analysis: phonetics, phonology, morphology, lexicology, syntax, semantics, and pragmatics.
1.4 Human Language Ambiguity
- Lexical ambiguity- words can have different meanings
- Syntactic- who is holding the bag?
- Semantic- what does "it" mean in a sentence?
- Pragmatic- what does the sentence mean?
1.5 A Brief History of NLP
1.5.1 First Stage: Machine Translation (Before the 1960s)
Leibniz and Descartes codified relationships between words and sentences.
Alan Turing wrote Computing Machinery and Intelligence in 1950 which proposed the Turing test.
The Georgetown-IBM experiment translated over 60 Russian sentences into English in 1954.
Chomsky wrote Syntactic Structures in 1957.
The ALPAC report threw cold water on NLP research in 1966.
1.5.2 Second Stage: Early AI on NLP (1960s-1970s)
The Baseball Q&A program was developed in 1961.
Marvin Minsky wrote Semantic Information Processing in 1968.
William Woods developed augmented transition networks (ATN) in 1970.
1.5.3 Third Stage: Grammatical Logic on NLP (1970s-1980s)
1.5.4 Fourth Stage: AI and Machine Learning (1980s-2000s)
IBM began developing Watson, a DeepQA program which could compete on Jeopardy.
1.5.5 Fifth Stage: Rise of BERT, Transformer, ChatGPT, and LLMs (2000s-Present)
Long short-term memory (LSTM) recurrent neural networks (RNN) became dominant.
Google Brain published Attention Is All You Need which introduced the transformer architecture.
Google introduced BERT.
OpenAI developed ChatGPT which utilizes generative pre-trained transformers.
1.6 NLP and AI
1.7 Main Components of NLP
NLP consists of: Natural Language Understanding (NLU), Knowledge Acquisition and Inferencing (KAI), and Natural Language Generation (NLG).
1.8 Natural Language Understanding (NLU)
NLU is a process of understanding spoken language in four stages: speech recognition, and syntactic, semantic, and pragmatic analysis.
1.8.1 Speech Recognition
Speech recognition is the first stage in NLU that performs phonetic, phonological, and morphological processing to analyze spoken language.
It breaks spoken utterances down into stems and distinct tokens representing paragraphs, sentences, and words.
Current speech recognition models use spectrogram analysis.
1.8.2 Syntax Analysis
Reject phrases like, "I you love."
1.8.3 Semantic Analysis
Reject nonsense like, "hot snowflakes".
1.8.4 Pragmatic Analysis
Requires expert knowledge or just common sense.
1.9 Potential Applications of NLP
NLP is used for translation, information extraction (IE), information retrieval (IR), sentiment analysis, and chatbots.
1.9.1 Machine Translation (MT)
Earliest NLP application.
The book oddly claims it "is not difficult to translate one language to another" (?!); in practice high-quality MT is hard.
1.9.2 Information Extraction (IE)
1.9.3 Information Retrieval (IR)
1.9.4 Sentiment Analysis
1.9.5 Question-Answering (Q&A) Chatbots
2. N-Gram Language Model
2.1 Introduction
- Word-level tokenization with NLTK and spaCy in Workshop 1 (Chap. 11) analyzed words in isolation
- In many NLP applications, noise and disruptions cause incorrect word pronunciation
- Humans experience mental confusion over spelling errors
- Word prediction can provide spell checking and model relationships between words
- Probability or word counting method can work on a large databank called a corpus
2.2 N-Gram Language Model
2.2.1 Basic NLP Terminology
| Sentence | unit of written language; domain specific |
|---|---|
| Utterance | unit of spoken language; domain and culture specific |
| Word Form | another basic entity in a corpus |
| Types/Word Types | distinct words in a corpus |
| Tokens | can be meaningful words or symbols, punctuations, or distinct characters |
| Stem | Root of a word. Might not be in the dictionary. |
| Stemming | Get the word's root. Fast. Student, studious, study → Stud |
| Lemma | Basic word shared by word forms. In the dictionary. |
| Lemmatization | Get the lemma. Slow. Am, is, are, was, were → Be |
| English overall | Over a trillion tokens with over a million meaningful wordform types, sufficient to generate sentences/utterances for daily use |
| Brown Corpus | First well-organized corpus. Brown University. 1961. About 1 million tokens across 500 written texts, including foreign words. |
| Wall Street Journal | Financial domain |
| Associated Press | International news |
| Hansard | British parliamentary speeches |
| BU Broadcast News Corpus | |
| NLTK Corpus | Corpora bundled with the Natural Language Toolkit |
A language model is a traditional word-counting model that calculates conditional probabilities to predict the probability of a word sequence.
2.2.2 Language Modeling and Chain Rule
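A minimal sketch of the chain rule with a bigram (Markov) approximation, P(w1..wn) ≈ P(w1|&lt;s&gt;) · P(w2|w1) · … · P(&lt;/s&gt;|wn), computed over an invented three-sentence corpus:

```python
from collections import Counter

# Toy corpus (invented); sentences are padded with <s> and </s>.
corpus = [["i", "am", "sam"], ["sam", "i", "am"], ["i", "am", "happy"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    padded = ["<s>"] + sent + ["</s>"]
    unigrams.update(padded[:-1])             # history counts C(w_{n-1})
    bigrams.update(zip(padded, padded[1:]))  # bigram counts C(w_{n-1}, w_n)

def p_bigram(prev, word):
    # MLE conditional probability P(word | prev).
    return bigrams[(prev, word)] / unigrams[prev]

def p_sentence(sent):
    # Chain rule under the bigram assumption: product of conditionals.
    padded = ["<s>"] + sent + ["</s>"]
    prob = 1.0
    for prev, word in zip(padded, padded[1:]):
        prob *= p_bigram(prev, word)
    return prob

print(p_sentence(["i", "am", "sam"]))  # (2/3)*(3/3)*(1/3)*(1/2) = 1/9
```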
2.3 Markov Chain in N-Gram Model
A Markov chain is a process describing a sequence of possible events where the probability of each event depends only on the previous event.
Entire sentences can be modelled as Markov chains.
2.4 Example: The Adventures of Sherlock Holmes
Maximum Likelihood Estimation (MLE) estimates N-gram probabilities by dividing observed counts by totals.
Bigram normalization is the division of each bigram count by the unigram count of wn-1.
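Bigram normalization in code, using toy counts invented for illustration:

```python
# MLE bigram normalization: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}).
# All counts below are made up for illustration.
bigram_counts = {("i", "am"): 8, ("i", "do"): 2, ("am", "happy"): 5}
unigram_counts = {"i": 10, "am": 5}

# Divide each bigram count by the unigram count of its first word.
bigram_probs = {
    (prev, w): c / unigram_counts[prev]
    for (prev, w), c in bigram_counts.items()
}

print(bigram_probs[("i", "am")])   # 8/10 = 0.8
print(bigram_probs[("i", "do")])   # 2/10 = 0.2
```

The probabilities conditioned on the same history sum to 1, which is the point of normalizing by C(wn-1).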
2.5 Shannon’s Method in N-Gram Model
- Choose a random N-gram (<s>, w) according to its probability
- Now choose a random N-gram (w, x) according to its probability
- And so on until we choose </s>
- String the words together into a sentence
Constructing sentences from trigrams seems to be most suitable.
2.6 Language Model Evaluation and Smoothing Techniques
Language Model Evaluation (LME) is the standard method: train model parameters on a training set, then evaluate how the model performs on unseen data called the test set.
The evaluation must also deal with unknown words.
2.6.1 Perplexity
Perplexity (PP) is the inverse probability of the test set assigned by the language model, normalized by the number of words.
Minimizing perplexity is the same as maximizing probability for model performance.
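Perplexity in code, PP(W) = P(w1..wN)^(-1/N), computed in log space; the bigram probability function passed in is a stand-in for any model:

```python
import math

def perplexity(sentence, p_bigram):
    # PP(W) = P(W)^(-1/N); log space avoids underflow on long sequences.
    padded = ["<s>"] + sentence + ["</s>"]
    log_prob = 0.0
    for prev, word in zip(padded, padded[1:]):
        log_prob += math.log(p_bigram(prev, word))
    n = len(padded) - 1              # number of predicted words
    return math.exp(-log_prob / n)

# Sanity check: a model that assigns uniform probability 1/V to every word
# has perplexity exactly V (here V = 50).
V = 50
print(perplexity(["any", "words", "here"], lambda prev, w: 1 / V))
```

The uniform-model check gives the usual intuition: perplexity is the model's effective branching factor.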
2.6.2 Extrinsic Evaluation Scheme
Simply compare two models’ application performance regarding N-gram evaluation.
Time consuming
2.6.3 Zero Counts Problems
Many possible bigrams never show up in training data.
Zipf's Law- in a large corpus, the frequency of any word is inversely proportional to its rank, where the most frequent word occurs twice as often as the second, and thrice as often as the third.
2.6.4 Smoothing Techniques
2.6.5 Laplace (Add-One) Smoothing
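A sketch of add-one smoothing, assuming toy counts and a 4-word vocabulary; the point is that unseen bigrams get probability greater than zero:

```python
# Laplace (add-one) smoothing: pretend every bigram occurred one extra time.
# P_Laplace(w | prev) = (C(prev, w) + 1) / (C(prev) + V), V = vocabulary size.
# Counts below are invented for illustration.
vocab = ["i", "am", "sam", "happy"]
V = len(vocab)
bigram_counts = {("i", "am"): 3}
unigram_counts = {"i": 3, "am": 3, "sam": 2, "happy": 1}

def p_laplace(prev, word):
    return (bigram_counts.get((prev, word), 0) + 1) / (unigram_counts[prev] + V)

print(p_laplace("i", "am"))   # seen bigram: (3+1)/(3+4) = 4/7
print(p_laplace("am", "i"))   # unseen bigram: (0+1)/(3+4) = 1/7, not zero
```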
2.6.6 Add-k Smoothing
2.6.7 Backoff and Interpolation Smoothing
2.6.8 Good Turing Smoothing
3. Part-of-Speech (POS) Tagging
3.1 What Is Part of Speech (POS)?
- Category of words that have similar grammatical behavior or properties.
- Inflection- elements added to the base form of a word to convey grammatical meaning (e.g., cat and cats)
3.1.1 Nine Major POS in the English Language
- adjectives
- verbs
- pronouns
- conjunctions
- prepositions
- articles
- adverbs
- nouns
- interjections
3.2 POS Tagging
3.2.1 What Is POS Tagging in Linguistics?
Labelling a word according to a particular POS based on definition and contexts in linguistics.
3.2.2 What Is POS Tagging in NLP?
Automatic assignment of POS labels to words or tokens.
The operation converts a sentence into a list of words and a list of tuples, where each tuple is a (word, tag) pair signifying the POS.
3.2.3 POS Tags Used in the PENN Treebank Project
POS tag databank provided by the PENN Treebank corpus
classifies the nine major POS into subclasses, giving a total of 45 POS tags
Penn Treebank (PTB) corpus has a comprehensive section of WSJ articles
3.2.4 Why Do We Care About POS in NLP?
3.3 Major Components in NLU
- Morphology- understanding the shapes and patterns of every word in a sentence.
- POS tagging
- Syntax
- Semantics
- Discourse integration- relationships between different sentences and their contents.
3.3.1 Computational Linguistics and POS
POS tagging can be considered as the fundamental process in computational linguistics
3.3.2 POS and Semantic Meaning
3.3.3 Morphological and Syntactic Definition of POS
3.4 Nine Key POS in English
- pronoun
- verb
- adjective (quick)
- adverb (quickly)
- interjection (Hey!)
- noun
- conjunction (and, or, but)
- preposition (in, on, at)
- article (a, the)
3.4.1 English Word Classes
Two types of English word classes: closed and open.
Closed-class words are also known as functional/grammar words. They are closed since new words are seldom created in the class. (conjunctions, determiners, pronouns, and prepositions)
Open-class words are also known as lexical/content words. They are open because new words are frequently created and added to these classes. (nouns, verbs, adjectives, and adverbs)
3.4.2 What Is a Preposition?
used before nouns
approximately 80-100 prepositions in English
3.4.3 What Is a Conjunction?
Coordinating conjunctions join words, clauses, or phrases of equal grammatical rank. (and, but, for, nor, or, yet)
Subordinating conjunctions join independent and dependent clauses to present a relationship (as, although, because, since, though, while, whereas)
3.4.4 What Is a Pronoun?
3.4.5 What Is a Verb?
3.5 Different Types of POS Tagset
3.5.1 What Is Tagset?
A tagset is the set of POS tags used to indicate the part of speech and sometimes other grammatical categories such as case and tense
3.5.2 Ambiguity in POS Tags
- Noun-verb ambiguity. (record)
- Adjective-verb ambiguity. (perfect)
- Adjective-noun ambiguity. (complex)
3.5.3 POS Tagging Using Knowledge
- dictionary
- morphological rules (capitalization, suffixes...)
- N-gram frequencies (next word prediction)
- combinations of structural relationships (inferring the tag from sentence structure)
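A toy tagger combining these knowledge sources: a small dictionary first, then morphological fallback rules (capitalization, suffixes). Tags follow Penn Treebank conventions; the lexicon and rules are illustrative, not from the book:

```python
# Tiny hand-built lexicon (illustrative only).
LEXICON = {"the": "DT", "a": "DT", "dog": "NN", "runs": "VBZ", "run": "VB"}

def tag_word(word, position):
    if word.lower() in LEXICON:
        return LEXICON[word.lower()]       # 1) dictionary lookup
    if word[0].isupper() and position > 0:
        return "NNP"                       # 2) mid-sentence capital -> proper noun
    if word.endswith("ly"):
        return "RB"                        # 3) -ly suffix -> adverb
    if word.endswith("ing"):
        return "VBG"                       # 4) -ing suffix -> gerund/participle
    return "NN"                            # 5) default: noun

sentence = ["The", "dog", "runs", "quickly", "toward", "Boston"]
print([(w, tag_word(w, i)) for i, w in enumerate(sentence)])
```

Note the output is exactly the list-of-(word, tag)-tuples form described in 3.2.2.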
3.6 Approaches for POS Tagging
3 basic approaches to POS Tagging: Rule-based, Stochastic-based, and Hybrid Tagging
3.6.1 Rule-Based Approach POS Tagging
two-stage process: (1) a dictionary lists all possible POS tags for each word; (2) words with more than one possible tag are disambiguated by handwritten grammatical rules based on surrounding words
rule generation can be achieved by (1) hand creation and (2) training from a corpus with machine learning.
3.6.2 Example of Rule-Based POS Tagging
3.6.3 Example of Stochastic-Based POS Tagging
3.6.4 Hybrid Approach for POS Tagging Using Brill’s Taggers
3.6.5 What Is Transformation-Based Learning?
3.6.6 Hybrid POS Tagging: Brill’s Tagger
3.6.7 Learning Brill’s Tagger Transformations
3.7 Taggers Evaluations
3.7.1 How Good Is a POS Tagging Algorithm?
Weird stuff
NLP entities like word-to-word tokenization using NLTK and spaCy technologies in Workshop 1 (Chap. 11) analyzed words in isolation, but the relationship between words is important in NLP.
When applying probability to word prediction in an utterance, words are often proposed by rank and frequency to provide a sequentially optimal estimate.
It was learned that the motivations for word prediction can apply to voice recognition, text generation, and Q&A chatbot.
A language model is a traditional word counting model to count and calculate conditional probability to predict the probability based on a word sequence...
Author doesn't explain the difference between the conditional probability formula and the chain rule.
Further, the book's statement of Zipf's law is garbled: a few events occur with very high frequency, while a large number of rare events occur with low frequency (the long tail).
This approach seems fair and easy to understand... Says you.
ch3
Doesn't define tuples
As human language is natural and the most polytropic? means of communication...
3.5.2 Ambiguous? in POS Tags
It is a two-stage process: (1) dictionary consists of all possible POS tags for basic concepts of words as abovementioned and (2) words with more than single tag ambiguity applied handwritten or grammatic rules to assign the correct tag(s) according to surrounding words.
4. Syntax and Parsing
4.1 Introduction and Motivation
Grammatical (syntactic) rules govern how sentences are created.
Syntax analysis is used to analyze the structure and the relationship between tokens to create a parse tree.
4.2 Syntax Analysis
4.2.1 What Is Syntax
Syntax is the set of rules that govern how groups of words are combined to form phrases, clauses, and sentences.
Syntax can be defined as the correct arrangement of word tokens in written or spoken sentences
4.2.2 Syntactic Rules
There are seven common syntactic patterns:
- Subject → Verb
- Subject → Verb → Direct Object
- Subject → Verb → Subject Complement
- Subject → Verb → Adverbial Complement
- Subject → Verb → Indirect Object → Direct Object
- Subject → Verb → Direct Object → Object Complement
- Subject → Verb → Direct Object → Adverbial Complement
4.2.3 Common Syntactic Patterns
4.2.4 Importance of Syntax and Parsing in NLP
4.3 Types of Constituents in Sentences
4.3.1 What Is Constituent?
4.3.2 Kinds of Constituents
4.3.3 Complexity on Simple Constituents
4.3.4 Verb Phrase Subcategorization
4.3.5 The Role of Lexicon in Parsing
4.3.6 Recursion in Grammar Rules
4.4 Context-Free Grammar (CFG)
4.4.1 What Is Context Free Language (CFL)?
4.4.2 What Is Context Free Grammar (CFG)?
4.4.3 Major Components of CFG
4.4.4 Derivations Using CFG
4.5 CFG Parsing
4.5.1 Morphological Parsing
4.5.2 Phonological Parsing
4.5.3 Syntactic Parsing
4.5.4 Parsing as a Kind of Tree Searching
4.5.5 CFG for Fragment of English
4.5.6 Parse Tree for “Play the Piano” for Prior CFG
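A toy top-down parse of "play the piano". The CFG fragment below is a guess at the book's grammar, not copied from it; trees are nested (label, children) tuples:

```python
# Assumed CFG fragment: nonterminal rules and a lexicon of preterminals.
GRAMMAR = {
    "S": [("VP",)],
    "VP": [("V", "NP")],
    "NP": [("Det", "Nom")],
    "Nom": [("N",)],
}
LEXICON = {"play": "V", "the": "Det", "piano": "N"}

def parse(symbol, words, start):
    """Top-down: expand `symbol` over words[start:]; yield (tree, next_index)."""
    if symbol in LEXICON.values():                 # preterminal: match a word
        if start < len(words) and LEXICON.get(words[start]) == symbol:
            yield (symbol, words[start]), start + 1
        return
    for rhs in GRAMMAR.get(symbol, []):            # nonterminal: try each rule
        def expand(i, pos):
            # Expand rhs[i:] starting at word position pos.
            if i == len(rhs):
                yield [], pos
                return
            for child, nxt in parse(rhs[i], words, pos):
                for rest, end in expand(i + 1, nxt):
                    yield [child] + rest, end
        for children, end in expand(0, start):
            yield (symbol, children), end

# Keep only parses that consume the whole input.
trees = [t for t, end in parse("S", ["play", "the", "piano"], 0) if end == 3]
print(trees[0])
```

The single resulting tree mirrors the parse tree in this section: S dominates VP, which dominates V ("play") and NP ("the piano").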
4.5.7 Top-Down Parser
4.5.8 Bottom-Up Parser
4.5.9 Control of Parsing
4.5.10 Pros and Cons of Top-Down vs. Bottom-Up Parsing
4.6 Lexical and Probabilistic Parsing
4.6.1 Why Using Probabilities in Parsing?
4.6.2 Semantics with Parsing
4.6.3 What Is PCFG?
4.6.4 A Simple Example of PCFG
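A sketch of the PCFG idea with invented rule probabilities (each left-hand side's rules sum to 1): the probability of a parse is the product of the probabilities of the rules used in its derivation.

```python
# Invented PCFG: (LHS, RHS) -> probability; per-LHS probabilities sum to 1.
RULE_PROBS = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.6,
    ("NP", ("N",)): 0.4,
    ("VP", ("V", "NP")): 1.0,
}
LEX_PROBS = {("Det", "the"): 1.0, ("N", "dog"): 0.5, ("N", "cat"): 0.5,
             ("V", "sees"): 1.0}

# Derivation for "the dog sees the cat":
# S -> NP VP, NP -> Det N (twice), VP -> V NP, plus the lexical rules.
derivation = [
    RULE_PROBS[("S", ("NP", "VP"))],
    RULE_PROBS[("NP", ("Det", "N"))], LEX_PROBS[("Det", "the")],
    LEX_PROBS[("N", "dog")],
    RULE_PROBS[("VP", ("V", "NP"))], LEX_PROBS[("V", "sees")],
    RULE_PROBS[("NP", ("Det", "N"))], LEX_PROBS[("Det", "the")],
    LEX_PROBS[("N", "cat")],
]
prob = 1.0
for p in derivation:
    prob *= p          # parse probability = product of rule probabilities
print(prob)            # 0.6 * 0.5 * 0.6 * 0.5 = 0.09
```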
4.6.5 Using Probabilities for Language Modelling
4.6.6 Limitations for PCFG
4.6.7 The Fix–Lexicalized Parsing
5 Meaning Representation
5.1 Introduction
This chapter will focus on how to interpret meaning, introducing scientific and logical methods for processing meaning, known as meaning representation.
5.2 What Is Meaning?
Meaning refers to the message conveyed by words, phrases, and sentences or utterances within a given context.
Often referred to as lexical or semantic meaning.
Meanings of sentences are not simply the combination of words' meanings, but usually the phrasal words with specific meanings at the pragmatic level (off the wagon).
Semantic meaning is the study of meaning assignment to minimal meaning-bearing elements.
5.3 Meaning Representations
Unlike parse trees, meaning representations are not primarily a description of the input structure, but a representation of how humans understand and represent things such as actions, events, and objects.
5 types of meaning representation:
- categories- specific entities
- events
- time- specific moment
- aspects
- Stative- facts
- Activity
- Accomplishment- durative event with a natural endpoint (build a house)
- Achievement- instantaneous change of state (reach the summit)
- beliefs, desires, and intentions
5.4 Semantic Processing
Semantic processing encodes and interprets meaning.
- Reason relations with the environment.
- Answer questions based on contents.
- Perform inference based on knowledge and determine the verity of unknown facts.
5.5 Common Meaning Representation
4 common methods of meaning representation.
5.5.1 First-Order Predicate Calculus (FOPC)
Also known as predicate logic.
Expresses the relationship between information objects as predicates.
5.5.2 Semantic Networks
Knowledge representation techniques used for propositional information. A semantic net can be represented as a labeled directed graph.
5.5.3 Conceptual Dependency Diagram (CDD)
Theory that describes how sentence/utterance meaning is represented for reasoning.
5.5.4 Frame-Based Representation
Frames and notions as basic components to characterize domain knowledge.
A frame is a knowledge configuration to characterize a concept such as a car or driving a car. (like an object with properties)
5.6 Requirements for Meaning Representation
5.6.1 Verifiability
determining whether a sentence/utterance has a literal meaning
5.6.2 Ambiguity
word, statement, or phrase that has more than one meaning
5.6.3 Vagueness
borderline cases (short, tall, ...)
5.6.4 Canonical Forms
5.6.4.1 What Is Canonical Form?
Simplest expression. Instead of 0.000, just say 0.
5.6.4.2 Canonical Form in Meaning Representation
Instead of "Jack likes to eat candy all day.", just say "Jack eats candy."
5.6.4.3 Canonical Forms: Pros and Cons
Advantages: (1) simplify reasoning and storage operations; (2) no need to generate inference rules for all the different variations.
Disadvantage: may complicate semantic analysis.
5.7 Inference
5.7.1 What Is Inference?
deduction and induction
Induction goes from specific to general; Deduction goes from general to specific.
Induction builds theories. Deduction tests them.
5.7.2 Example of Inferencing with FOPC
5.8 Fillmore’s Theory of Universal Cases
Case grammar is a linguistic system that focuses on the connection between a verb's valence (the arguments it takes, such as subject and object) and the grammatical context.
Only a limited number of semantic roles (case roles) occur in every sentence constructed with verbs.
5.8.1 What Is Fillmore’s Theory of Universal Cases?
Each verb needs a certain number of case roles to form a case-frame
Thematic role (semantic role) refers to case role that a noun phrase (NP) may deploy with respect to action or state used by the main verb.
5.8.2 Major Case Roles in Fillmore’s Theory
- Agent—doer of action, attribute intention.
- Experiencer—doer of action without intention.
- Theme—thing that undergoes change or is acted upon.
- Instrument—tool being used to perform the action.
- Beneficiary—person or thing for whose benefit the action is performed.
- To/At/From Loc/Poss/Time—marks location, possession, or time.
5.8.3 Complications in Case Roles
4 types of complications in case role analysis:
- The same syntactic constituent can indicate different semantic roles in different cases
- The same role can be expressed through different syntactic options
- A preposition does not always introduce the same role
- A sentence may offer multiple role options
5.8.3.1 Selectional Restrictions
Selectional restrictions are methods to restrict types of certain roles to be used for semantic consideration.
5.9 First-Order Predicate Calculus
5.9.1 FOPC Representation Scheme
FOPC can be used as a framework to derive semantic representation of a sentence.
FOPC supports:
- Reasoning in truth condition analysis to respond yes or no questions.
- Variables for general cases through variable binding in responses and storage.
- Inference to respond beyond KB storage on new knowledge.
5.9.2 Major Elements of FOPC
- terms
- constants- specific object described in sentence
- functions- concepts expressed as genitives (owner), such as brand name or location
- variables- stand for objects without specifying which object is referred to
- predicates- property of a subject; usually verbs
- connectives- conjunctions, implications(if...then), negations
- quantifiers
5.9.3 Predicate-Argument Structure of FOPC
5.9.4 Meaning Representation Problems in FOPC
5.9.5 Inferencing Using FOPC
Inference- validate or prove whether a proposition is true or false from a KB.
Modus Ponens (MP)- given P and P→Q, infer Q.
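A minimal forward-chaining sketch of Modus Ponens over a toy KB; the facts and rule are invented, and propositions are plain strings:

```python
# KB: a set of known facts and a list of P -> Q implication pairs.
facts = {"Human(Socrates)"}
rules = [("Human(Socrates)", "Mortal(Socrates)")]

# Apply Modus Ponens until no new fact can be derived:
# if P is in the KB and P -> Q is a rule, add Q.
changed = True
while changed:
    changed = False
    for p, q in rules:
        if p in facts and q not in facts:
            facts.add(q)          # Modus Ponens: P, P->Q  |-  Q
            changed = True

print("Mortal(Socrates)" in facts)   # True
```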
Errata
- 5.3- what's the difference between an accomplishment and an achievement?
- 5.8- Case grammar is a linguistic system that focuses on the connection between the quantity(?) such as the subject, object, or valence of a verb and the grammatical context.
- Jack gets a prize. Statement shows Jack is the agent as he is doer to get, the prize is the object being received, so it is a patient. What is patient?