Notes on NLP

38 minute read


Comprehensive study notes covering foundational NLP methods from classical n-gram language models and HMMs through neural architectures including word embeddings, RNNs, and attention mechanisms.

Contents

  1. N-gram LMs revisited / Sequence Tagging: HMMs
  2. Viterbi, Named Entity Recognition (NER)
  3. HMMs for POS tagging (recap)
  4. MEMMs: HMMs vs MEMMs
  5. Text Classification
  6. Word Similarity and Embeddings
  7. Feed-forward Neural Networks (FFNNs)
  8. Training FFNNs
  9. Recurrent Neural Networks (RNNs)
  10. Encoder–Decoder Architecture
  11. Transformers (Work in Progress)
  12. Transformer-based Models (Work in Progress)

N-gram LMs revisited / Sequence Tagging: HMMs

Language Models (LMs)

A language model assigns probabilities to sequences and to next-word predictions:

  • Sequence probability: \(P(w_1 w_2 \dots w_n)\)
  • Next-word distribution: \(P(w_n \mid w_1 w_2 \dots w_{n-1})\)

Vocabulary + dataset

  • Vocabulary (finite set): \(\mathcal{V} = \{\texttt{the}, \texttt{a}, \texttt{man}, \dots\}\)
  • All possible sequences (infinite): \(\mathcal{V}^*\)
  • Training corpus: \(\mathcal{D} = \{\mathbf{w}^{(m)}\}_{m=1}^M\)

Naïve empirical distribution (doesn’t generalize)

A naïve approach is the empirical distribution:

\[P(\mathbf{w}) = \frac{c(\mathbf{w})}{M}\]

This fails to generalize to valid sequences that never appear in the training set.

Chain rule decomposition

Using the chain rule:

\[P(w_1 w_2 \dots w_n) = \prod_{i=1}^{n} P\big(w_i \mid w_1\dots w_{i-1}\big)\]

Example:

\[P(\text{I saw a man}) = P(\text{I})\,P(\text{saw}\mid\text{I})\,P(\text{a}\mid\text{I saw})\,P(\text{man}\mid\text{I saw a})\]

Markov assumption → n-gram LMs

Key idea (Markov assumption): approximate each conditional with only the last \((N-1)\) words.

  • Unigram: \(P(w_i)\)
  • Bigram: \(P(w_i\mid w_{i-1})\)
  • Trigram: \(P(w_i\mid w_{i-2}, w_{i-1})\)

Example query:

  • \[P(\text{lost} \mid \text{Not all those who wander are})\]
    • unigram: \(P(\text{lost})\)
    • bigram: \(P(\text{lost}\mid\text{are})\)
    • trigram: \(P(\text{lost}\mid\text{wander are})\)

Bigram sequence probability

With implicit start token \(\langle s \rangle\):

\[P(w_1\dots w_n) \approx \prod_{i=1}^{n} P\big(w_i \mid w_{i-1}\big),\quad w_0 := \langle s \rangle\]

Learning an n-gram model (MLE / “raw counts”)

For a bigram LM, MLE estimates:

\[\hat P(w_i\mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}\]

You need:

  • unigram counts \(c(w)\)
  • bigram counts \(c(w_{i-1}, w_i)\)

Worked toy example

Training corpus:

<s> I get what I eat and
I eat what I get </s>

From counts, the slide derives:

\[P(\text{what} \mid \text{get}) = \tfrac{1}{2}\]

And:

\[P(\langle s\rangle\;\text{I get what}) = P(\text{I}\mid\langle s\rangle)\,P(\text{get}\mid\text{I})\,P(\text{what}\mid\text{get}) = 1\cdot\frac{2}{4}\cdot\frac{1}{2} = 0.25\]

Note: the model can assign non-zero probability to some unseen sentences, but any unseen bigram makes the full sequence probability \(0\) → motivates smoothing.
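The toy counts above can be reproduced in a few lines. A minimal sketch (not from the slides), treating </s> as an ordinary token:

```python
from collections import Counter

# Toy corpus from above; <s> is the start token, </s> an ordinary end token.
tokens = "<s> I get what I eat and I eat what I get </s>".split()

unigrams = Counter(tokens[:-1])              # context counts c(w_{i-1})
bigrams = Counter(zip(tokens, tokens[1:]))   # pair counts c(w_{i-1}, w_i)

def p(word, prev):
    """MLE bigram probability P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

# P(<s> I get what) = P(I|<s>) * P(get|I) * P(what|get) = 1 * 2/4 * 1/2
prob = p("I", "<s>") * p("get", "I") * p("what", "get")
```

Note the last token is excluded from the context counts, since nothing is conditioned on </s>.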

Practical issues when building n-gram LMs

Types vs. tokens

  • Tokens = word instances in running text
  • Types = distinct word forms in the vocabulary

Counting & normalization choices

  • Tokenization decisions (punctuation, contractions)
  • Language-dependent tokenization
  • Text normalization (e.g., case-folding) depends on downstream task

Engineering tips

  • Compute in log space to avoid underflow and to replace products with sums.
  • If space is an issue, don’t explicitly store 0-count n-grams; treat missing entries as count 0.

Generating text by sampling from an LM

For a bigram LM:

  1. Sample a bigram starting with \(\langle s\rangle\); output the second word.
  2. Repeat: sample a bigram conditioned on the last generated word.
  3. Stop at end token or another stopping criterion.
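The three steps above can be sketched with the same toy corpus (my own minimal implementation, sampling next words in proportion to bigram counts):

```python
import random
from collections import Counter, defaultdict

tokens = "<s> I get what I eat and I eat what I get </s>".split()

# Next-word count tables: nxt[prev][cur] = c(prev, cur)
nxt = defaultdict(Counter)
for prev, cur in zip(tokens, tokens[1:]):
    nxt[prev][cur] += 1

def sample_sentence(max_len=20, seed=0):
    """Sample words from the bigram model until </s> (or max_len)."""
    rng = random.Random(seed)
    word, out = "<s>", []
    while len(out) < max_len:
        dist = nxt[word]
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        if word == "</s>":
            break
        out.append(word)
    return out

sentence = sample_sentence()
```

Because counts are used as sampling weights, each step draws from the MLE conditional distribution without explicit normalization.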

Part-of-Speech (POS) tagging

Goal: assign a POS tag to each token in a sentence (sequence tagging).

Example:

The/DT planet/NN Jupiter/NNP and/CC its/PRP$ moons/NNS are/VBP in/IN effect/NN a/DT mini-solar/JJ system/NN ./.

Why it’s hard:

  • Many common words are ambiguous (e.g., book noun vs verb)
  • Ambiguity is frequent in running text

Quick notes

  • Penn Treebank tagset has 45 tags (example shown on slides).
  • A strong baseline: choose the most frequent tag per word (~92% accuracy).
  • SOTA methods reach ~98% accuracy.

HMMs for sequence tagging (intro)

We want the best tag sequence \(t_1\dots t_n\) for words \(w_1\dots w_n\):

\[\arg\max_{t_1\dots t_n} P(t_1\dots t_n \mid w_1\dots w_n)\]

Using Bayes (denominator ignored for argmax):

\[\arg\max_{t} P(w_1\dots w_n \mid t_1\dots t_n)\,P(t_1\dots t_n)\]

Assumptions:

  • Tag sequence model (Markov / n-gram): e.g. bigram \(P(t_1\dots t_n) \approx \prod_{i=1}^n P(t_i\mid t_{i-1})\)
  • Emissions (conditional independence): \(P(w_1\dots w_n\mid t_1\dots t_n) \approx \prod_{i=1}^n P(w_i\mid t_i)\)

So the score factors into transition and emission probabilities.

Brute-force decoding is impossible

If sentence length \(m=20\) and tagset size \(T=15\), then there are \(T^m = 15^{20}\) possible tag sequences → too many to enumerate. Next class: efficient decoding (Viterbi).

Viterbi, Named Entity Recognition (NER)

HMMs for tagging (POS or BIO tags)

Goal: given a word sequence \(w_{1:n}\), find the most likely tag sequence \(t_{1:n}\):

\[\hat{t}_{1:n} = \arg\max_{t_{1:n}} P(t_{1:n} \mid w_{1:n})\]

Using Bayes (dropping the constant \(P(w_{1:n})\)):

\[\hat{t}_{1:n} = \arg\max_{t_{1:n}} P(w_{1:n} \mid t_{1:n})\,P(t_{1:n})\]

Independence + Markov assumptions (HMM)

  • Emission independence: each word depends only on its own tag
    • \[P(w_{1:n} \mid t_{1:n}) \approx \prod_{i=1}^{n} P(w_i \mid t_i)\]
  • Tag Markov assumption (bigram): each tag depends only on the previous tag
    • \[P(t_{1:n}) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})\]

Putting these together, a common HMM score for a path \(t_{1:n}\) is:

\[\text{score}(t_{1:n}) = \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})\]

(Include boundary tags like \(\langle s \rangle\) and \(\langle /s \rangle\) in practice.)


Why we need Viterbi

A brute-force search enumerates all tag sequences: if there are \(c\) tags and sentence length \(n\), that’s \(c^n\) sequences—impossible for realistic settings.


Viterbi algorithm (dynamic programming)

Key idea: because of the Markov assumption, the best path to a state at time \(i\) only depends on the best paths to previous states at time \(i-1\).

Let \(v_i(j)\) be the score (probability) of the best tag sequence for \(w_{1:i}\) that ends in tag \(j\) at position \(i\).

Initialization

For the first token \(w_1\):

\[v_1(j) = P(t_1=j \mid \langle s \rangle)\, P(w_1 \mid t_1=j)\]

Store a backpointer \(bp_1(j)=\langle s \rangle\).

Recurrence

For \(i \ge 2\):

\[v_i(j) = \Big(\max_{k \in \mathcal{T}} v_{i-1}(k)\,P(t_i=j \mid t_{i-1}=k)\Big)\,P(w_i \mid t_i=j)\]

Backpointer:

\[bp_i(j) = \arg\max_{k \in \mathcal{T}} v_{i-1}(k)\,P(t_i=j \mid t_{i-1}=k)\]

Termination + backtrace

Pick the best final tag and follow backpointers to recover the full \(\hat{t}_{1:n}\). With explicit end state:

\[\hat{t}_n = \arg\max_{j \in \mathcal{T}} v_n(j)\,P(\langle /s \rangle \mid t_n=j)\]

Then backtrace: \(\hat{t}_{i-1} = bp_i(\hat{t}_i)\).

Complexity

  • Time: \(O(c^2 n)\) for the forward pass + \(O(n)\) for backtrace
  • Space: typically store a \(c \times n\) score table and a \(c \times n\) backpointer table (or optimize memory if you only need the best score)
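The initialization, recurrence, and backtrace above can be sketched directly; the toy transition/emission numbers below are made up for illustration:

```python
import math

def viterbi(words, tags, trans, emit):
    """Decode the best tag path. trans[(t_prev, t)] and emit[(t, w)] are
    probabilities; "<s>"/"</s>" are boundary tags. Works in log space."""
    lg = lambda p: math.log(p) if p > 0 else float("-inf")
    # Initialization: v_1(j) and an empty backpointer row
    v = [{t: lg(trans.get(("<s>", t), 0)) + lg(emit.get((t, words[0]), 0))
          for t in tags}]
    bp = [{}]
    # Recurrence over positions 2..n
    for w in words[1:]:
        scores, ptrs = {}, {}
        for t in tags:
            k = max(tags, key=lambda k: v[-1][k] + lg(trans.get((k, t), 0)))
            scores[t] = v[-1][k] + lg(trans.get((k, t), 0)) + lg(emit.get((t, w), 0))
            ptrs[t] = k
        v.append(scores)
        bp.append(ptrs)
    # Termination with explicit end state, then backtrace
    last = max(tags, key=lambda t: v[-1][t] + lg(trans.get((t, "</s>"), 0)))
    path = [last]
    for ptrs in reversed(bp[1:]):
        path.append(ptrs[path[-1]])
    return path[::-1]

# Tiny 2-tag example (made-up probabilities):
trans = {("<s>", "N"): 0.8, ("<s>", "V"): 0.2, ("N", "V"): 0.7, ("N", "N"): 0.3,
         ("V", "N"): 0.6, ("V", "V"): 0.4, ("N", "</s>"): 0.5, ("V", "</s>"): 0.5}
emit = {("N", "dogs"): 0.6, ("V", "dogs"): 0.1, ("N", "bark"): 0.1, ("V", "bark"): 0.7}
path = viterbi(["dogs", "bark"], ["N", "V"], trans, emit)
```

Working in log space follows the earlier engineering tip: products become sums and underflow is avoided.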

HMMs as generative models (intuition)

You can view an HMM as a sentence generator:

  1. Start at \(\langle s \rangle\)
  2. Choose next hidden state/tag via \(P(t_i \mid t_{i-1})\)
  3. Emit word \(w_i\) via \(P(w_i \mid t_i)\)
  4. Repeat until \(\langle /s \rangle\)

Where do transition / emission probabilities come from?

Assume labeled training data gives the “true” tag for each word (and you add boundary tags if missing).

Maximum-likelihood estimates from counts (“raw count” method)

Emission:

\[P(w \mid c) = \frac{\text{Count}(c, w)}{\text{Count}(c)}\]

Transition:

\[P(c' \mid c) = \frac{\text{Count}(c, c')}{\text{Count}(c)}\]

Smoothing (add-\(k\))

“Lack of evidence is not evidence of lack”: unseen events shouldn’t always have probability 0.

A common approach is add-\(k\) smoothing:

\[P(b \mid a) = \frac{\text{Count}(a,b) + k}{\sum_{b'} (\text{Count}(a,b') + k)}\]
  • \(k=1\) is Laplace smoothing; smaller \(k\) is often used in practice.
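A one-line sketch of the add-\(k\) formula, with made-up counts:

```python
def add_k(count_ab, count_a, num_outcomes, k=1.0):
    """Add-k smoothed P(b|a); the denominator is the sum over all b' of
    (Count(a,b') + k) = Count(a) + k * |outcomes|."""
    return (count_ab + k) / (count_a + k * num_outcomes)

# Context "a" seen 10 times over a 5-outcome space, with counts [6, 4, 0, 0, 0]:
counts = [6, 4, 0, 0, 0]
probs = [add_k(c, sum(counts), len(counts)) for c in counts]
```

Unseen events now get small but non-zero probability, and the smoothed values still sum to 1.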

Named Entity Recognition (NER)

Task: identify spans of text that are named entities (people, organizations, locations, etc.) and assign a type.

Challenges:

  • Multi-word entities (span detection + classification)
  • Ambiguity (e.g., “Washington” can be a person, location, organization, etc.)

BIO tagging scheme for NER

Convert span detection into per-token tagging:

  • \(\text{B-XXX}\): beginning of an entity of type \(\text{XXX}\)
  • \(\text{I-XXX}\): inside an entity of type \(\text{XXX}\)
  • \(\text{O}\): outside any entity

If there are \(n\) entity types, BIO has \(2n+1\) tags (\(n\) B-tags, \(n\) I-tags, and O).
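Span reconstruction from BIO tags is mechanical; a small helper (my own sketch, not from the slides) that returns (type, start, end-exclusive) triples:

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (type, start, end_exclusive) spans."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:           # close any open span first
                spans.append((etype, start, i))
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is not None and etype == tag[2:]:
            continue                        # span continues
        else:                               # "O" or an I- tag that doesn't match
            if start is not None:
                spans.append((etype, start, i))
            start, etype = None, None
    if start is not None:                   # span running to end of sequence
        spans.append((etype, start, len(tags)))
    return spans
```

For example, ["B-PER", "I-PER", "O", "B-LOC"] yields one PER span covering tokens 0–1 and one LOC span at token 3.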


HMMs for NER (BIO tagger)

Use the same HMM machinery as POS tagging:

  • States \(Q\): BIO tags
  • Observations \(O\): word tokens
  • Transitions \(A\): \(P(\text{BIO}_i \mid \text{BIO}_{i-1})\)
  • Emissions \(B\): \(P(w_i \mid \text{BIO}_i)\)

Then run Viterbi to find the most likely BIO sequence and reconstruct entity spans from the BIO tags.


Takeaways

  • Viterbi gives an efficient way to decode the best tag sequence under an HMM.
  • HMM parameters come from labeled data via count-based estimates (with smoothing).
  • NER can be framed as sequence labeling using BIO tags, and decoded with the same Viterbi algorithm.

1) HMMs for POS tagging (recap)

We model a sentence as an HMM (a probabilistic finite-state machine):

  • States: POS tags (e.g., N, V, ART, P)
  • Transitions: bigram probabilities over tags \(P(t_i \mid t_{i-1})\)
  • Emissions: probability of word given tag \(P(w_i \mid t_i)\)

Decoding objective: Find the tag sequence \(t_{1:N}\) that maximizes the posterior: \(\arg\max_{t_{1:N}} P(t_{1:N} \mid w_{1:N})\)

With the HMM factorization, we equivalently maximize: \(\arg\max_{t_{1:N}} \prod_{i=1}^{N} P(t_i \mid t_{i-1})\,P(w_i \mid t_i)\)


2) Why Viterbi is efficient

Naïvely, we could enumerate every tag sequence (e.g., all assignments for: “students need another break”), but that is exponential in sentence length.

Key idea: the Markov assumption means each decision only depends on the previous tag, so we can do dynamic programming:

  • sweep left-to-right
  • keep only the best partial path ending in each tag at each time step
  • store backpointers to reconstruct the best final sequence

3) The Viterbi algorithm (HMM)

Let:

  • \(J\) = number of states (tags)
  • \(T\) = length of observation sequence (words)
  • \(a_{ij} = P(t=j \mid t=i)\) transition prob
  • \(b_j(o_t) = P(o_t \mid t=j)\) emission prob
  • \(q_0\) start, \(q_F\) end

Step 1 — Initialization

For the first observation \(o_1\): \(v_1(j) = a_{0j}\,b_j(o_1)\) \(vb_1(j) = 0 \quad (1 \le j \le J)\)

Interpretation: the best score of a path that ends in tag \(j\) at position 1.

Step 2 — Recursion (dynamic programming)

For \(t = 2,\dots,T\): \(v_t(j) = \max_{1 \le i \le J} \big[v_{t-1}(i)\,a_{ij}\,b_j(o_t)\big]\) \(vb_t(j) = \arg\max_{1 \le i \le J} \big[v_{t-1}(i)\,a_{ij}\,b_j(o_t)\big]\)

  • \(v_t(j)\) stores the best probability of reaching state \(j\) after seeing \(o_{1:t}\)
  • \(vb_t(j)\) stores the backpointer (which previous state \(i\) achieved that max)

Step 3 — Termination

\(v_T(F) = \max_{1 \le i \le J} \big[v_T(i)\,a_{iF}\big]\) The best final state index is: \(vb_T(F) = \arg\max_{1 \le i \le J} \big[v_T(i)\,a_{iF}\big]\)

Then we reconstruct the best tag sequence by following backpointers from \(q_F\) backwards.

MEMMs recap: HMMs vs MEMMs

HMM tagger (generative)

Given observations \(o_{1:N}\) and tags \(t_{1:N}\), decode with: \(\arg\max_{t_{1:N}} P(t_{1:N} \mid o_{1:N})\) Using the HMM factorization (first-order Markov tags + emissions): \(\arg\max_{t_{1:N}} \prod_i P(o_i \mid t_i)\, P(t_i \mid t_{i-1})\)

Intuition

  • HMMs use emissions \(P(o_i \mid t_i)\) and transitions \(P(t_i \mid t_{i-1})\).
  • If a word is unseen in training, emission probabilities can be uninformative (or zero without smoothing).

MEMM tagger (discriminative)

MEMM assumption:

  • tag \(t_i\) depends on \(t_{i-1}\), but may condition on the entire input \(o_{1:N}\).

Decode: \(\arg\max_{t_{1:N}} \prod_i P_{\text{MEMM}}(t_i \mid t_{i-1}, o_{1:N})\)

Why MEMMs can beat HMMs

  • Can condition on whole input and incorporate global cues.
  • Can use features that look at the token and its context.

Example: “The plot of the movie was discombobulating.”

  • even if “discombobulating” is unseen, features like “ends in -ing” can help predict a POS tag.

Features + evidence weights

A feature function \(f_k(t_i, t_{i-1}, o_{1:N}, i)\) fires when a condition holds, e.g.

  • \(f_1 = 1\) if current token ends with “ing”
  • \(f_2 =\) length of token

Each tag \(t\) has learned weights \(w_k^{(t)}\) that indicate how strong the feature’s evidence is for that tag.


MEMM local probability (multinomial logistic regression form)

For a candidate tag \(t_i\): \(P_{\text{MEMM}}(t_i \mid t_{i-1}, o_{1:N}) = \frac{\exp\left(\sum_k w_k^{(t_i)}\, f_k(t_i, t_{i-1}, o_{1:N}, i)\right)}{Z}\) where \(Z\) normalizes over all possible tags at position \(i\).

Tiny numeric example

If features for “sitting” are \(f_1=1\) (ends in “ing”) and \(f_2=7\) (length):

  • weights for VERB: \([3, 0]\)
  • weights for PREP: \([-1, -2]\)

Then: \(P(t_i=\text{VERB}\mid\cdot) \propto \exp(3\cdot 1 + 0\cdot 7)\) \(P(t_i=\text{PREP}\mid\cdot) \propto \exp((-1)\cdot 1 + (-2)\cdot 7)\)
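Plugging the toy numbers in and normalizing (a quick sketch):

```python
import math

feats = [1, 7]                                   # f1 = ends in "ing", f2 = length
weights = {"VERB": [3, 0], "PREP": [-1, -2]}     # toy weights from the example

# Unnormalized scores exp(w . f), then normalize over candidate tags
scores = {t: math.exp(sum(w * f for w, f in zip(ws, feats)))
          for t, ws in weights.items()}
Z = sum(scores.values())
probs = {t: s / Z for t, s in scores.items()}
```

Here exp(3) vastly outweighs exp(-15), so VERB gets nearly all the probability mass.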


HMMs vs MEMMs via Viterbi

Viterbi recurrence for an HMM: \(v_i(t) = \max_{\text{prev }t}\; P_{\text{HMM}}(t\mid \text{prev }t)\; P_{\text{HMM}}(o_i\mid t)\; v_{i-1}(\text{prev }t)\)

Viterbi-style recurrence for an MEMM: \(v_i(t) = \max_{\text{prev }t}\; P_{\text{MEMM}}(t\mid \text{prev }t, o_{1:N})\; v_{i-1}(\text{prev }t)\)

Key difference:

  • MEMM uses a classifier, so there are no explicit emission/transition matrices.

Text classification (setup)

Formally:

  • input: text \(x\)
  • output: label \(y\) from a finite set
  • goal: learn \(P(y\mid x)\)

Examples:

  • sentiment: \(y\in\{\text{positive},\text{negative}\}\)
  • spam detection: \(y\in\{\text{spam},\text{not-spam}\}\)

Binary logistic regression model

Binary labels: \(y \in \{0,1\}\). Define a score:

\[z = \sum_{i=1}^{|f|} w_i f_i\]

Probabilities:

\[P(y=1\mid x) = \frac{e^z}{1+e^z}\] \[P(y=0\mid x) = \frac{1}{1+e^z}\]

Logistic (sigmoid) function

\[\sigma(z)=\frac{e^z}{1+e^z}=\frac{1}{1+e^{-z}}\]

Properties:

  • \(\sigma(z)\in [0,1]\)
  • \(\sigma(0)=\frac{1}{2}\)

Example: sentiment features → probability

Example input: “The movie was great”.

1) Extract features \(\mathbf{f}\) (toy example):

  • bias \(f_0=1\)
  • \(\#\text{words}\), \(\#\text{positive words}\), \(\#\text{negative words}\), etc.

2) Dot product: \(z = \sum_i f_i w_i\)

3) Convert to probability: \(P(y=1\mid x) = \sigma(z)\)

Example shown in slides (toy values):

  • \(z=3\) → \(\sigma(3)\approx 0.95\)

For “The movie was okay”, the toy example yields \(z=0\), so: \(P(y=1\mid x)=0.5,\quad P(y=0\mid x)=0.5\)


Learning weights (maximum likelihood)

Given training examples \((x_j, y_j)\), we want \(w\) that maximizes likelihood: \(w_{\text{MLE}} = \arg\max_w \prod_{j=1}^{N} P(y=y_j\mid x_j; w)\)

In log space (equivalently maximize log-likelihood): \(w_{\text{MLE}} = \arg\max_w \sum_{j=1}^{N} \log P(y_j\mid x_j; w)\)

Equivalently minimize negative log-likelihood (log loss): \(w_{\text{MLE}} = \arg\min_w \sum_{j=1}^{N} -\log P(y_j\mid x_j; w)\)

Optimization via SGD

Update each weight with learning rate \(\alpha\): \(w_i^{(t+1)} = w_i^{(t)} - \alpha\,\frac{\partial L_j}{\partial w_i}\)

A key gradient identity: \(\frac{\partial L_j}{\partial w_i} = \left(\sigma\Big(\sum_k w_k f_k^{(j)}\Big) - y_j\right) f_i^{(j)}\)
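The update rule is short in code. A minimal sketch that repeatedly fits one made-up positive example, using the fact that the log-loss gradient w.r.t. \(w_i\) is \((\sigma(z)-y_j)\,f_i^{(j)}\):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(w, feats, y, alpha=0.1):
    """One SGD step on the log loss for a single example (feats, y)."""
    z = sum(wi * fi for wi, fi in zip(w, feats))
    err = sigmoid(z) - y                         # dL_j/dz
    return [wi - alpha * err * fi for wi, fi in zip(w, feats)]

feats = [1.0, 2.0, 0.5]                          # f_0 = 1 is the bias feature
w = [0.0, 0.0, 0.0]
for _ in range(100):
    w = sgd_step(w, feats, 1)                    # a single positive example
p = sigmoid(sum(wi * fi for wi, fi in zip(w, feats)))
```

After repeated updates on the positive example, the predicted probability climbs well above 0.5.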


Takeaways

  • Feature engineering matters.
  • Logistic regression loss is convex → one global minimum.
  • Learn \(w\) by maximizing log-likelihood (or minimizing negative log-likelihood).
  • Standard libraries (e.g., scikit-learn) can train logistic regression models.

Recap: Binary logistic regression

Setup

  • Input text \(x\)
  • Label \(y \in \{0,1\}\)
  • Feature engineering: \(f_0 = 1\) (bias), plus task-specific features \(f_1,\dots,f_K\)
  • Weights: \(w = [w_0, w_1, \dots, w_K]\)

Model

Define score: \(z = \sum_i w_i f_i(x)\)

Then: \(P(y=1\mid x) = \frac{e^z}{1+e^z}\) \(P(y=0\mid x) = \frac{1}{1+e^z}\)

Learning (SGD on negative log likelihood)

Goal: \(w_{\text{MLE}} = \arg\min_w \sum_{j=1}^{N} -\log P(y_j\mid x_j; w)\)

SGD update for a single example \((x_j, y_j)\) with learning rate \(\alpha\): \(w_i^{(t+1)} = w_i^{(t)} - \alpha\,\frac{\partial L_j}{\partial w_i}\)


Word similarity as a practical NLP concept

Motivating QA example:

  • Q: “How tall is Mt. Everest?”
  • A1: “Mt. Everest is 29029 feet high.” (relevant)
  • A2: “Mt. Everest is 1000000 years old.” (not relevant)

This requires a notion of semantic similarity between words/sentences.

What should “similarity” capture?

  • Synonymy: same meaning in some/all contexts (e.g., tall/high)
  • Antonymy: opposite along some dimension (e.g., dark/light)
  • More general similarity: shares some elements of meaning (car/bicycle), with degrees (cow closer to chicken than tiger).

Word vectors and the distributional hypothesis

Word vectors

Represent each word type as a \(d\)-dimensional vector \(x_{\text{word}} \in \mathbb{R}^d\). Then word similarity becomes vector similarity.

Distributional hypothesis (Firth, 1957)

“You shall know a word by the company it keeps.”

Words that occur in similar contexts tend to have similar meaning.

Example in slides: “tesgüino” appears near contexts like “bottle of ___”, “makes you drunk”, “out of corn” → likely an alcoholic drink.


Co-occurrence vectors (sparse) and cosine similarity

Word–word co-occurrence matrix

Classic approach: represent a word by counts of nearby context words.

Properties:

  • Dimension roughly \(|V|\) (often \(10\text{k} \)–\(50\text{k}\))
  • Very sparse

Cosine similarity

Given vectors \(\vec v\) and \(\vec w\):

\[\cos(\vec v, \vec w) = \frac{\vec v \cdot \vec w}{\lVert \vec v \rVert\,\lVert \vec w \rVert} = \frac{\sum_i v_i w_i}{\sqrt{\sum_i v_i^2}\,\sqrt{\sum_i w_i^2}}\]

Interpretation:

  • \(1\): same direction
  • \(0\): orthogonal
  • \(-1\): opposite direction
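A direct sketch of the formula:

```python
import math

def cosine(v, w):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)
```

Scaling a vector doesn't change its direction, so cosine([1, 2], [2, 4]) is 1, while orthogonal vectors score 0 and opposite vectors score -1.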

Issue: raw frequency counts overweight stopwords

Very frequent words like “a”, “the”, “it” co-occur with almost everything and can dominate cosine similarity.


TF–IDF

Term frequency (log-scaled)

Let \(\mathrm{count}(t,d)\) be occurrences of word \(t\) in document \(d\): \(\mathrm{tf}_{t,d} = \begin{cases} 1 + \log_{10}(\mathrm{count}(t,d)) & \text{if } \mathrm{count}(t,d) > 0 \\ 0 & \text{otherwise} \end{cases}\)

Inverse document frequency

Let \(N\) be #documents and \(\mathrm{df}_t\) be #documents containing \(t\): \(\mathrm{idf}_t = \log_{10}\left(\frac{N}{\mathrm{df}_t}\right)\)

TF–IDF weight

\(w_{t,d} = \mathrm{tf}_{t,d}\,\mathrm{idf}_t\)

Words with low IDF: very common words that appear in many documents.
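A minimal sketch with a made-up three-document corpus, using base-10 logs throughout:

```python
import math

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "the", "cat"]]
N = len(docs)

def tf(term, doc):
    """Log-scaled term frequency."""
    c = doc.count(term)
    return 1 + math.log10(c) if c > 0 else 0.0

def idf(term):
    """Inverse document frequency over the toy corpus."""
    df = sum(1 for d in docs if term in d)
    return math.log10(N / df)

def tfidf(term, doc):
    return tf(term, doc) * idf(term)
```

"the" appears in every document, so its IDF (and hence its TF–IDF weight) is 0, which is exactly the stopword-downweighting effect described above.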


Dense word vectors (embeddings)

Dense embeddings:

  • Lower dimensional than TF–IDF (e.g., hundreds to thousands)
  • Not sparse
  • Dimensions typically not directly interpretable

Goal: learn word embeddings such that semantic similarity corresponds to vector similarity.

Approaches introduced:

  • Skip-gram
  • CBOW (mentioned)

Word2Vec: Skip-gram with Negative Sampling (SGNS)

Intuition: a binary prediction task

Instead of counting co-occurrence directly, train a classifier:

Given target word \(w\) and candidate context word \(c\):

  • positive pair \((w,c)\): observed within a context window
  • negative pair \((w,n)\): randomly sampled “noise” word \(n\) from the vocabulary

Model: \(P(+\mid w,c) = \sigma(\vec c \cdot \vec w) = \frac{1}{1+\exp(-\vec c\cdot\vec w)}\) and: \(P(-\mid w,c) = 1 - P(+\mid w,c)\)

Objective (one target word)

With \(k\) negative samples \(n_1,\dots,n_k\):

\[L(\theta) = \log P(+\mid w,c) + \sum_{i=1}^{k} \log P(-\mid w, n_i)\]

Expanded into sigmoid form (common SGNS expression):

\[L(\theta) = \log \sigma(\vec c\cdot \vec w) + \sum_{i=1}^{k} \log \sigma(-\vec n_i \cdot \vec w)\]

Gradients

Let \(\vec c_{pos}\) be the positive context vector and \(\vec c_{neg}\) a negative context vector. Taking the loss \(L\) as the negative of the objective above (so we minimize), the gradients are:

\[\frac{\partial L}{\partial \vec c_{pos}} = \big(\sigma(\vec c_{pos}\cdot\vec w) - 1\big)\,\vec w\] \[\frac{\partial L}{\partial \vec c_{neg}} = \sigma(\vec c_{neg}\cdot\vec w)\,\vec w\] \[\frac{\partial L}{\partial \vec w} = \big(\sigma(\vec c_{pos}\cdot\vec w) - 1\big)\,\vec c_{pos} + \sum_{i=1}^{k} \sigma(\vec c_{neg,i}\cdot\vec w)\,\vec c_{neg,i}\]
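One SGNS update with these gradients can be sketched on random vectors (treating the negated objective as the loss to minimize; the dimensions and step size here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 2
w = rng.normal(scale=0.1, size=d)             # target word vector
c_pos = rng.normal(scale=0.1, size=d)         # observed context vector
c_negs = rng.normal(scale=0.1, size=(k, d))   # k noise context vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(w, c_pos, c_negs):
    """Negative of the SGNS objective (a loss to minimize)."""
    return -(np.log(sigmoid(c_pos @ w)) + np.log(sigmoid(-c_negs @ w)).sum())

eta = 0.5
before = loss(w, c_pos, c_negs)

# Gradients from the formulas above (computed before any update)
g_pos = (sigmoid(c_pos @ w) - 1.0) * w
g_negs = sigmoid(c_negs @ w)[:, None] * w
g_w = (sigmoid(c_pos @ w) - 1.0) * c_pos \
      + (sigmoid(c_negs @ w)[:, None] * c_negs).sum(axis=0)

w = w - eta * g_w
c_pos = c_pos - eta * g_pos
c_negs = c_negs - eta * g_negs
after = loss(w, c_pos, c_negs)
```

A single gradient step should reduce the loss, i.e., push \(\vec c_{pos}\cdot\vec w\) up and each \(\vec n_i\cdot\vec w\) down.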

What vector represents the word?

Two options are common in Word2Vec implementations:

  • keep just \(\vec w\) (target embedding)
  • or concatenate / combine \(\vec w\) and \(\vec c\) embeddings for a final representation

What embeddings can encode

Analogies

Embeddings often support linear analogies: \(\mathrm{vec}(\text{king}) - \mathrm{vec}(\text{man}) + \mathrm{vec}(\text{woman}) \approx \mathrm{vec}(\text{queen})\)

\[\mathrm{vec}(\text{Paris}) - \mathrm{vec}(\text{France}) + \mathrm{vec}(\text{Italy}) \approx \mathrm{vec}(\text{Rome})\]

Biases

Embeddings can reflect societal biases present in training data, e.g.: \(\mathrm{vec}(\text{doctor}) - \mathrm{vec}(\text{father}) + \mathrm{vec}(\text{mother}) \approx \mathrm{vec}(\text{nurse})\) and other stereotyped relations


Feed-forward neural networks (FFNNs)

The problem: generic word vectors might not be optimal

Generic pretrained embeddings (e.g., word2vec) capture broad distributional similarity, but they are not guaranteed to separate words in a way that is optimal for a specific downstream task.

  • For NER, we’d like embeddings for LOC words to cluster away from PER words (and other entity types) if that improves tagging performance.
  • A neural network can learn a task-specific representation by transforming the generic embedding into a hidden representation that is tuned to the labels.

Our LOC/PER example

Consider a token-level classification setting (e.g., NER):

  • Input: a word \(w\) (or a small context window around \(w\))
  • Output: a distribution over tags/classes such as \(\{\mathrm{LOC}, \mathrm{PER}, \mathrm{ORG}, \mathrm{O}, \dots\}\)

A simple neural approach:

  1. Start from the word’s embedding \(x\) (e.g., word2vec vector)
  2. Compute a hidden representation \(h\) (learned transform of \(x\))
  3. Predict class probabilities \(y\) from \(h\)

Feedforward network

A feed-forward neural network (FFNN) maps input \(x\) to output \(y\) through one or more layers:

  • Input layer: word vector \(x \in \mathbb{R}^{n_0}\)
  • Hidden layer: \(h \in \mathbb{R}^{n_1}\)
  • Output layer: \(y \in \mathbb{R}^{C}\) (probabilities over \(C\) classes)

With one hidden layer: \(h = g(Wx + b)\) \(y = \mathrm{softmax}(Uh)\) where:

  • \(W \in \mathbb{R}^{n_1 \times n_0}\), \(b \in \mathbb{R}^{n_1}\)
  • \(U \in \mathbb{R}^{C \times n_1}\)

Fully-connected FFNN

“Fully-connected” means each unit in a layer connects to all units in the previous layer.

  • Each hidden node \(h_i\) depends on every input dimension: \(h_i = g(w_i^\top x + b_i)\)
  • Each output class score depends on every hidden node: \(z = Uh\), then \(y = \mathrm{softmax}(z)\)

The bias term

Bias lets the model learn “thresholds” or offsets.

Hidden layer: \(a = Wx + b,\quad h = g(a)\)

A common trick: append a constant 1 to the input and treat bias as just another weight. If \(x' = [x; 1]\), then \(Wx + b\) becomes \(W' x'\).

Sigmoid activation function

Sigmoid maps \(\mathbb{R} \to (0,1)\): \(\sigma(s)=\frac{1}{1+e^{-s}}\)

Properties:

  • squashes large positive values toward 1 and large negative values toward 0
  • smooth and differentiable (useful for gradient-based learning)

Derivative (often used in backprop): \(\sigma'(s)=\sigma(s)\bigl(1-\sigma(s)\bigr)\)

Other activation functions

Common alternatives:

  • \(\tanh\) (range \((-1,1)\)): \(\tanh(s)=\frac{e^s-e^{-s}}{e^s+e^{-s}}\)
  • ReLU: \(\mathrm{ReLU}(s)=\max(s,0)\)

Alternate activation functions \(g(\cdot)\)

We typically write hidden computations generically as: \(h = g(Wx+b)\) The choice of \(g(\cdot)\) affects:

  • training dynamics (vanishing gradients vs. stable gradients)
  • representation capacity
  • sparsity (ReLU)

Switch to matrix computations!

Use vector/matrix form (fast, clean, and how libraries implement it):

  • Input is a column vector: \(x = [x_1,\dots,x_{n_0}]^\top\)
  • Hidden: \(h = [h_1,\dots,h_{n_1}]^\top\)
  • Output: \(y = [y_1,\dots,y_C]^\top\)

Compute: \(a = Wx + b,\quad h = g(a)\) \(z = Uh,\quad y = \mathrm{softmax}(z)\)
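These four lines map almost directly onto code. A minimal NumPy sketch with arbitrary sizes, random weights, and \(g=\tanh\):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
n0, n1, C = 4, 5, 3                # input, hidden, and output sizes
W, b = rng.normal(size=(n1, n0)), np.zeros(n1)
U = rng.normal(size=(C, n1))

x = rng.normal(size=n0)            # e.g., a word embedding
a = W @ x + b                      # hidden pre-activation
h = np.tanh(a)                     # hidden layer, g = tanh
y = softmax(U @ h)                 # probabilities over C classes
```

The softmax output is a valid distribution: every entry is positive and the entries sum to 1.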

Back to the original picture: from \(x\) to \(h\)

Interpretation:

  • \(W\) “re-mixes” the embedding dimensions of \(x\)
  • \(g(\cdot)\) introduces nonlinearity
  • \(h\) becomes a task-tuned representation

Concretely: \(h_i = g(w_i^\top x + b_i)\)

From \(h\) to \(y = \mathrm{softmax}(Uh)\)

Convert scores \(z = Uh\) into a probability distribution: \(y_i=\mathrm{softmax}(z_i)=\frac{e^{z_i}}{\sum_{k=1}^{C} e^{z_k}}\)

Interpretation:

  • \(y_i\) is the model’s probability of class \(i\) given the input (e.g., \(P(\mathrm{ORG}\mid w)\)).

So where do the weights come from?

We learn \(W,U\) (and biases) from labeled data:

  1. Treat weights as variables to be optimized
  2. Define a loss function that measures prediction error vs. ground truth
  3. Use gradient-based optimization (e.g., SGD) to minimize the loss

Historical note

Neural nets have a long history (starting early 1940s):

  • McCulloch & Pitts (1943)
  • Two perspectives:
    • models inspired by the brain
    • flexible function approximators for complex mappings

Training FFNNs: SGD, cross-entropy, backpropagation, and language modeling

Stochastic gradient descent (SGD)

We minimize an objective \(L(\theta)\) (loss) over parameters \(\theta\) by iteratively updating: \(\theta \leftarrow \theta - \eta \nabla_\theta L(\theta)\) where \(\eta\) is the learning rate.

In SGD, we estimate gradients using:

  • a single example, or
  • a mini-batch of examples

Cross-entropy loss

For multi-class classification with softmax outputs \(y\) and a one-hot target \(t\):

\[L = -\sum_{c=1}^{C} t_c \log y_c\]

If the true class is \(c^*\), then \(t_{c^*}=1\) and: \(L = -\log y_{c^*}\)

Error backpropagation procedure

Backprop computes gradients efficiently using the chain rule.

Define:

  • hidden pre-activation \(a = Wx + b\)
  • hidden activation \(h = g(a)\)
  • output scores \(z = Uh\)
  • probabilities \(y=\mathrm{softmax}(z)\)

Then compute errors from output to input.

Degree of error at node \(j\) in the output layer

With softmax + cross-entropy, the output-layer “error signal” is:

\[\delta^{(out)} = y - t\]

Component-wise: \(\delta^{(out)}_j = y_j - t_j\)

Computing the error at hidden layer nodes

Propagate error backward through \(U\) and the activation \(g\):

\[\delta^{(hid)} = (U^\top \delta^{(out)}) \odot g'(a)\]

where \(\odot\) is element-wise multiplication.

Adjusting / updating the weights

Gradients for the output layer:

\[\frac{\partial L}{\partial U} = \delta^{(out)} h^\top\]

Gradients for the hidden layer:

\[\frac{\partial L}{\partial W} = \delta^{(hid)} x^\top\]

Bias gradients:

\[\frac{\partial L}{\partial b_{out}} = \delta^{(out)},\quad \frac{\partial L}{\partial b} = \delta^{(hid)}\]

Then update (SGD): \(U \leftarrow U - \eta \frac{\partial L}{\partial U},\quad W \leftarrow W - \eta \frac{\partial L}{\partial W}\)
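Putting the whole backward pass together, a NumPy sketch that repeatedly fits one made-up example (using \(g=\tanh\), so \(g'(a)=1-\tanh^2(a)\)):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
n0, n1, C = 4, 5, 3
W, b = rng.normal(scale=0.5, size=(n1, n0)), np.zeros(n1)
U, b_out = rng.normal(scale=0.5, size=(C, n1)), np.zeros(C)
x = rng.normal(size=n0)
t = np.array([0.0, 1.0, 0.0])                 # one-hot target

def forward():
    a = W @ x + b
    h = np.tanh(a)
    y = softmax(U @ h + b_out)
    return a, h, y

eta = 0.1
loss_before = -np.log(forward()[2] @ t)        # cross-entropy = -log y_{c*}
for _ in range(50):
    a, h, y = forward()
    d_out = y - t                              # output-layer error signal
    d_hid = (U.T @ d_out) * (1 - np.tanh(a) ** 2)   # propagate through U and g'
    U -= eta * np.outer(d_out, h)              # dL/dU = d_out h^T
    b_out -= eta * d_out
    W -= eta * np.outer(d_hid, x)              # dL/dW = d_hid x^T
    b -= eta * d_hid
loss_after = -np.log(forward()[2] @ t)
```

Repeated updates drive the cross-entropy loss down and make the target class the most probable output for this example.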

Learning all the way back to embeddings

If \(x\) is an embedding pulled from an embedding matrix \(E\), we can learn \(E\) too.

Key point:

  • gradients can flow into the input vector \(x\)
  • that updates the corresponding row/column of \(E\) for the word(s) used

So the model learns task-specific embeddings.

Incorporating embedding vector selection

Embedding lookup can be written as a matrix multiplication with a one-hot vector \(v_w\):

  • one-hot \(v_w \in {0,1}^{V}\)
  • embedding matrix \(E \in \mathbb{R}^{V\times d}\)

Then: \(x = E^\top v_w\)

Only the selected word’s embedding receives gradient updates (efficient sparse update).

For context windows, the same embedding matrix \(E\) is reused (shared) across positions.

FFNNs for language modeling

FFNN language model idea:

  • take a fixed-size context window of the previous \(n-1\) words (like an \(n\)-gram)
  • map the words to embeddings
  • (often) concatenate the embeddings into one input vector
  • feed it into an FFNN
  • output a softmax over the vocabulary for the next word

If context embeddings are concatenated:

  • input dimension: \(n_0 = (n-1)\cdot d\)
  • output classes: \(C = V\)

Training uses cross-entropy with the true next word as the target: \(L = -\log P(w_t \mid w_{t-(n-1):t-1})\) where \(P(\cdot)\) comes from the softmax output.

Recurrent Neural Networks (RNNs)

Neural LMs vs. n-gram LMs

Neural language models (NLMs) share the basic goal of predicting the next word, but they differ from n-gram models in how they generalize.

Advantages over n-gram LMs

  • Don’t need smoothing.
  • Can (in principle) handle much longer histories than fixed-order n-grams.
  • Can generalize over contexts of similar words (via shared embeddings and learned representations).
  • Often higher predictive accuracy given the same training data size.

Disadvantages

  • Much slower to train.

Feedforward neural LMs: limitations

A feedforward LM (FFNN LM) typically:

  • Uses a fixed-size context window.
  • Treats the same phrase differently depending on position in the window (e.g., “the fish” at position 1 vs. position 2).
  • Struggles with tasks that require information arbitrarily far back in the sequence.

These limitations motivate recurrent models.

Recurrent neural networks (RNNs): the idea

An RNN is a network with a cycle: the hidden layer at time \(t\) depends on:

  • the input at time \(t\)
  • the hidden state at time \(t-1\)

This makes the hidden state a form of memory/context that can, in principle, encode information back to the start of the sequence.

RNN unfolded in time (forward computation)

It’s often helpful to “unroll” an RNN across time steps. Conceptually:

  • each time step has its own hidden activation \(h_t\)
  • weights are shared across timesteps

In slide diagrams, each node typically represents an entire layer, and each edge represents a weight matrix.

Simple RNN: computation structure

A common simple RNN (one hidden layer) uses three parameter matrices:

  • \(U\): hidden-to-hidden (previous state to current state)
  • \(W\): input-to-hidden
  • \(V\): hidden-to-output

Forward equations (generic activation \(g(\cdot)\)): \(h_t = g\big(Uh_{t-1} + Wx_t\big)\) \(y_t = \mathrm{softmax}(Vh_t)\)

Key properties:

  • The input vector dimension (e.g., embedding size) must be fixed ahead of time (same as FFNNs).
  • The matrices \(U, W, V\) do not change with \(t\); parameter count is constant w.r.t. sequence length.
  • In principle, \(h_t\) depends on the entire history via repeated substitution: \(h_t = g\!\left(U g\!\left(U g(\cdots U h_0 + \cdots) + \cdots\right) + \cdots\right)\)
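A minimal NumPy sketch of the forward equations (the hidden-to-output matrix is named Vo here only to avoid clashing with the vocabulary size \(V\); all sizes and weights are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, n_h, V = 4, 6, 10                          # embedding, hidden, vocab sizes
U = rng.normal(scale=0.3, size=(n_h, n_h))    # hidden-to-hidden
W = rng.normal(scale=0.3, size=(n_h, d))      # input-to-hidden
Vo = rng.normal(scale=0.3, size=(V, n_h))     # hidden-to-output

def rnn_forward(xs):
    """Run the simple RNN over a sequence of input vectors."""
    h = np.zeros(n_h)
    ys = []
    for x in xs:                  # same U, W, Vo reused at every timestep
        h = np.tanh(U @ h + W @ x)
        ys.append(softmax(Vo @ h))
    return ys

xs = [rng.normal(size=d) for _ in range(3)]   # a length-3 "sentence"
ys = rnn_forward(xs)
```

The parameter count is fixed regardless of sequence length: only the loop runs longer for longer inputs.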

Can an RNN be reduced to a very wide FFNN?

If you fix the input length (say \(x_1,x_2,x_3\)), you can “unroll” the computation into a wide feedforward network (no cycles). But this only works for fixed sequence lengths—RNNs are designed to naturally handle variable-length sequences.

Practical limits: vanishing gradients + compute time

Even though the entire history can influence \(h_t\) “in principle,” in practice:

  • gradients can vanish (or explode) during training, so very old information is hard to preserve (covered later as vanishing/exploding gradients)
  • for very long sequences, it takes time to propagate information from early tokens to far later tokens

A common workaround is to process long inputs in fixed-length subsequences (truncated BPTT). FFNNs, by contrast, can process all tokens in parallel.


Training RNNs: backpropagation through time (BPTT)

Training procedure for a simple recurrent network

As with FFNNs, training uses:

  • labeled training set (or next-word supervision for language modeling)
  • a loss function comparing output to gold targets
  • gradient-based optimization (e.g., SGD)

But now errors have time dependencies because:

  • \(h_t\) influences \(y_t\)
  • \(h_t\) also influences future hidden states \(h_{t+1}, h_{t+2}, \dots\), and thus future losses

Backpropagation through time (BPTT)

BPTT “unrolls” the RNN for \(T\) timesteps and backpropagates errors through this unfolded graph.

At a given time step \(t\), the hidden state contributes to:

  • loss at time \(t\)
  • loss at time \(t+1\) (through its effect on \(h_{t+1}\))
  • and so on

So gradients for \(U, W, V\) accumulate contributions from multiple timesteps.


RNN architectures for NLP applications

Sequence tagging / labeling

For tasks like POS tagging or NER:

  • input: word sequence \(x_{1:T}\)
  • output: a label per token \(y_{1:T}\)

An RNN can output \(y_t\) at each timestep and predict token-level labels.

Language modeling / generation

Autoregressive generation: each generated word is conditioned on previous generated words/states.

Teacher forcing (training):

  • feed the gold previous token as input at each timestep
  • compute loss against the gold next token
  • backpropagate through time
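The teacher-forcing loss can be sketched as follows (NumPy; the vocabulary size, dimensions, and token ids are made up for illustration, and gradients/BPTT are omitted):

```python
import numpy as np

rng = np.random.default_rng(4)
V_size, d_e, d_h = 6, 4, 5              # vocab, embedding, hidden dims (assumptions)
E = rng.normal(0, 0.1, (V_size, d_e))   # embedding table
W = rng.normal(0, 0.1, (d_h, d_e))      # input-to-hidden
U = rng.normal(0, 0.1, (d_h, d_h))      # hidden-to-hidden
Vout = rng.normal(0, 0.1, (V_size, d_h))  # hidden-to-output

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def teacher_forced_loss(tokens):
    """Sum of -log P(gold next token), feeding the GOLD previous token as input."""
    h = np.zeros(d_h)
    loss = 0.0
    for prev, nxt in zip(tokens[:-1], tokens[1:]):
        h = np.tanh(U @ h + W @ E[prev])  # input: gold previous token (not a sample)
        p = softmax(Vout @ h)
        loss += -np.log(p[nxt])           # cross-entropy vs. gold next token
    return loss

loss = teacher_forced_loss([0, 3, 2, 5, 1])  # toy token-id sequence
```

At generation time the model's own previous prediction would replace the gold token; teacher forcing substitutes the gold token only during training.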

One label for the entire sequence (sequence classification)

For sentence/document classification:

  • no token-level outputs are needed
  • use the final hidden state \(h_T\) (or an aggregation) as input to a classifier (often an FFNN + softmax)
  • end-to-end training uses a loss for the final label only

Bidirectional RNNs

Bidirectional RNNs train two RNNs:

  • forward (left-to-right)
  • backward (right-to-left)

Combine hidden states (commonly via concatenation): \(h_t^{\text{bidir}} = \big[h_t^{\rightarrow};\; h_t^{\leftarrow}\big]\)

This can help when predictions benefit from both left and right context.

Encoder–decoder RNN architecture

Encoder–decoder (seq2seq) overview

A basic encoder–decoder model uses two RNNs:

  • Encoder: maps an input sequence \(x_{1:n}\) to an intermediate representation (“context”).
  • Decoder: maps that context to an output sequence \(y_{1:m}\).

This is the classic sequence-to-sequence (seq2seq) setup (e.g., early neural machine translation, chatbots).

Components of an encoder–decoder model

Three conceptual components:

  1. Encoder produces contextualized hidden states \(h^{e}_{1:n}\) for the input tokens.
  2. Context vector \(c\), a function of encoder hidden states, intended to convey “the essence” of the input to the decoder.
  3. Decoder generates hidden states \(h^{d}_{1:m}\), and outputs \(y_{1:m}\) via a softmax layer.

Early neural machine translation (bitext training intuition)

Training data is bitext: pairs of aligned sentences (parallel text).

  • Source: language translated from
  • Target: language translated to
  • Often separate source and target with an end-of-sentence marker like \(\langle /s \rangle\)
  • Train autoregressively to predict the next token over the concatenated source–target sequence.

Inference-time decoding (high level)

At test time:

  1. Run the source sentence through the encoder to produce encoder states.
  2. Initialize the decoder with the context and generate target tokens autoregressively until \(\langle /s \rangle\) is produced.

Encoder–decoder equations and training

Context available throughout decoding

A common design makes the encoder context available at every decoder step so it doesn’t “fade”.

Core equations (with \(g\) as the RNN cell update and \(\hat{y}_{t-1}\) the embedding of the previously generated token):

\(c = h_n^{e}\) \(h_0^{d} = c\) \(h_t^{d} = g(\hat{y}_{t-1}, h_{t-1}^{d}, c)\) \(\hat{y}_t = \mathrm{softmax}(h_t^{d})\)

The output \(\hat{y}_t\) is a distribution over the vocabulary; generation samples (or greedily selects) from it.

Common encoder architecture

In practice, encoders are often stacked Bi-LSTMs (though in principle the encoder can be any recurrent architecture: simple RNN, LSTM, etc.).

  • The final contextual representation can be formed by concatenating top-layer forward/backward states.
  • For token-level training, you can feed per-timestep contextualized states to an output layer.

Training: teacher forcing

Encoder–decoder models are trained end-to-end, commonly using teacher forcing in the decoder:

  • during training, feed the gold previous target token as the next input \(x_{t+1}\)
  • speeds up training and stabilizes learning.

The encoder–decoder bottleneck

Compression into a single context vector

A vanilla encoder–decoder compresses all source-side information into a single vector \(c\). This creates a bottleneck:

  • information about specific source tokens isn’t directly accessible during later decoding steps
  • longer inputs are harder to represent well in one fixed-length vector.

Heuristic idea: pick the most relevant encoder state

If we could choose a relevant encoder state \(h^{enc}_t\) for the current decoder step (instead of a fixed \(c\)), we could reduce the bottleneck.


Attention

Motivation

We want a method to compute a context vector that:

  • uses the entire encoder context,
  • updates dynamically during decoding (focus on relevant input tokens),
  • remains a fixed-size vector each step.

Attention as a weighted average of encoder states

At decoder step \(i\), compute:

  • attention weights \(\alpha_{ij}\) over encoder states \(h_j^{e}\),
  • a context vector \(c_i = \sum_j \alpha_{ij} h_j^{e}\).

The weights depend on the current decoder state (often \(h_{i-1}^{d}\)).

Dot-product attention (one scoring option)

Compute a relevance score between decoder and encoder states: \(\mathrm{score}(h_{i-1}^{d}, h_{j}^{e}) = h_{i-1}^{d} \cdot h_{j}^{e}\)

Normalize the scores with a softmax over all encoder positions to get weights: \(\alpha_{ij} = \frac{\exp(\mathrm{score}(h_{i-1}^{d}, h_{j}^{e}))}{\sum_k \exp(\mathrm{score}(h_{i-1}^{d}, h_{k}^{e}))}\)

Then compute: \(c_i = \sum_{j} \alpha_{ij} h_{j}^{e}\)
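These three steps (score, softmax, weighted average) can be sketched in NumPy. The encoder states and decoder state below are random stand-ins with assumed toy dimensions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                   # numerical stability
    e = np.exp(z)
    return e / e.sum()

def dot_product_attention(h_dec, H_enc):
    """h_dec: (d,) previous decoder state; H_enc: (n, d) encoder states."""
    scores = H_enc @ h_dec            # score(h_dec, h_j^e) = h_dec . h_j^e
    alpha = softmax(scores)           # attention weights alpha_ij over positions j
    c = alpha @ H_enc                 # c_i = sum_j alpha_ij h_j^e
    return c, alpha

rng = np.random.default_rng(1)
H_enc = rng.normal(size=(5, 8))       # n = 5 source tokens, d = 8 (assumptions)
h_dec = rng.normal(size=8)
c, alpha = dot_product_attention(h_dec, H_enc)
```

Note that \(c\) has the same fixed size as a single encoder state, regardless of the source length \(n\).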

Decoder update with attention context

Use the per-step context \(c_i\) when computing the next decoder hidden state: \(h_i^{d} = g(\hat{y}_{i-1}, h_{i-1}^{d}, c_i)\)

Transformers

Self-attention

A Transformer replaces recurrence with attention, letting each token directly “look at” other tokens in the sequence.

Given an input sequence of embeddings \(X \in \mathbb{R}^{n \times d_{\text{model}}}\), we project to:

  • Queries: \(Q = XW_Q\)
  • Keys: \(K = XW_K\)
  • Values: \(V = XW_V\)

where \(W_Q, W_K, W_V \in \mathbb{R}^{d_{\text{model}} \times d_k}\) (often \(d_k = d_{\text{model}}/h\) with \(h\) heads).

The core operation is scaled dot-product attention:

\[\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V\]
  • \(QK^\top \in \mathbb{R}^{n \times n}\) gives pairwise similarity scores between tokens.
  • \(\sqrt{d_k}\) scaling keeps logits in a reasonable range (stabilizes softmax and gradients).
  • \(M\) is an (optional) mask:
    • Causal mask for autoregressive decoding (prevent attending to future tokens).
    • Padding mask to ignore padding positions.

Interpretation: for each position \(i\), attention forms a weighted average of value vectors \(V_j\), with weights \(\alpha_{ij}\) computed from query–key similarity.

Key properties:

  • Long-range dependencies: any token can attend to any other token in one hop.
  • Parallelism: all positions compute attention in parallel (unlike RNNs).
  • Cost: attention is \(O(n^2)\) in sequence length due to \(QK^\top\).

Single-head

With a single head, we use one set of projections \(W_Q,W_K,W_V\).

Steps at a high level:

  1. Compute similarity logits: \(S = \frac{QK^\top}{\sqrt{d_k}} \in \mathbb{R}^{n \times n}\)
  2. Apply masks if needed: \(S \leftarrow S + M\)
  3. Normalize: \(A = \mathrm{softmax}(S)\) (row-wise, so each row sums to 1)
  4. Mix values: \(H = AV\)

So \(H \in \mathbb{R}^{n \times d_k}\) is the new contextual representation.

Causal masking (decoder / language modeling): For position \(i\), block attention to positions \(j > i\). The mask \(M\) sets those logits to \(-\infty\) (or a very negative number), ensuring \(\alpha_{ij}=0\).
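Steps 1–4 plus causal masking can be sketched in NumPy (toy dimensions; the very negative constant standing in for \(-\infty\) is a common implementation trick):

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)   # row-wise stability shift
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv, causal=False):
    """Single-head scaled dot-product self-attention over X: (n, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)              # 1. similarity logits (n, n)
    if causal:
        n = S.shape[0]
        future = np.triu(np.ones((n, n)), k=1).astype(bool)
        S = np.where(future, -1e9, S)       # 2. mask positions j > i
    A = softmax_rows(S)                     # 3. each row sums to 1
    return A @ V, A                         # 4. mix values

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 6, 6                   # toy sizes (assumptions)
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
H, A = self_attention(X, Wq, Wk, Wv, causal=True)
```

With the causal mask, row 0 of \(A\) can only attend to position 0, so its remaining weights are zero.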


Multi-head

Single-head attention forces one similarity geometry. Multi-head attention lets the model attend using different subspaces (“different kinds of relationships”) simultaneously.

For \(h\) heads:

  • Head \(r\) uses its own projections \(W_Q^{(r)}, W_K^{(r)}, W_V^{(r)}\)
  • Compute each head: \(\text{head}_r = \mathrm{Attn}(XW_Q^{(r)}, XW_K^{(r)}, XW_V^{(r)})\)
  • Concatenate and project back: \(\mathrm{MHA}(X) = \mathrm{Concat}(\text{head}_1,\dots,\text{head}_h)W_O\) where \(W_O \in \mathbb{R}^{(h\cdot d_k)\times d_{\text{model}}}\).

Typical dimensional choice:

  • \(d_k = d_v = d_{\text{model}}/h\)
  • Each head is cheaper, but total compute is similar to one larger attention.

Why it helps:

  • One head might focus on local syntax (nearby tokens),
  • another on long-distance agreement,
  • another on entity coreference, etc.
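Multi-head attention as described above can be sketched as follows (NumPy; head count and dimensions are toy assumptions, with \(d_k = d_{\text{model}}/h\)):

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def attn(Q, K, V):
    """Scaled dot-product attention for one head."""
    return softmax_rows(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """Wq/Wk/Wv: per-head projection lists; Wo: (h*d_k, d_model) output projection."""
    heads = [attn(X @ wq, X @ wk, X @ wv)       # each head: (n, d_k)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo  # concat then project back

rng = np.random.default_rng(2)
n, d_model, h = 4, 8, 2
d_k = d_model // h                              # typical choice: d_k = d_model / h
Wq = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wo = rng.normal(size=(h * d_k, d_model))
X = rng.normal(size=(n, d_model))
out = multi_head_attention(X, Wq, Wk, Wv, Wo)
```

Because each head projects into its own \(d_k\)-dimensional subspace, the heads can learn different similarity geometries while the total compute stays comparable to one full-width head.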

Positional Embeddings

Self-attention alone is permutation-invariant: if you shuffle tokens, attention has no built-in notion of order. Transformers inject position information via positional encodings added to input embeddings.

Common approaches

1) Learned absolute positional embeddings

  • Learn a position vector \(p_i \in \mathbb{R}^{d_{\text{model}}}\) for each position \(i\).
  • Input becomes: \(x_i \leftarrow e(w_i) + p_i\)

Pros: simple and effective.
Cons: fixed maximum length unless extended.

2) Sinusoidal (fixed) positional encodings (original Transformer)

Define deterministic sin/cos waves of different frequencies: \(\mathrm{PE}(i,2k)=\sin\!\left(\frac{i}{10000^{2k/d_{\text{model}}}}\right),\quad \mathrm{PE}(i,2k+1)=\cos\!\left(\frac{i}{10000^{2k/d_{\text{model}}}}\right)\)

Pros: extrapolates to longer lengths (in principle).
Interpretation: encodes relative offsets through linear combinations.

3) Relative / rotary / bias-based positions (conceptual)

Instead of adding absolute vectors, incorporate relative position into the attention scores (useful for length generalization and locality). Many modern LMs use variants of this idea.
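The sinusoidal scheme from option 2 can be sketched as follows (assuming an even \(d_{\text{model}}\); the sizes are illustrative):

```python
import numpy as np

def sinusoidal_pe(n_pos, d_model):
    """PE(i,2k) = sin(i / 10000^(2k/d)), PE(i,2k+1) = cos(i / 10000^(2k/d)).
    Requires an even d_model (sin/cos come in pairs)."""
    pos = np.arange(n_pos)[:, None]              # positions i: (n_pos, 1)
    k = np.arange(d_model // 2)[None, :]         # frequency index k: (1, d/2)
    angle = pos / (10000 ** (2 * k / d_model))   # one angle per (position, freq)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even dims get sin
    pe[:, 1::2] = np.cos(angle)                  # odd dims get cos
    return pe

pe = sinusoidal_pe(10, 8)
```

Since \(\sin^2 + \cos^2 = 1\) for each frequency pair, every position vector has the same norm \(\sqrt{d_{\text{model}}/2}\), so positions differ only in direction.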


Layer Norm

Transformers rely heavily on residual connections + normalization for stable training.

Layer normalization (per token) normalizes across the hidden dimensions. Given a hidden vector \(x \in \mathbb{R}^{d}\):

\[\mu = \frac{1}{d}\sum_{i=1}^{d} x_i,\quad \sigma^2 = \frac{1}{d}\sum_{i=1}^{d}(x_i-\mu)^2\]

\[\mathrm{LN}(x) = \gamma \odot \frac{x-\mu}{\sqrt{\sigma^2+\epsilon}} + \beta\]

where \(\gamma,\beta \in \mathbb{R}^{d}\) are learned scale/shift parameters and \(\epsilon\) is for numerical stability.
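A direct NumPy transcription of the LayerNorm formula (toy input; \(\gamma=\mathbf{1}\), \(\beta=\mathbf{0}\) here so the result is just the normalized vector):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token's hidden vector across its d dimensions."""
    mu = x.mean()                                 # mean over hidden dims
    var = x.var()                                 # variance over hidden dims
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([1.0, 2.0, 3.0, 4.0])
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

After normalization, \(y\) has (approximately) zero mean and unit standard deviation; \(\gamma\) and \(\beta\) then let the network undo or rescale this if useful.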

Where LayerNorm is applied

A Transformer block has two sublayers (attention and FFN), each wrapped with residual + normalization. Two common conventions:

  • Post-norm (original): \(x \leftarrow \mathrm{LN}(x + \mathrm{Sublayer}(x))\)
  • Pre-norm (common in modern LMs): \(x \leftarrow x + \mathrm{Sublayer}(\mathrm{LN}(x))\)

Pre-norm often improves optimization stability for deep Transformers.


Feed-forward Layer

Each Transformer block includes a position-wise feed-forward network (FFN): the same MLP applied independently to each token.

Form: \(\mathrm{FFN}(x) = W_2\,\phi(W_1 x + b_1) + b_2\)

  • \(W_1 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}\)
  • \(W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}\)
  • \(\phi(\cdot)\) is a nonlinearity (ReLU or GELU are common)
  • \(d_{\text{ff}}\) is typically larger than \(d_{\text{model}}\) (e.g., 4×), giving the model extra capacity.
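The position-wise FFN can be sketched in NumPy (toy sizes with \(d_{\text{ff}} = 4\,d_{\text{model}}\); ReLU stands in for \(\phi\)). Tokens are stored as rows, so `X @ W1.T` computes \(W_1 x\) for every token at once:

```python
import numpy as np

def ffn(X, W1, b1, W2, b2):
    """Apply the same two-layer MLP to each token (row of X) independently."""
    return np.maximum(0.0, X @ W1.T + b1) @ W2.T + b2   # phi = ReLU

rng = np.random.default_rng(3)
d_model, d_ff, n = 8, 32, 4                   # d_ff = 4 x d_model (typical)
W1 = rng.normal(0, 0.1, size=(d_ff, d_model)); b1 = np.zeros(d_ff)
W2 = rng.normal(0, 0.1, size=(d_model, d_ff)); b2 = np.zeros(d_model)
X = rng.normal(size=(n, d_model))
out = ffn(X, W1, b1, W2, b2)
```

"Position-wise" means there is no mixing across tokens here: running the FFN on a single row gives the same result as that row of the batched output.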

Putting it together: a Transformer block (high-level)

A standard block (encoder-style) looks like:

  1. Self-attention sublayer + residual + norm
  2. FFN sublayer + residual + norm

Decoder blocks add causal masking in self-attention and (in encoder–decoder models) an additional cross-attention sublayer.


Transformer Models

Decoder-only transformer models

Structure: stack of Transformer decoder blocks with causal self-attention (each token attends only to previous tokens).

Training objective (causal language modeling): Predict next token: \(\max \sum_{t} \log P(w_t \mid w_{<t})\)

What it’s good at:

  • generation (completion, chat, code)
  • few-shot prompting (conditioning by putting instructions/examples in the prefix)

Key detail: causal mask enforces left-to-right dependence, making it naturally autoregressive.


Encoder-only transformer models

Structure: stack of Transformer encoder blocks with bidirectional self-attention (tokens can attend to both left and right context).

Typical pretraining objective (masked language modeling, MLM):

  • randomly mask some tokens and predict them using surrounding context.
  • model learns strong contextual representations because it can use both directions.

What it’s good at:

  • classification (sentiment, topic, NLI)
  • sequence labeling (NER, POS)
  • embedding / retrieval use cases (representations of text)

Key detail: because attention is bidirectional, it’s not inherently a generator (generation requires a decoding strategy/objective).


Encoder-decoder transformer models

Structure: two stacks:

  • Encoder: bidirectional self-attention over the input sequence.
  • Decoder: causal self-attention + cross-attention over encoder outputs.

Cross-attention (core idea):

  • decoder queries attend to encoder keys/values, letting the decoder condition on the full source.
  • same attention math, but \(Q\) comes from decoder states and \(K,V\) come from encoder states.
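Cross-attention reuses the same attention math with a different source of \(Q\) versus \(K,V\); a minimal NumPy sketch (random decoder/encoder states with assumed toy dimensions):

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def cross_attention(H_dec, H_enc, Wq, Wk, Wv):
    """Q from decoder states; K, V from encoder states."""
    Q = H_dec @ Wq                    # (m, d_k): one query per target position
    K, V = H_enc @ Wk, H_enc @ Wv     # (n, d_k): keys/values per source position
    A = softmax_rows(Q @ K.T / np.sqrt(Q.shape[-1]))  # (m, n): target x source
    return A @ V, A

rng = np.random.default_rng(5)
m, n, d = 3, 5, 8                     # target length, source length, dim (assumptions)
H_dec = rng.normal(size=(m, d))
H_enc = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, A = cross_attention(H_dec, H_enc, Wq, Wk, Wv)
```

Each decoder position gets its own distribution over all \(n\) source positions, which is exactly what removes the single-vector bottleneck of the vanilla seq2seq setup.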

Training objective (seq2seq): Given source \(x\) and target \(y\), maximize: \(\sum_{t}\log P(y_t \mid y_{<t}, x)\)

What it’s good at:

  • translation
  • summarization
  • question answering with a generated answer
  • many “text-to-text” tasks (map input text to output text)

Why it helps vs. vanilla encoder–decoder RNNs:

  • no single-vector bottleneck: the decoder can attend to all encoder states through cross-attention.
  • parallelizable training on the encoder side and within attention computations.