Contextualized Representations: Architectural Deconstruction of BERT and GPT Pipelines
The core challenge of computational linguistics has always been finding an effective way to represent polysemy—the fact that a single word can change its meaning based on its context. Early vector space models like Word2Vec and GloVe assigned a single static vector to each word in a dictionary. While these models effectively captured general semantic associations, they failed when a token's meaning shifted based on its surroundings. For example, a static model would assign the exact same vector to the token "bank" in the phrases "investment bank" and "river bank." This limitation restricted the adaptability of downstream natural language processing pipelines.
To resolve this, early contextual frameworks like ELMo (Embeddings from Language Models) used bidirectional Long Short-Term Memory (LSTM) networks to generate distinct representations based on context. However, ELMo’s recurrence-based foundation suffered from sequential computation limits, making it difficult to scale up to massive text datasets.
The release of the Transformer architecture by Vaswani et al. in 2017 removed these constraints, enabling the development of the modern language modeling paradigm. By using multi-head self-attention mechanisms instead of recurrence, deep models could be trained across massive text corpora. This architectural leap led to two dominant, complementary approaches: BERT (Bidirectional Encoder Representations from Transformers), developed by Google, and GPT (Generative Pre-trained Transformer), developed by OpenAI.
This guide examines both model families in depth: their tokenization systems, pre-training strategies, optimization dynamics, and parameter-efficient scaling techniques, at the level of detail expected in senior AI/ML engineering interviews.
1. The Evolution of Language Representation
The transition from traditional statistical approaches to modern self-supervised models was driven by changes in feature engineering and optimization objectives. Early text classification pipelines relied on sparse, count-based metrics such as Term Frequency-Inverse Document Frequency (TF-IDF) combined with shallow classifiers. While these methods were computationally efficient, they treated text as an unordered bag of words, discarding grammatical structure and word order.
Distributed representation models improved on this by mapping words into dense, continuous vector spaces where geometric closeness corresponds to semantic similarity. However, these models were still limited by their static nature.
The next shift introduced deep, contextualized representations by using large unlabelled text corpora for pre-training. This established the modern two-stage Pre-train and Fine-tune paradigm:
- Self-Supervised Pre-training: The model processes billions of raw tokens, learning language structure, syntax, and world knowledge by predicting hidden or upcoming words.
- Supervised Fine-tuning: The pre-trained weights are adapted to a specific downstream task (such as sentiment classification or named entity recognition) using a much smaller, labeled dataset.
By using the model's pre-trained weights as a starting point, downstream tasks require significantly less labeled data to achieve high performance, making advanced NLP viable across a wide variety of industries.
2. Architectural Foundations: The Core Transformer
BERT and GPT use different halves of the original Transformer framework. BERT is built from stacked Encoder blocks, which are optimized for bidirectional feature extraction. GPT is built from stacked Decoder blocks that keep the causally masked self-attention needed for step-by-step autoregressive generation but drop the encoder-decoder cross-attention sub-layer, since there is no encoder to attend to.
Both architectures rely on Scaled Dot-Product Attention to route information between tokens. Given an input matrix projected into Query ($Q$), Key ($K$), and Value ($V$) tensors, the attention calculation is defined as:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
In Multi-Head Attention (MHA), this operation is split across multiple independent projection subspaces, allowing the model to simultaneously track relationships between different positions in a sequence:
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$
$$\text{where} \quad \text{head}_i = \text{softmax}\left(\frac{QW_i^Q (KW_i^K)^\top}{\sqrt{d_k}}\right)VW_i^V$$
Where $W_i^Q$, $W_i^K$, and $W_i^V$ are projection matrices for head $i$, and $W^O$ is the output projection matrix.
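As a minimal sketch of the formulas above, the following NumPy code implements scaled dot-product attention and a multi-head wrapper for self-attention. The function names, the choice to pack all per-head projections into single `(d_model, d_model)` matrices, and the random weights are illustrative assumptions, not any particular library's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # q, k: (..., T, d_k); v: (..., T, d_v)
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    # x: (T, d_model); every weight matrix: (d_model, d_model).
    T, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(m):
        # (T, d_model) -> (num_heads, T, d_head)
        return m.reshape(T, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    heads, _ = scaled_dot_product_attention(q, k, v)
    # Concatenate the heads back to (T, d_model), then apply W^O.
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)
    return concat @ w_o

rng = np.random.default_rng(0)
T, d_model, h = 5, 16, 4
x = rng.normal(size=(T, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads=h)
_, attn = scaled_dot_product_attention(x, x, x)
print(out.shape, attn.sum(axis=-1))  # each row of attention weights sums to 1
```

Slicing one large projection into `num_heads` column blocks is equivalent to keeping a separate $W_i^Q$, $W_i^K$, $W_i^V$ per head, and is how the computation is usually batched in practice.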
The Structural Pre-LN Evolution
The original Transformer architecture placed Layer Normalization after the residual connections (Post-LN). However, deep Post-LN models are often unstable to train: the residual stream is re-normalized after every sub-layer, and stable optimization typically depends on careful learning-rate warm-up.
GPT-2 and most later decoder-only models use a Pre-LN configuration, applying Layer Normalization directly to the inputs of the sub-layers before the residual connections (the original BERT and GPT-1 kept the Post-LN layout):
$$X_{\text{intermediate}} = X_l + \text{MultiHead}(\text{LayerNorm}(X_l))$$
$$X_{l+1} = X_{\text{intermediate}} + \text{FFN}(\text{LayerNorm}(X_{\text{intermediate}}))$$
This change lets gradients flow through the identity path of the residual connections without passing through a normalization layer, which stabilizes training and makes much deeper stacks practical.
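The difference between the two orderings can be sketched in a few lines. Here `attn` and `ffn` are hypothetical callables standing in for the sub-layers, and `layer_norm` is a simplified version without learned gain or bias, so this is an illustration of the wiring rather than a full block.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row; learned gain and bias are omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, attn, ffn):
    # Post-LN: normalize after each residual addition.
    x = layer_norm(x + attn(x))
    return layer_norm(x + ffn(x))

def pre_ln_block(x, attn, ffn):
    # Pre-LN: normalize only the sub-layer input; the residual path is untouched.
    x = x + attn(layer_norm(x))
    return x + ffn(layer_norm(x))

# With sub-layers that contribute nothing, the Pre-LN block is an exact
# identity map, while the Post-LN block still rescales the residual stream.
zero = lambda h: np.zeros_like(h)
x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(4, 8))
pre_out = pre_ln_block(x, zero, zero)
post_out = post_ln_block(x, zero, zero)
print(np.allclose(pre_out, x), np.allclose(post_out, x))
```

The identity behaviour of the Pre-LN block is the "clean" residual path referred to above: the input reaches the output without being transformed by any normalization layer.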
3. BERT: Bidirectional Representation Engineering
BERT was designed to generate deep, bidirectional representations by explicitly looking at both left and right context across all layers. This full context visibility is achieved by using an unmasked attention matrix, allowing every token to interact freely with every other token in the sequence.
Sub-word Tokenization: The WordPiece Algorithm
To manage large vocabularies without encountering out-of-vocabulary (OOV) errors, BERT utilizes the WordPiece tokenization algorithm. WordPiece breaks rare or complex words down into smaller, meaningful sub-word fragments (e.g., splitting "unaffable" into `["un", "##aff", "##able"]`), where the `##` prefix indicates a sub-word unit that attaches to a preceding token.
The WordPiece vocabulary is constructed using a data-driven optimization process:
- Initialize the vocabulary with all individual base characters and symbols present in the training text.
- Treat every unique sub-word in the corpus as a separate candidate token.
- Build a language model over the text using the current vocabulary, and evaluate the likelihood of the training data.
- Identify the two sub-words whose combination maximizes the training data likelihood ratio: $$\text{Score}_{(A,B)} = \frac{\text{count}(AB)}{\text{count}(A) \times \text{count}(B)}$$
- Add the combined token $AB$ to the vocabulary, and repeat the process until the target vocabulary size (e.g., 30,522 tokens for BERT) is reached.
During inference, WordPiece uses a greedy longest-match-first strategy to split incoming words into the largest available sub-words in the vocabulary.
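A minimal sketch of the two pieces described above: the pair score used while building the vocabulary and the greedy longest-match-first split used at inference. The toy vocabulary, the counts, and the function names are assumptions made for illustration; production tokenizers add normalization, a maximum word length, and other details omitted here.

```python
def pair_score(count_ab, count_a, count_b):
    # Merge score from the formula above: count(AB) / (count(A) * count(B)).
    return count_ab / (count_a * count_b)

def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no valid split: the whole word maps to [UNK]
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"un", "##aff", "##able", "##a", "##f", "u", "n"}
print(wordpiece_tokenize("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))        # ['[UNK]']

# A rare pair made of rare parts can outscore a frequent pair of common parts.
common = pair_score(count_ab=20, count_a=100, count_b=200)
rare = pair_score(count_ab=5, count_a=10, count_b=10)
print(common, rare)
```

Dividing by the individual counts is what distinguishes this score from plain pair frequency: it favours pairs whose parts rarely occur apart.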
Pre-training Objectives and Optimization
Because bidirectional attention allows tokens to see themselves across layers, standard auto-regressive language modeling (predicting the next token) would cause the model to inadvertently leak target words. To prevent this, BERT uses two self-supervised pre-training objectives:
1. Masked Language Modeling (MLM) and the 80/10/10 Rule
In a given text sequence, 15% of the input tokens are randomly selected for corruption. However, replacing all selected tokens with a static `[MASK]` token would create a mismatch between pre-training and fine-tuning, since the `[MASK]` token never appears during downstream tasks.
To resolve this, the 15% chosen tokens are modified using an 80/10/10 distribution rule:
- 80% of the time: The chosen token is replaced with the literal `[MASK]` token.
- 10% of the time: The chosen token is replaced with a completely random token from the vocabulary. This forces the model to maintain accurate representations for words that might be incorrect in context.
- 10% of the time: The chosen token is left completely unchanged. This biases the model toward preserving the true identity of the input tokens.
The loss function for the MLM task calculates the cross-entropy loss specifically over the 15% selected tokens, ignoring predictions for the uncorrupted tokens.
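The corruption procedure can be sketched as follows. This is an illustrative version, not BERT's reference code: it selects each token independently with probability 0.15 rather than picking exactly 15% of positions, and it borrows the common convention of using `-100` as a label that the loss ignores. The `mask_id` and `vocab_size` values in the usage lines are placeholder assumptions.

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, special_ids=frozenset(),
                select_prob=0.15, seed=None):
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(inputs)  # -100 marks positions excluded from the loss
    for i, tok in enumerate(token_ids):
        if tok in special_ids or rng.random() >= select_prob:
            continue
        labels[i] = tok  # the loss is computed only at selected positions
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random vocabulary token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

ids = list(range(1000, 11000))  # hypothetical token ids
inputs, labels = mask_tokens(ids, vocab_size=30522, mask_id=103, seed=0)
selected = [i for i, label in enumerate(labels) if label != -100]
masked = sum(inputs[i] == 103 for i in selected)
print(len(selected) / len(ids), masked / len(selected))
```

Note that the label is recorded for all three branches, including the "unchanged" one, which is why the model cannot simply copy its input at unmasked positions.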
2. Next Sentence Prediction (NSP)
To help the model understand relationships between distinct sentences, BERT is trained on a binary classification task called Next Sentence Prediction. During pre-training, the model is fed sentence pairs ($A$ and $B$) joined by a separator token (`[SEP]`).
- 50% of the time, sentence $B$ is the actual sequential next sentence that follows $A$ in the text (`IsNext`).
- 50% of the time, sentence $B$ is a random sentence sampled from a completely different document in the corpus (`NotNext`).
The model learns to classify this relationship by extracting the representation of the special classification token (`[CLS]`) placed at the very start of the sequence.
Engineering Retrospective: Later work, most notably RoBERTa (Robustly Optimized BERT Pretraining Approach), found that removing the NSP task matched or slightly improved downstream performance. RoBERTa dropped NSP entirely and instead trained on longer, full-length sequences with dynamic masking (a fresh mask pattern each time a sequence is seen), together with more data and larger batches.
Architectural Scale Layout
The foundational configurations of BERT are structured as follows:
- BERT-Base: 12 Layers ($L$), 768 Hidden Dimension ($H$), 12 Attention Heads ($A$), totaling 110M parameters.
- BERT-Large: 24 Layers ($L$), 1024 Hidden Dimension ($H$), 16 Attention Heads ($A$), totaling 340M parameters.
4. GPT: Causal Autoregressive Generation
In contrast to BERT's bidirectional encoder design, the GPT (Generative Pre-trained Transformer) family relies on an autoregressive model architecture. It is built out of stacked decoder blocks optimized to predict the probability distribution of the next token given a preceding context window.
Causal Masking and Unidirectional Constraints
To ensure the model cannot look ahead at future tokens during training, GPT applies a strict causal mask to the self-attention matrix. The mask is zero on and below the diagonal and $-\infty$ strictly above it. Added to the attention scores before the softmax step, it drives the weights assigned to subsequent positions to exactly zero:
$$M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}$$
$$\text{CausalAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$$
This restriction means that a token at position $t$ can only attend to historical indices $\le t$, ensuring the model generates text in a strict left-to-right sequence.
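A small NumPy sketch of the masked computation, under the assumption of a single head and self-attention with identity projections (so `q`, `k`, and `v` are all the raw input). It also checks the property stated above: perturbing a future token leaves earlier outputs unchanged.

```python
import numpy as np

def causal_mask(T):
    rows = np.arange(T)[:, None]  # query index i
    cols = np.arange(T)[None, :]  # key index j
    return np.where(cols > rows, -np.inf, 0.0)

def causal_attention(q, k, v):
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k) + causal_mask(q.shape[0])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)  # exp(-inf) == 0 for masked entries
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, w = causal_attention(x, x, x)

# Changing the last token must not affect any earlier position.
x2 = x.copy()
x2[3] += 10.0
out2, _ = causal_attention(x2, x2, x2)
print(np.round(w, 3))
print(np.allclose(out[:3], out2[:3]))
```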
Sub-word Tokenization: Byte-Pair Encoding (BPE)
GPT uses a modified version of Byte-Pair Encoding (BPE) at the byte level. While traditional BPE builds its vocabulary out of raw character strings, GPT-2 and subsequent models initialize their base vocabularies using raw bytes (256 units). This approach allows the model to handle any text sequence without encountering an out-of-vocabulary token, as unknown characters can simply be broken down into their individual byte components.
The BPE tokenization pipeline works as follows:
- Tokenize the entire training corpus into individual characters/bytes.
- Count all adjacent token pairs in the text.
- Identify the most frequent pair of tokens (e.g., `("e", "s")`) and merge them into a new vocabulary token (`"es"`).
- Repeat this merge step iteratively until the desired vocabulary limit (e.g., 50,257 tokens for GPT-2) is met.
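To prevent the algorithm from merging across unrelated character classes (such as letters and punctuation, which would yield near-duplicate tokens like `dog.` and `dog!`), GPT-2 first splits the text with a regular expression into chunks of letters, digits, or punctuation, each optionally preceded by a space, and only allows merges within a chunk.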
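The merge loop can be sketched with a classic character-level BPE trainer. This is a simplified illustration: it works on characters of whole words rather than bytes, skips the regex pre-splitting step, and uses a toy corpus, so it shows the merge mechanics rather than GPT-2's actual tokenizer.

```python
from collections import Counter

def train_bpe(words, num_merges):
    # Represent each distinct word as a tuple of symbols, weighted by frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # ties: first pair seen wins
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        merged = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges, corpus

words = ["newest"] * 6 + ["widest"] * 3 + ["lower"] * 2 + ["low"] * 5
merges, corpus = train_bpe(words, num_merges=3)
print(merges)
print(list(corpus))
```

On this toy corpus the first learned merge is `("e", "s")`, followed by `("es", "t")`, which mirrors how frequent suffixes become single tokens.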
Mathematical Formulation of Causal Language Modeling (CLM)
Given an unlabelled training corpus of tokens $\mathcal{U} = \{u_1, \dots, u_n\}$, the GPT model optimizes parameters $\Theta$ to maximize the standard log-likelihood of the causal language modeling objective:
$$\mathcal{L}_{\text{CLM}}(\mathcal{U}) = \sum_{i=1}^n \log P(u_i | u_{i-k}, \dots, u_{i-1}; \Theta)$$
Where $k$ represents the maximum historical context window size. The final conditional token probability $P(u_i)$ is extracted by passing the output layer representations through a softmax classification layer mapped over the entire vocabulary space.
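The objective can be evaluated directly from a matrix of logits. The sketch below assumes `logits[t]` holds the model's scores for the token at position `t + 1`, so the first token (which has no context here) contributes no term; the function name and the uniform-logit check are illustrative.

```python
import numpy as np

def clm_log_likelihood(logits, token_ids):
    # logits: (T, vocab_size); logits[t] scores token t + 1 given tokens 0..t.
    logits = logits[:-1]
    targets = np.asarray(token_ids)[1:]
    # Numerically stable log-softmax over the vocabulary.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Sum the log-probability assigned to each actual next token.
    return log_probs[np.arange(len(targets)), targets].sum()

rng = np.random.default_rng(0)
vocab_size, T = 50, 6
token_ids = rng.integers(0, vocab_size, size=T)

# Sanity check: uniform logits give log(1 / vocab_size) per predicted token.
uniform_logits = np.zeros((T, vocab_size))
ll = clm_log_likelihood(uniform_logits, token_ids)
print(ll, -(T - 1) * np.log(vocab_size))
```

Training frameworks minimize the negative of this quantity, usually averaged over tokens, which is the familiar cross-entropy loss.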
Generative Evolution: From GPT-1 to GPT-4 Scaling
The evolution of the GPT family reflects a broader trend toward scale-driven in-context learning:
- GPT-1 (2018): Explored the foundational two-stage paradigm, pre-training a 117M parameter decoder on unlabelled text followed by supervised task fine-tuning.
- GPT-2 (2019): Scaled the architecture to 1.5B parameters. The authors demonstrated that large-scale pre-training enabled the model to perform zero-shot task transfer without explicit fine-tuning, simply by responding to the context of the input prompt.
- GPT-3 (2020): Scaled the parameter count to 175B. The model is dense in the sense that every parameter is active for every token, although its layers alternate between dense and locally banded sparse attention patterns. GPT-3 showed that massive scale unlocks in-context few-shot learning, allowing the model to pick up complex tasks from a few text examples included directly in the prompt.
- GPT-4 and Mixture of Experts (MoE): OpenAI has not published GPT-4's architecture, so its internals are not publicly confirmed. More broadly, many recent large models control training and inference cost by replacing dense feed-forward layers with sparse Mixture of Experts (MoE) layers. Instead of activating all parameters for every token, a routing mechanism forwards each token vector to a small subset of expert networks, which increases model capacity while keeping per-token compute roughly constant.
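The routing idea in the last item can be sketched with a generic top-k gate. Everything here is an assumption made for illustration (linear experts, a linear gate, softmax over only the selected experts, no load-balancing loss); it is not a description of any specific production model.

```python
import numpy as np

def moe_layer(x, gate_w, expert_ws, top_k=2):
    # x: (T, d); gate_w: (d, E); expert_ws: list of E matrices, each (d, d).
    logits = x @ gate_w
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, top[t]]
        weights = np.exp(sel - sel.max())
        weights /= weights.sum()  # softmax over the selected experts only
        for w, e in zip(weights, top[t]):
            out[t] += w * (x[t] @ expert_ws[e])  # only k experts run per token
    return out, top

rng = np.random.default_rng(0)
T, d, E = 6, 8, 4
x = rng.normal(size=(T, d))
gate_w = rng.normal(size=(d, E))

# Sanity check: if every expert is identical, routing cannot change the output.
shared = rng.normal(size=(d, d))
out, chosen = moe_layer(x, gate_w, [shared] * E, top_k=2)
print(chosen)
print(np.allclose(out, x @ shared))
```

With `E` experts and `top_k = 2`, each token touches only 2 of the `E` expert weight matrices, which is where the compute saving relative to a dense layer of the same total size comes from.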
5. Comparative Architecture Reference Matrix
BERT and GPT occupy distinct functional niches due to their structural differences. The following reference matrix outlines these operational trade-offs:
| Architectural Property | BERT (Bidirectional Encoder Stack) | GPT (Causal Decoder Stack) |
|---|---|---|
| Attention Context Profile | Fully Bidirectional. Tokens look left and right across all layers. | Strictly Unidirectional (Causal Masking). Tokens only look at previous indices. |
| Primary Pre-training Task | Masked Language Modeling (MLM) + Next Sentence Prediction (NSP). | Causal Language Modeling (CLM) next-token prediction. |
| Tokenization Paradigm | WordPiece (Character likelihood ratio scoring). | Byte-Pair Encoding (BPE at the byte level). |
| Optimal Task Alignment | Classification, Feature Extraction, Named Entity Recognition, QA. | Autoregressive Text Generation, Reasoning, Summarization, Code synthesis. |
| Downstream Adaptation | Requires adding task-specific linear classification heads and fine-tuning. | Adapted via prompt design, few-shot contexts, instruction tuning, or RLHF. |
| Inference Throughput Limit | High throughput; processes entire sequences in a single parallel pass. | Lower throughput; generation is bound by step-by-step token execution. |
6. Distributed Training Dynamics & Infrastructure Optimization
Training models with hundreds of billions of parameters requires specialized optimization techniques and distributed infrastructure to manage memory constraints and gradient stability.
The Weight Decay Fix: AdamW
Standard L2 regularization penalizes large weights by adding a penalty term directly to the loss function. When using the standard Adam optimizer, this approach can cause issues because the regularization penalty gets mixed into the running averages of the first and second moments of the gradients.
To address this, Loshchilov and Hutter introduced the AdamW optimizer, which decouples weight decay from the gradient accumulation steps, applying the penalty directly to the parameter update step:
$$g_t = \nabla \mathcal{L}(\theta_t) + \lambda \theta_t \quad \text{(Standard Adam L2 conflict)}$$
$$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right) \quad \text{(AdamW Decoupled Step)}$$
Where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected first and second moment estimates, $\eta$ is the learning rate, and $\lambda$ is the explicit weight decay coefficient. This change stabilizes the optimization process over long training runs.
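A scalar sketch makes the difference between the two update rules concrete. The function below is an illustrative single-parameter implementation (the name `adam_step` and the `decoupled` flag are assumptions, not a library API). To isolate the regularizer, the loss gradient is set to zero, so any movement comes purely from weight decay.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=True):
    if not decoupled:
        grad = grad + weight_decay * theta  # L2 penalty enters the moment estimates
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    update = m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        update = update + weight_decay * theta  # AdamW: decay bypasses the moments
    return theta - lr * update, m, v

lr, wd = 1e-3, 0.01

def one_step(theta, decoupled):
    # First optimizer step with zero loss gradient and zero-initialized moments.
    return adam_step(theta, grad=0.0, m=0.0, v=0.0, t=1, lr=lr,
                     weight_decay=wd, decoupled=decoupled)[0]

for theta in (10.0, 0.1):
    print(theta, theta - one_step(theta, False), theta - one_step(theta, True))
```

Under Adam with L2, both the large and the small weight shrink by roughly `lr`, because the adaptive normalization cancels the magnitude of the penalty. Under AdamW, each weight shrinks by `lr * wd * theta`, proportional to its size, which is the behaviour weight decay is meant to have.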
Memory Management: ZeRO (Zero Redundancy Optimizer) Stages
When training large models, standard data-parallel approaches replicate the complete model state across all available GPUs, which can quickly exhaust device memory. The ZeRO memory saving framework eliminates this redundancy by partitioning model states across data-parallel nodes:
- ZeRO-Stage 1: Partitions the optimizer states (e.g., Adam's first and second moments) across processors. This reduces the memory footprint by up to $4\times$ without altering communication volume.
- ZeRO-Stage 2: Partitions both the optimizer states and the gradients. Each GPU stores only the gradients corresponding to its assigned portion of the optimizer states.
- ZeRO-Stage 3: Partitions all three core components: optimizer states, gradients, and the model parameters themselves. During training, each layer gathers the parameter shards it is missing from the GPUs that own them just before its forward or backward computation, then frees them immediately afterwards.
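The savings of each stage can be estimated with simple arithmetic. The sketch below assumes the usual mixed-precision Adam accounting of 16 bytes per parameter (2 for FP16 weights, 2 for FP16 gradients, 12 for the FP32 master weights, momentum, and variance) and ignores activations and temporary buffers; the function name and the 7.5B-parameter, 64-GPU example are illustrative.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage):
    """Approximate model-state memory per GPU under mixed-precision Adam."""
    params = 2 * num_params   # FP16 weights
    grads = 2 * num_params    # FP16 gradients
    optim = 12 * num_params   # FP32 master weights + momentum + variance
    if stage >= 1:
        optim /= num_gpus     # Stage 1: shard optimizer states
    if stage >= 2:
        grads /= num_gpus     # Stage 2: also shard gradients
    if stage >= 3:
        params /= num_gpus    # Stage 3: also shard parameters
    return params + grads + optim

for stage in range(4):
    gb = zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
# stage 0: 120.0 GB per GPU
# stage 1: 31.4 GB per GPU
# stage 2: 16.6 GB per GPU
# stage 3: 1.9 GB per GPU
```

As the GPU count grows, Stage 1 approaches 4 bytes per parameter instead of 16, which is where the "up to $4\times$" figure comes from.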
Hybrid Parallelism Strategies
When a model's state cannot fit onto a single GPU even with ZeRO optimization, engineering teams combine multiple parallelization approaches:
- Tensor Parallelism (Megatron-LM): Splits individual weight matrices across multiple GPUs. For example, the feed-forward layer's first projection $W_1$ is split column-wise across two chips, so each chip computes its slice of the hidden activations independently. The second projection is split row-wise, and the partial outputs are summed with an all-reduce.
- Pipeline Parallelism: Groups layers into sequential stages and distributes them across separate devices. To reduce idle time (the "pipeline bubble"), each mini-batch is split into smaller micro-batches that flow through the stages concurrently.
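The tensor-parallel split of a feed-forward layer can be simulated on a single machine. In this sketch the "devices" are just list entries, the final `sum` stands in for the cross-device reduction, and the sizes and the tanh-approximated GeLU are illustrative assumptions.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GeLU; any elementwise activation would work here.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

rng = np.random.default_rng(0)
batch, d_model, d_ff, n_dev = 3, 8, 32, 2
x = rng.normal(size=(batch, d_model))
w1 = rng.normal(size=(d_model, d_ff))
w2 = rng.normal(size=(d_ff, d_model))

# Unsharded reference computation.
reference = gelu(x @ w1) @ w2

# Shard W1 by columns and W2 by rows; each simulated device holds one pair.
w1_shards = np.split(w1, n_dev, axis=1)
w2_shards = np.split(w2, n_dev, axis=0)

# Each device computes its slice independently. Because the activation is
# elementwise, no communication is needed between the two matmuls.
partials = [gelu(x @ a) @ b for a, b in zip(w1_shards, w2_shards)]

# Summing the partial outputs stands in for the cross-device reduction.
y = sum(partials)
print(np.allclose(y, reference))
```

Pairing a column split with a row split is the key design choice: it defers all communication to a single reduction at the end of the layer.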
Numerical Precision Profiles: FP16 vs. BF16
To maximize compute efficiency, modern training pipelines use lower-precision floating-point representations:
- FP16 (Half Precision): Uses 5 exponent bits and 10 mantissa bits. Because of its narrow dynamic range, FP16 can lead to numerical underflow problems where small gradients drop to zero. Managing this requires keeping a master copy of the weights in FP32 and applying dynamic loss scaling factors.
- BF16 (Brain Floating Point): Dedicates 8 exponent bits and 7 mantissa bits, matching the dynamic range of full FP32 at the cost of lower precision. This design largely avoids underflow and usually removes the need for loss scaling, which is why BF16 has become the default training format on accelerators that support it.
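The range-versus-precision trade-off can be demonstrated with the standard library alone. In this sketch FP16 uses `struct`'s half-precision format, and BF16 is simulated by keeping the top 16 bits of the FP32 encoding (truncation rather than round-to-nearest, a simplifying assumption); the helper names and sample values are illustrative.

```python
import struct

def to_fp16(x):
    # Round-trip through IEEE 754 half precision (struct format 'e').
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x):
    # Simulate bfloat16: keep the sign, 8 exponent bits, and top 7 mantissa
    # bits of the float32 encoding, and zero out the remaining 16 bits.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

tiny_grad = 1e-8
print(to_fp16(tiny_grad))          # 0.0: below FP16's smallest subnormal (~6e-8)
print(to_bf16(tiny_grad))          # nonzero, close to 1e-8
print(to_fp16(tiny_grad * 65536))  # loss scaling lifts the value into FP16 range

# The flip side: FP16 resolves 1.001, BF16 collapses it to 1.0.
print(to_fp16(1.001), to_bf16(1.001))
```

The third line illustrates why loss scaling works for FP16: multiplying the loss (and therefore the gradients) by a large constant moves small values into the representable range, and the scale is divided out again before the weight update.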
7. Adaptation Strategies and Parameter Efficiency
Once a model has been pre-trained, it must be adapted to perform specific downstream tasks efficiently.
Fine-Tuning BERT for Downstream Tasks
Adapting BERT requires adding a small, task-specific classification layer on top of the pre-trained encoder stack:
- Sequence Classification: To classify entire sequences (e.g., sentiment analysis), a linear classification layer is attached to the output of the special classification token (`[CLS]`). The entire network is then fine-tuned using a supervised cross-entropy loss function.
- Span Extraction (Question Answering): For extractive QA tasks like SQuAD, the model learns two separate output target vectors: a Start Vector and an End Vector. The probability of a token at index $i$ being the start of the answer span is calculated using a dot product with the token's hidden representation: $$P_{\text{start}_i} = \frac{\exp(h_i^\top V_{\text{start}})}{\sum_j \exp(h_j^\top V_{\text{start}})}$$
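For span extraction, the start distribution above and an analogous end distribution can be combined to pick an answer span. The sketch below assumes the end probability uses the same softmax form with its own vector, and that the predicted span is the pair $(i, j)$ with $j \ge i$ maximizing $P_{\text{start}_i} \cdot P_{\text{end}_j}$; the `max_len` cap, the function name, and the random inputs are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def best_span(hidden, v_start, v_end, max_len=30):
    # hidden: (T, H) token representations; v_start, v_end: (H,) learned vectors.
    p_start = softmax(hidden @ v_start)
    p_end = softmax(hidden @ v_end)
    best, best_score = (0, 0), -1.0
    for i in range(len(hidden)):
        # Only consider spans that end at or after the start, up to max_len tokens.
        for j in range(i, min(i + max_len, len(hidden))):
            score = p_start[i] * p_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, p_start, p_end

rng = np.random.default_rng(0)
hidden = rng.normal(size=(12, 16))
v_start, v_end = rng.normal(size=16), rng.normal(size=16)
span, p_start, p_end = best_span(hidden, v_start, v_end)
print(span, p_start.sum(), p_end.sum())
```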
Parameter-Efficient Fine-Tuning (PEFT): LoRA
As models scale to billions of parameters, full fine-tuning becomes computationally expensive, as it requires updating and saving every weight matrix across the entire network. To lower these costs, engineers use Low-Rank Adaptation (LoRA).
LoRA freezes the pre-trained weight matrices $W_0 \in \mathbb{R}^{d \times k}$ and injects trainable rank decomposition matrices alongside them. The forward pass update is defined as:
$$h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} (B \cdot A) x$$
Where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$. The rank constraint $r$ is set to a low value (e.g., $r=4$ or $r=8$), ensuring $r \ll \min(d, k)$.
During training, matrix $A$ is initialized using a random Gaussian distribution, and matrix $B$ is initialized to zero, ensuring that $\Delta W = 0$ at step zero. This mechanism drastically reduces the number of trainable parameters (often by over 99%), lowering memory requirements while maintaining performance on downstream tasks.
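A minimal NumPy sketch of the LoRA forward pass and its parameter count. The dimensions, the `alpha` value, and the Gaussian scale are illustrative assumptions; only the structure (frozen $W_0$, Gaussian $A$, zero $B$, scaling by $\alpha / r$) comes from the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 48, 4, 8

W0 = rng.normal(size=(d, k))             # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, k))  # trainable, Gaussian initialization
B = np.zeros((d, r))                     # trainable, zero initialization

def lora_forward(x, W0, A, B, alpha, r):
    # Compute B(Ax) rather than (BA)x so the d x k update is never materialized.
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=k)
h_init = lora_forward(x, W0, A, B, alpha, r)

# After B receives a (simulated) update, the adapter changes the output.
B_trained = rng.normal(scale=0.01, size=(d, r))
h_trained = lora_forward(x, W0, A, B_trained, alpha, r)

full_params = d * k
lora_params = r * (d + k)
print(np.allclose(h_init, W0 @ x), full_params, lora_params)

# For a hypothetical 4096 x 4096 projection with r = 8:
print(8 * (4096 + 4096) / (4096 * 4096))  # about 0.0039, i.e. under 0.4%
```

Because $\Delta W = 0$ at initialization, the adapted model starts out exactly equal to the pre-trained one, and after training the product $\frac{\alpha}{r} BA$ can be merged into $W_0$ so inference incurs no extra latency.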
8. Production Performance Bottlenecks and Optimization
Deploying large autoregressive models like GPT into high-throughput production environments introduces significant engineering challenges.
The Memory Bottleneck: KV-Caching Mechanics
During autoregressive generation, generating each new token requires calculating attention scores against all previous tokens in the sequence. Without optimization, this process would require re-computing the Key and Value states for the entire historical context window at every step, creating an $O(T^2)$ computational bottleneck.
To avoid this redundancy, production inference systems implement KV-Caching. This optimization saves the calculated key and value vectors for past tokens in GPU memory. For each new step, the system only needs to calculate the $K$ and $V$ tensors for the single newly generated token and append them to the existing cache. This reduces the per-step cost of the $K$ and $V$ projections from $O(T)$ to $O(1)$, although the new query must still attend over all $T$ cached positions, so attention itself remains $O(T)$ per token.
While KV-caching saves compute cycles, it creates a memory bottleneck, as storing the cache for thousands of concurrent requests can quickly fill up GPU memory. To manage this overhead, engineers use architectures like Multi-Query Attention (MQA), which shares a single key and value head across all query heads, or Grouped-Query Attention (GQA), which shares one key and value head per group of query heads, to reduce cache size.
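The sketch below shows a single-head KV cache and checks that incremental decoding reproduces full causal attention. It also includes a rough cache-size estimate for comparing MHA, GQA, and MQA. The class and function names and the 32-layer, 32-head, 128-dimension, 4096-token configuration are hypothetical values chosen for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Single-head cache: one key row and one value row per processed token."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x_t, w_q, w_k, w_v):
        # Project only the newest token and append it to the cache.
        self.keys.append(x_t @ w_k)
        self.values.append(x_t @ w_v)
        q = x_t @ w_q
        K, V = np.stack(self.keys), np.stack(self.values)
        # The new query attends over every cached position.
        weights = softmax(K @ q / np.sqrt(q.shape[-1]))
        return weights @ V

def full_causal_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    return softmax(np.where(future, -np.inf, scores)) @ v

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    # The leading factor 2 covers keys and values.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

rng = np.random.default_rng(0)
T, d = 6, 8
x = rng.normal(size=(T, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

cache = KVCache()
cached = np.stack([cache.step(x[t], w_q, w_k, w_v) for t in range(T)])
reference = full_causal_attention(x, w_q, w_k, w_v)
print(np.allclose(cached, reference))

mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch=1)
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096, batch=1)
mqa = kv_cache_bytes(layers=32, kv_heads=1, head_dim=128, seq_len=4096, batch=1)
print(mha / 2**20, gqa / 2**20, mqa / 2**20)  # cache size in MiB per sequence
```

In this hypothetical configuration the per-sequence cache drops from 2 GiB with 32 key-value heads to 512 MiB with 8 and 64 MiB with 1, since cache size scales linearly with the number of key-value heads.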
9. AI/ML Engineering Interview Preparation Hub
To clear technical screens for senior machine learning roles, candidates must demonstrate a deep understanding of structural and optimization details. Use these explicit technical answers during your preparation:
Advanced Technical Interview Questions
Q1: Why does BERT require an explicit 80/10/10 corruption rule during its Masked Language Modeling pre-training phase, rather than simply replacing all selected target tokens with the `[MASK]` token?
Answer: If every selected token were replaced with the literal `[MASK]` token, pre-training inputs would differ systematically from fine-tuning inputs, where `[MASK]` never appears. This is a mismatch in the input distribution rather than in the architecture, and it means the model would only ever be asked to produce useful predictions at `[MASK]` positions. The 80/10/10 rule reduces this discrepancy. By keeping 10% of the selected tokens unchanged, the model learns to maintain high-quality contextual representations for ordinary, unmasked words. By replacing 10% with random tokens, it learns that any input token may be wrong, forcing it to rely on the surrounding context rather than trusting the token at each position.
Q2: Why is the AdamW optimizer preferred over standard Adam when training large Transformer models with L2 weight regularization?
Answer: In standard Adam, the L2 regularization penalty is added directly to the loss function. When computing gradients, this penalty gets mixed into the running averages of the first and second moments. As a result, the regularization step is scaled by the inverse of the historical gradient variance. This means that weights with exceptionally frequent or large gradients receive less relative regularized penalty than weights with small gradients, which distorts the intended weight decay effect. AdamW solves this by decoupling weight decay from the gradient update tracking, applying the penalty directly to the parameters after the moment tracking updates are computed.
Q3: Explain the low-rank assumption that justifies using LoRA for parameter-efficient adaptation.
Answer: The LoRA authors hypothesize, building on earlier intrinsic-dimension results for fine-tuning, that the weight updates $\Delta W$ learned during adaptation have a low "intrinsic rank." This means that the meaningful information in the update can be captured in a subspace of much lower rank than the full weight matrix. LoRA leverages this by factoring the large update matrix $\Delta W \in \mathbb{R}^{d \times k}$ into two low-rank matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, where $r \ll \min(d, k)$. This reduces the number of trainable parameters, and with it optimizer memory and checkpoint size, while largely maintaining task performance.
10. Final Mastery Summary
The development of BERT and GPT models established the modern foundations of large-scale natural language processing. By using different components of the original Transformer architecture, they created two complementary paths for language modeling. BERT's unmasked bidirectional encoder design makes it highly effective for sequence classification, structured entity extraction, and language understanding. GPT's causally masked autoregressive decoder design paved the way for open-ended text generation, reasoning, and context-driven few-shot learning.
To pass competitive AI engineering interviews, you should focus on these underlying mechanics. Demonstrating a clear understanding of sub-word tokenization algorithms, optimization details like AdamW, low-rank adaptation constraints, and infrastructure scaling methods proves that you can confidently design, train, and deploy advanced model architectures in production environments.