History of LLM Evolution (5): Building the Path of Self-Attention — The Future of Language Models from Transformer to GPT

Building the Transformer architecture from scratch, deeply understanding core components like self-attention, multi-head attention, residual connections, and layer normalization.

Mar 20, 2024 · 60 min read
LLM · GPT · Deep Learning · Transformer

Human-Crafted

Written directly by the author with no AI-generated sections.

History of LLM Evolution (5): Building the Path of Self-Attention — The Future of Language Models from Transformer to GPT

Prerequisites: Previous micrograd and makemore series (optional), familiarity with Python, basic concepts of calculus and statistics.

Goal: Understand and appreciate how GPT works.

Resources you might need: the Colab notebook link and the detailed notes shared on Twitter.

ChatGPT

From the debut of ChatGPT in late 2022 to today’s GPT-4 and Claude 3, Large Language Models (LLMs) have integrated into the daily lives of many. They are probabilistic systems; for the same prompt, their answers vary. Compared to the language models we built previously, GPT can model sequences of words, characters, or more general symbols, knowing how words in English tend to follow each other. From these models’ perspective, our prompt is the beginning of a sequence, and the model’s job is to complete it.

So, what is the neural network that models these word sequences?

Transformer

In 2017, the milestone paper “Attention Is All You Need” proposed the Transformer architecture. GPT, which we know well, stands for Generative Pre-trained Transformer. Although the original paper targeted machine translation, its profound impact has reached the entire AI field. Minor modifications to this architecture allow it to be applied to a vast array of AI applications, and it is the core of ChatGPT.

Of course, our goal isn’t to train a ChatGPT; that is a massive industrial effort involving pre-training on enormous amounts of data followed by fine-tuning. We aim to train a Transformer-based language model which, like our previous ones, will be character-level.

Building the Model

Dataset

We’ll use the toy-scale “Tiny Shakespeare” dataset, a favorite of Andrej Karpathy. It’s a collection of Shakespeare’s works, roughly 1MB in size. Note that while ChatGPT outputs tokens (chunks of words), we’ll start with characters.

# Always start with a dataset. Download Tiny Shakespeare.
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
 
# Read to check
with open('input.txt', 'r', encoding='utf-8') as f:
	text = f.read()

Dataset preview

Tokenize

chars = sorted(list(set(text))) # Get unique characters and sort them
vocab_size = len(chars)
print(''.join(chars)) # Merge into one string
print(vocab_size)
 
# Output (Sorted by ASCII):
# !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz 
# 65

This character table logic is identical to our previous sections. We then need a way to tokenize the input—converting the raw text string into a sequence of integers. For our character-level model, this just means mapping each character to a number.

If you’ve followed the previous sections, this code should feel familiar, much like the lookup table in Bigram.

# Map characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: string to list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: list of integers to string
 
print(encode("hii there"))
print(decode(encode("hii there")))

We’ve built an encoder and decoder to translate between strings and integers at the character level. This is a very simple tokenization scheme; many others exist, such as Google’s SentencePiece, which splits text into subwords (the common choice in practice), and OpenAI’s tiktoken, which uses byte-pair encoding.

Tiktoken comparison

Using tiktoken with GPT-2’s vocabulary of 50,257 tokens, the same string encodes to just 3 integers, compared with the 9 our character-level scheme needs.
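As a quick sanity check (assuming the tiktoken package is installed), you can compare the two schemes directly:

import tiktoken
 
enc = tiktoken.get_encoding("gpt2") # GPT-2's BPE tokenizer
print(enc.n_vocab)                  # 50257
print(enc.encode("hii there"))      # a handful of subword tokens
print(encode("hii there"))          # our character-level encoding: 9 integers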

# Encode the entire dataset into a torch.tensor
import torch
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000]) 

After encoding, this is how the first 1,000 characters of the dataset look to our model:

Encoded data

Encoded data detail

As seen, 0 is the newline character and 1 is the space.

Currently, the entire dataset is represented as a massive sequence of integers.

Train/Val Split

# Split data into training and validation sets to check for overfitting
n = int(0.9*len(data)) # First 90% for training, rest for validation
train_data = data[:n]
val_data = data[n:]

We don’t want the model to perfectly memorize Shakespeare; we want it to learn his style.

Chunks & Batches

We won’t feed the entire text into the Transformer at once. Instead, we’ll use random chunks (samples) from the training set.

Chunking

Block Size specifies the fixed length of each input data chunk.

block_size = 8
train_data[:block_size+1]
 
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

Block context

This strategy gradually reveals context to the model: a single chunk of length block_size yields block_size training examples, with contexts ranging from a single character up to the full block.

This forces the model to learn to predict the next character from whatever precedes it, so at generation time it can work with contexts of any length up to block_size.

Batching

To utilize GPU parallelism, we stack multiple text chunks into a single tensor and process the independent chunks simultaneously.

The batch size determines how many independent sequences the Transformer processes in one forward/backward pass.

torch.manual_seed(1337) # For reproducibility
batch_size = 4 # Number of independent sequences to process in parallel
block_size = 8 # Max context length for prediction
 
# Similar to torch DataLoader
def get_batch(split):
	# Generate a small batch of inputs x and targets y
	data = train_data if split == 'train' else val_data
	ix = torch.randint(len(data) - block_size, (batch_size,))
	x = torch.stack([data[i:i+block_size] for i in ix])
	y = torch.stack([data[i+1:i+block_size+1] for i in ix])
	return x, y
 
xb, yb = get_batch('train')

torch.stack concatenates tensors along a new dimension.
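A minimal illustration of how it differs from torch.cat:

a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])
print(torch.stack([a, b]).shape) # torch.Size([2, 3]): a new leading dimension
print(torch.cat([a, b]).shape)   # torch.Size([6]): joined along the existing dimension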

Batch shapes

xb shape is 4x8, with each row being a chunk from the training set. yb targets are used to calculate the loss function.

for b in range(batch_size): # Batch dimension
	for t in range(block_size): # Time dimension
		context = xb[b, :t+1]
		target = yb[b, t]
		print(f"when input is {context.tolist()} the target: {target}")

Inputs and outputs

This clarifies the relationship between inputs and outputs.

Bigram

We previously implemented a Bigram language model in Makemore; now we’ll reimplement it quickly using torch.nn.Module.

Model Construction

import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)
 
class BigramLanguageModel(nn.Module):
 
	def __init__(self, vocab_size):
		super().__init__()
		# Each token reads logits for the next token directly from a lookup table
		self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

The embedding layer is familiar: if the input token is index 24, it retrieves the row at index 24 of the table.
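To see the lookup behaviour in isolation (a throwaway example, not part of the model code):

table = nn.Embedding(vocab_size, vocab_size) # a 65x65 table of learnable values
row = table(torch.tensor([24]))              # the forward pass is just a row lookup
print(torch.equal(row[0], table.weight[24])) # True: the output is literally row 24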

	def forward(self, idx, targets=None):
			
		# idx and targets are (B, T) tensors of integers
		logits = self.token_embedding_table(idx) # (Batch=4, Time=8, Channel=65)
		if targets is None:
			loss = None
		else:
			B, T, C = logits.shape
			logits = logits.view(B*T, C)
			targets = targets.view(B*T)
			loss = F.cross_entropy(logits, targets)
		
		return logits, loss

We know negative log-likelihood is a good loss measure, implemented as “cross-entropy” in PyTorch. Intuitively, the model should assign a high probability to the correct label and very low probabilities elsewhere. At initialization, the expected loss is roughly -log(1/65) ≈ 4.17; the actual value comes out a bit higher because the randomly initialized table isn’t perfectly uniform.
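As a quick check of that number:

import math
print(-math.log(1/65)) # ≈ 4.17: the loss of a uniform guess over 65 characters
# equivalently, the cross-entropy of all-equal logits against any target:
print(F.cross_entropy(torch.zeros(1, 65), torch.tensor([0]))) # tensor(4.1744)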

# Generation from the model
def generate(self, idx, max_new_tokens):
	# idx is (B, T) array of indices in current context
	
	for _ in range(max_new_tokens):
		# Get predictions
		logits, loss = self(idx)
		# Focus only on the last time step
		logits = logits[:, -1, :] # Becomes (B, C)
		# Apply softmax for probabilities
		probs = F.softmax(logits, dim=-1) # (B, C)
		# Sample from distribution
		idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
		# Append sampled index to sequence
		idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
	return idx
 
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb) # forward pass on a batch to get the (untrained) loss
print(loss)
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist())) # Start from token 0 (newline)

generate expands the context idx from $B \times T$ to $B \times (T+1)$, $B \times (T+2)$, and so on, across all batches in the time dimension.

Untrained generation

Random output from an untrained model.

Model Training

We’ll use the popular AdamW optimizer instead of Makemore’s SGD.

# Create PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

loss.backward() computes the gradients; the optimizer then uses them to update the parameters.

batch_size = 32 # Larger batch size
for steps in range(100): # Increase steps for better results
 
    # Sample a batch
    xb, yb = get_batch('train')
 
    # Evaluate loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True) # Clear previous gradients
    loss.backward() # Backpropagation
    optimizer.step() # Update parameters
 
    print(loss.item())

Training loss

Optimization is working; loss is decreasing.

With more training, loss reached ~2.48. Sampling again yields better results.

Trained generation

Better, but still not quite right.

Bigram models are limited because they only predict based on the single previous token. There’s no connection between tokens further back, hence the need for Transformers.

Transformer

If you have an NVIDIA GPU, you can accelerate training:

device = 'cuda' if torch.cuda.is_available() else 'cpu'

Setting this requires adjusting code to ensure data loading, computation, and sampling happen on the GPU. See Andrej’s lecture repository for the starting bigram.py.

We’ve split the model into training and evaluation phases. Currently, only nn.Embedding exists, so both phases behave identically. Adding dropout or batch norm layers later makes this split crucial—a best practice in model training.
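A sketch of what that split looks like in the training script, following the structure of Andrej’s lecture code (the model instance is assumed to be called model, and eval_iters is an assumed hyperparameter):

eval_iters = 200 # assumed: how many batches to average the loss over
 
@torch.no_grad() # no gradients needed while evaluating
def estimate_loss():
    out = {}
    model.eval() # switch layers like dropout into inference mode
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            _, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train() # back to training mode
    return out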

Self-Attention

Before the full Transformer, we’ll learn a mathematical trick for implementing self-attention.

torch.manual_seed(1337)
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
# x.shape = torch.Size([4, 8, 2])

We want tokens to interact in a specific way: the 5th token shouldn’t “see” tokens 6, 7, or 8, because they are in the future. It should only communicate with itself and tokens 1-4. Information flows from the preceding context to the current step in order to predict the future.

The simplest way for tokens to communicate? Averaging preceding tokens to form a historical feature vector. However, this loses spatial arrangement information.

v1. For Loop

# Target: x[b,t] = mean_{i<=t} x[b,i]
 
xbow = torch.zeros((B,T,C))
for b in range(B):
	for t in range(T):
		xprev = x[b,:t+1] # (t,c)
		xbow[b,t] = torch.mean(xprev, 0)

Mean example

The code calculates cumulative averages.

This is inefficient. Matrix multiplication can do this better.

v2. Matrix Multiplication

# Simplified example: weighted aggregation using matrix multiplication.
 
torch.manual_seed(42)
a = torch.ones(3, 3)
b = torch.randint(0,10,(3,2)).float()
c = a @ b

Standard matrix multiplication: each element in c is the dot product of a row from a and a column from b.

Matrix multiplication

To achieve our goal, use a lower triangular matrix:

$$
\begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix}
$$

Multiplying by this matrix gives, in row $n$, the sum of the first $n$ rows of the other operand. It is built with torch.tril.

a = torch.tril(torch.ones(3, 3))

Normalize each row so they sum to 1 for weighted aggregation:

a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True) # keepdim for broadcasting

Normalized matrix

Each row sums to 1; c is now the mean of preceding rows in b.

Applying this:

# v2: Weighted aggregation via matrix multiplication
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True) 

Weight matrix

The weight matrix corresponds to matrix a above.

# torch broadcasts wei across the batch dimension of x
xbow2 = wei @ x # (T, T) @ (B, T, C) ----> (B, T, C)
torch.allclose(xbow, xbow2) # True: same effect

Summary: batched matrix multiplication with a lower triangular weight matrix performs weighted aggregation, where token $t$ only sees tokens $i \le t$.

v3. Softmax

We can also use Softmax for a third version.

torch.masked_fill() fills a tensor based on a mask:

Masked fill
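A tiny standalone example of the same call:

scores = torch.zeros(3, 3)
tril = torch.tril(torch.ones(3, 3))
print(scores.masked_fill(tril == 0, float('-inf')))
# tensor([[0., -inf, -inf],
#         [0.,   0., -inf],
#         [0.,   0.,   0.]])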

Applying Softmax to each row performs the same weighted aggregation:

# v3: Using Softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x

Beyond encoding identity, we encode position:

class BigramLanguageModel(nn.Module):
 
	def __init__(self):
		super().__init__()
		# Tokens read logits from a lookup table
		self.token_embedding_table = nn.Embedding(vocab_size, n_embd) # token encoding
		self.position_embedding_table = nn.Embedding(block_size, n_embd) # position encoding
	
	def forward(self, idx, targets=None):
		B, T = idx.shape
 
		# idx and targets are (B,T) integer tensors
		tok_emb = self.token_embedding_table(idx) # (B,T,C)
		pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
		x = tok_emb + pos_emb # (B,T,C)
		x = self.blocks(x) # (B,T,C)
		x = self.ln_f(x) # (B,T,C)
		logits = self.lm_head(x) # (B,T,vocab_size)

x stores the sum of token and position embeddings. For the plain bigram model the positional information isn’t useful yet, since the prediction is the same regardless of where a token sits, but it becomes essential once attention enters the picture.

v4. Self-Attention

Simple averaging treats all previous tokens equally. In reality, how much each past token matters is data-dependent: a vowel, for instance, might want to know which consonants precede it. Self-attention solves this.

Each token emits two vectors: a query (what am I looking for?) and a key (what do I contain?).

Affinity (weights) between tokens is the dot product of keys and queries. High alignment results in a high weight, focusing the model’s attention on that specific token’s information.

Attention computes matches between queries and keys to assign weights, allowing the model to focus on the most relevant information.

We also need a Value (what information do I contribute if you’re interested in me?). We aggregate these values (passed through a linear layer) instead of raw x.

Attention mechanism

Implementing single-head self-attention:

# v4: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)
 
# Single-head self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
 
k = key(x) # (B, T, 16)
q = query(x) # (B, T, 16)
# Dot product for affinities
wei = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)
 
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
 
v = value(x)
out = wei @ v

Attention weights

wei weights are now data-dependent: tokens with high affinity contribute more information.

Attention Summary

  • Attention is a communication mechanism. Nodes in a directed graph aggregate information from connected nodes via weighted sums, where weights are data-dependent.

  • It lacks spatial concepts. Attention acts on a set of vectors, which is why position encoding is necessary.

  • Batch examples are handled independently.

  • Attention has no built-in notion of past or future. Our implementation masks out future information using masked_fill, but removing the mask allows all-to-all communication (encoder style). Ours is a “decoder” module because of the triangular mask.

  • “Self-attention” means keys and values come from the same source as the query (x). “Cross-attention” uses queries from x but keys/values from an external source (e.g., an encoder).

  • “Scaled” attention: divide wei by $\sqrt{\text{head\_size}}$.

Scaled attention

With unit-Gaussian q and k, the raw attention scores wei have variance on the order of head_size.

Variance before

Normalization brings variance back to 1:

Variance after

Why does this matter? Softmax makes distributions with large absolute values more “peaky” (approaching one-hot).

Softmax effect

Scaling controls initialization variance, keeping Softmax diffuse and preventing gradient issues.
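Both effects are easy to see directly (a small demo reusing the B, T, and head_size values from above; exact numbers vary with the random seed):

# without scaling, the variance of the scores grows with head_size...
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)
wei = q @ k.transpose(-2, -1)
print(wei.var())                     # on the order of head_size (≈16 here)
print((wei * head_size**-0.5).var()) # back to roughly 1 after scaling
 
# ...and softmax over large values collapses toward one-hot
v = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
print(torch.softmax(v, dim=-1))     # fairly diffuse
print(torch.softmax(v * 8, dim=-1)) # sharply peaked at the largest entry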

Code Implementation

class Head(nn.Module):
	""" one head of self-attention """
 
	def __init__(self, head_size):
		super().__init__()
		self.key = nn.Linear(n_embd, head_size, bias=False)
		self.query = nn.Linear(n_embd, head_size, bias=False)
		self.value = nn.Linear(n_embd, head_size, bias=False)
		self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
	
	def forward(self, x):
		B,T,C = x.shape
		k = self.key(x) # (B, T, head_size)
		q = self.query(x) # (B, T, head_size)
		# Compute affinities, scaled by 1/sqrt(head_size)
		wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
		wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
		wei = F.softmax(wei, dim=-1) # (B, T, T)
		
		# Weighted aggregation
		v = self.value(x) # (B, T, head_size)
		out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
		return out

tril is a buffer, not a parameter, so we use register_buffer.
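The practical difference, in a couple of lines (a throwaway instance; this assumes the n_embd and block_size hyperparameters are already set):

head = Head(16)
print([name for name, _ in head.named_parameters()]) # key/query/value weights: trained by the optimizer
print([name for name, _ in head.named_buffers()])    # ['tril']: saved and moved to the device, but never updated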

Multi-Head Attention

Multi-head attention applies multiple heads in parallel and concatenates results:

Multi-head attention

Code Implementation

In PyTorch, just create a list of heads.

class MultiHeadAttention(nn.Module):
	""" Multiple heads of self-attention in parallel. """
  
	def __init__(self, num_heads, head_size):
		super().__init__()
		self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
	
	def forward(self, x):
		return torch.cat([h(x) for h in self.heads], dim=-1) # Concatenate on channel dim

We now have multiple communication channels. It’s similar to group convolution.

class BigramLanguageModel(nn.Module):
 
	def __init__(self):
		super().__init__()
		self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
		self.position_embedding_table = nn.Embedding(block_size, n_embd)
		self.sa_heads = MultiHeadAttention(4, n_embd//4)
		self.lm_head = nn.Linear(n_embd, vocab_size)

Blocks

The Transformer architecture repeats blocks containing multi-head attention and a feed-forward part.

Feed-Forward Network

Transformer architecture

The feed-forward part is a simple MLP:

Feedforward dimensions

The paper uses a model dimension of 512 for the input/output and 2048 for the inner feed-forward layer, so we make the inner channel size 4× larger.

class FeedForward(nn.Module):
	""" Simple linear layer followed by non-linearity """
 
	def __init__(self, n_embd):
		super().__init__()
		self.net = nn.Sequential(
		    nn.Linear(n_embd, 4 * n_embd),
		    nn.ReLU(),
	    )
	
	def forward(self, x):
		return self.net(x)
 
class Block(nn.Module):
	""" Transformer block: decouples communication and computation """
	
	def __init__(self, n_embd, n_head):
		super().__init__()
		head_size = n_embd // n_head
		self.sa = MultiHeadAttention(n_head, head_size) # Communication
		self.ffwd = FeedForward(n_embd) # Computation
	
	def forward(self, x):
		x = self.sa(x)
		x = self.ffwd(x)
		return x
 
class BigramLanguageModel(nn.Module):
 
	def __init__(self):
		super().__init__()
		self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
		self.position_embedding_table = nn.Embedding(block_size, n_embd) 
		self.blocks = nn.Sequential(
			Block(n_embd, n_head=4),
			Block(n_embd, n_head=4),
			Block(n_embd, n_head=4),
		)
		self.lm_head = nn.Linear(n_embd, vocab_size)

Adding blocks doesn’t immediately improve decoding much. We’ve created a deep network that suffers from optimization issues. We need solutions from the paper.

Residual Connections

Residual connections (Skip connections) were proposed in Deep Residual Learning for Image Recognition.

Residual connection

Andrej: “You transform data, then have a skip connection and add it from the previous features.”

Residual explanation

  1. Transform data: Data undergoes weights and non-linearities to learn abstract representations.
  2. Skip connection: Break sequential patterns by connecting inputs directly to later layers.
  3. Addition: Element-wise addition ensures original features propagate without being “diluted.”

Residual connections facilitate learning identity mappings, crucial for deep networks. They allow gradients to flow directly, mitigating vanishing/exploding gradients.

Residual gradient

In micrograd, we saw that an addition node routes the gradient equally to all of its inputs, so the gradient from the output flows unchanged along the skip path all the way back to the input, in addition to flowing through the transformation branch.
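A one-variable sketch of why the skip path helps gradients (torch.tanh stands in for an arbitrary block):

x = torch.randn(3, requires_grad=True)
y = (x + torch.tanh(x)).sum() # residual form: identity plus a transformation
y.backward()
print(x.grad) # 1 + (1 - tanh(x)^2): the constant 1 comes straight through the skip connection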

class MultiHeadAttention(nn.Module):
	def __init__(self, num_heads, head_size):
		super().__init__()
		self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
		self.proj = nn.Linear(n_embd, n_embd) # Projection
	 
	def forward(self, x):
		out = torch.cat([h(x) for h in self.heads], dim=-1)
		out = self.proj(out) # Linear transformation for residual path
		return out
 
class FeedForward(nn.Module):
	def __init__(self, n_embd):
		super().__init__()
		self.net = nn.Sequential(
		    nn.Linear(n_embd, 4 * n_embd),
		    nn.ReLU(),
		    nn.Linear(4 * n_embd, n_embd), # Project back to residual path
	    )
	
	def forward(self, x):
		return self.net(x)
		
class Block(nn.Module):
	def __init__(self, n_embd, n_head):
		super().__init__()
		head_size = n_embd // n_head
		self.sa = MultiHeadAttention(n_head, head_size)
		self.ffwd = FeedForward(n_embd)
	
	def forward(self, x):
		x = x + self.sa(x) # Residual addition
		x = x + self.ffwd(x)
		return x

Layer Norm

The second optimization is Layer Norm:

Layer norm

Similar to Batch Norm, but normalizes across features instead of the batch dimension. It computes mean and std for all features of each sample. While the original paper applies it after transformation, modern practice uses “Pre-norm”—applying Layer Norm before the transformation.
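A quick check of what nn.LayerNorm does per sample (the 32×100 shapes are arbitrary):

ln = nn.LayerNorm(100)
x = torch.randn(32, 100) # a batch of 32 feature vectors
out = ln(x)
print(out[0].mean(), out[0].std())       # each row is normalized: mean ≈ 0, std ≈ 1
print(out[:, 0].mean(), out[:, 0].std()) # columns (across the batch) are not, unlike batch norm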

We now have a fairly complete Transformer (decoder-only).

class Block(nn.Module):
	def __init__(self, n_embd, n_head):
		super().__init__()
		head_size = n_embd // n_head
		self.sa = MultiHeadAttention(n_head, head_size)
		self.ffwd = FeedForward(n_embd)
		self.ln1 = nn.LayerNorm(n_embd)
		self.ln2 = nn.LayerNorm(n_embd)
	
	def forward(self, x):
		x = x + self.sa(self.ln1(x))
		x = x + self.ffwd(self.ln2(x))
		return x
 
class BigramLanguageModel(nn.Module):
	def __init__(self):
		super().__init__()
		self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
		self.position_embedding_table = nn.Embedding(block_size, n_embd) 
		self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
		self.ln_f = nn.LayerNorm(n_embd) # Final Layer Norm
		self.lm_head = nn.Linear(n_embd, vocab_size)

Dropout

Dropout randomly shuts off neurons during forward/backward passes to prevent overfitting, as proposed in Dropout: A Simple Way to Prevent Neural Networks from Overfitting.

Dropout illustration

# Hyperparameters
batch_size = 64 
block_size = 256 # Context length
max_iters = 5000
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
n_embd = 384
n_head = 6
n_layer = 4
dropout = 0.2
 
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)
	
    def forward(self, x):
        # ... same as before
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei) # Dropout on attention weights
        # ...

Generated sample after training

The output looks like Shakespearean gibberish, quite impressive for a small model.

We didn’t implement the encoder or cross-attention from the original paper.

Original Transformer encoder-decoder architecture

We use decoder-only because we’re generating unconditional text. The triangular mask ensures autoregressive property for language modeling.

The original paper’s encoder-decoder structure is for translation: encoding a sentence (e.g., French) and decoding it into English.

Encoder-decoder translation example

See karpathy/nanoGPT for a decoder-only pre-training implementation.

Back to ChatGPT

Training ChatGPT involves two stages: pre-training and fine-tuning.

Pre-training

Train on a massive Internet corpus to get a decoder-only Transformer. OpenAI tokenizes with byte-pair encoding rather than characters. Our Shakespeare model has ~10M parameters; GPT-3 has 175B, trained on 300B tokens.

GPT-3 scale comparison

After this, the model only completes sequences; it can’t answer questions.

Fine-tuning

  1. Alignment: Fine-tune on thousands of “Question: Answer” documents.
  2. Reward Model: Humans rank model responses.
  3. PPO: Use Proximal Policy Optimization to fine-tune the sampling policy, turning the document completer into a chatbot.

ChatGPT training pipeline

Andrej’s Microsoft Build 2023 talk provides a comprehensive overview: State of GPT.
