The Evolution of LLM (Part 2): Word Embeddings - Deep Connections Between Multilayer Perceptrons and Language

The source code for this section is available in the source code repository.

"A Neural Probabilistic Language Model"#

This paper is considered a classic in language model training. Bengio introduced neural networks into language modeling and obtained word embeddings as a byproduct. Word embeddings have contributed significantly to subsequent deep learning work in natural language processing and remain an effective way to capture the semantic features of words.

The paper was motivated by the curse of dimensionality that arises with the original word vectors (one-hot representations). The authors propose to address it by learning distributed representations of words. Building on the n-gram model, they train a neural network on a corpus to maximize the probability of the current word given the preceding words. The model simultaneously learns a distributed representation for each word and the probability function of word sequences.

(Figure: the classic formula from the paper)
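
The figure showed the paper's core equations; as a reminder (my transcription from Bengio et al., 2003, not the original image), the model learns a function

$$f(w_t, w_{t-1}, \ldots, w_{t-n+1}) = \hat{P}(w_t \mid w_1^{t-1})$$

and training maximizes the regularized average log-likelihood of the corpus:

$$L = \frac{1}{T} \sum_t \log f(w_t, w_{t-1}, \ldots, w_{t-n+1}; \theta) + R(\theta)$$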

The word representations learned by this model differ from traditional one-hot representations: they can express similarity between words through the distance between word embeddings (Euclidean distance, cosine distance, etc.). For example, in the sentences "The cat is walking in the bedroom" and "A dog was running in a room", cat and dog have similar semantics, so knowledge learned about one can be transferred through the embedding space and generalized to new scenarios.

(Figure: the NPLM network architecture)

  • $index \in [0, 16999]$: each of the 17,000 vocabulary words is identified by an integer index
  • A shared look-up table maps each index to its embedding vector
  • The size of the hidden layer is a hyperparameter
  • The output layer has 17,000 neurons, fully connected to the hidden layer; these are the logits

In neural networks, the term "logits" typically refers to the output of the last linear layer (i.e., the output that has not been processed by an activation function). In classification tasks, the output of this linear layer is fed into the softmax function to generate a probability distribution. Logits are essentially the model's unnormalized prediction scores for each class, which can be viewed as reflecting the model's confidence level for each class.

Why are they called "logits"?#

This term comes from logistic regression, where the "logit" function is the inverse of the logistic function. In binary logistic regression, the relationship between the output probability $p$ and the logit $L$ can be expressed as:

$$L = \log\left(\frac{p}{1 - p}\right)$$
$p/(1-p)$ represents the odds, which is the ratio of the probability of an event occurring to the probability of it not occurring. Compared to probabilities, the advantage of odds is that it expands the output range to the entire real number line $(-\infty, +\infty)$, maintains a linear relationship between features and outputs, and simplifies the likelihood function.

Here, $L$ is the logit. In neural networks, although the logit function is not directly used, the term "logits" is still employed to describe the raw outputs of the network, because these raw outputs are similar to the logits in logistic regression before being processed by the softmax function.

In multi-class problems, the network's logits are typically a vector where each element corresponds to the logit of a class. For example, if a model is processing handwritten digit recognition (such as the MNIST dataset), the output logits will be a vector with 10 elements, each corresponding to a predicted score for a digit class (0 to 9).

The softmax function converts logits into a probability distribution:

$$\text{softmax}(\text{logits})_i = \frac{e^{\text{logits}_i}}{\sum_{j} e^{\text{logits}_j}}$$

Here, $e^{\text{logits}_i}$ exponentiates the $i$-th logit to make it positive, and dividing by the sum of all exponentiated logits normalizes the result into a valid probability distribution.

Summary: In the context of neural networks, logits are the model's raw prediction outputs for each class, typically before applying the softmax function. These raw scores reflect the model's prediction confidence and are used to compute loss during training, particularly cross-entropy loss, which is commonly used in many classification tasks.
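
As a quick numerical sanity check (a minimal sketch, not from the original post), here is softmax applied to a small logits vector from a hypothetical 4-class model:

import torch
import torch.nn.functional as F

# raw, unnormalized scores from a hypothetical 4-class model
logits = torch.tensor([2.0, -1.0, 0.5, 3.0])

probs = F.softmax(logits, dim=0)   # exponentiate each logit and normalize
print(probs)        # roughly [0.25, 0.01, 0.06, 0.68]
print(probs.sum())  # tensor(1.) -- a valid probability distribution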

Building NPLM#

Creating the Dataset#

# build the dataset (stoi / itos are the character-to-index and index-to-character mappings from the previous part)

block_size = 3 # context length: how many characters do we take to predict the next one
X, Y = [], []

for w in words:
    print(w)
    context = [0] * block_size
    for ch in w + '.':
        ix = stoi[ch]
        X.append(context)
        Y.append(ix)
        print(''.join(itos[i] for i in context), '--->', itos[ix])
        context = context[1:] + [ix] # crop and append
    
X = torch.tensor(X)
Y = torch.tensor(Y)

(Output: the printed "context ---> next character" pairs for the first few words)

Now let's build the neural network to use X to predict Y.

Lookup Table#


We want to embed the possible 27 characters into a low-dimensional space (the original paper embedded 17,000 words into a 30-dimensional space).

C = torch.randn((27,2))

F.one_hot(torch.tensor(5), num_classes=27).float() @ C
# one-hot vector of shape (27,) @ (27, 2) -> (2,)

This is equivalent to simply selecting row 5 of C.

In other words, the computational savings do not come from the word vectors themselves, but from replacing the one-hot matrix multiplication with a simple lookup.

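To make this concrete, here is a minimal self-contained sketch of the lookup equivalence, and of embedding the whole dataset at once (the shapes assume the 27-character, 2-dimensional setup above; X here is a stand-in random index tensor):

import torch
import torch.nn.functional as F

C = torch.randn((27, 2))                  # embedding look-up table

# one-hot matmul vs. direct row indexing: identical result
v1 = F.one_hot(torch.tensor(5), num_classes=27).float() @ C
v2 = C[5]
print(torch.allclose(v1, v2))             # True

# PyTorch indexing also accepts an integer tensor of any shape,
# so the whole dataset of contexts can be embedded in one shot
X = torch.randint(0, 27, (4, 3))          # 4 examples, block_size = 3
emb = C[X]
print(emb.shape)                          # torch.Size([4, 3, 2])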

Hidden Layer#


W1 = torch.randn((6, 100)) # Number of inputs: 3 x 2 = 6, 100 neurons
b1 = torch.randn(100)

emb @ W1 + b1 # We want to obtain this form

However, since the shape of emb is [228146, 3, 2], how do we combine 3 and 2 to become 6?

  • torch.cat(tensors, dim): concatenate the three (N, 2) slices emb[:, 0, :], emb[:, 1, :], emb[:, 2, :] along dimension 1.


Since block_size can change, we want to avoid hardcoding the number of slices, so we can use torch.unbind(tensor, dim) to get a tuple of slices and concatenate those instead.


The above method will create new memory; is there a more efficient way?

  • tensor.view():
a = torch.arange(18)

a.storage() # 0 1 2 3 4 ... 17

a.view(2, 9)    # the same 18 numbers, interpreted as a 2 x 9 tensor
a.view(3, 3, 2) # or as 3 x 3 x 2 -- any shape with 18 elements works

This approach is efficient. Every tensor has an underlying storage — the stored numbers themselves, always a one-dimensional array. Calling view() merely changes how this sequence is interpreted; no memory is changed, copied, moved, or allocated in the process, so the two tensors share the same storage.

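To see the three options side by side, here is a small self-contained sketch on a toy tensor shaped like emb (a batch of 4, block_size 3, embedding dimension 2):

import torch

emb = torch.randn(4, 3, 2)   # (batch, block_size, embedding_dim)

# 1) explicit concatenation: hardcodes block_size = 3
a = torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], dim=1)

# 2) unbind: works for any block_size, but still allocates new memory
b = torch.cat(torch.unbind(emb, dim=1), dim=1)

# 3) view: just reinterprets the existing storage, no copy
c = emb.view(-1, 6)

print(a.shape, b.shape, c.shape)                    # all torch.Size([4, 6])
print(torch.allclose(a, b), torch.allclose(b, c))   # True True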

So the final form is as follows:

emb.view(emb.shape[0], 6) @ W1 + b1
# or 
emb.view(-1, 6) @ W1 + b1

Adding a non-linear transformation:

h = torch.tanh(emb.view(-1, 6) @ W1 + b1)

Output Layer#


# 27 possible output characters
W2 = torch.randn((100, 27)) 
b2 = torch.randn(27)

logits = h @ W2 + b2

As in the previous section, we obtain counts and probabilities:

counts = logits.exp()
prob = counts / counts.sum(1, keepdim=True)

Negative log likelihood loss:

loss = -prob[torch.arange(Y.shape[0]), Y].log().mean()


The current loss is over 19, which is our starting point for training optimization.

Let's reorganize our neural network:

g = torch.Generator().manual_seed(2147483647)
C = torch.randn((27,2), generator=g)
W1 = torch.randn((6,100), generator=g)
b1 = torch.randn(100, generator=g)
W2 = torch.randn((100,27), generator=g)
b2 = torch.randn(27, generator=g)
parameters = [C, W1, b1, W2, b2]

sum(p.nelement() for p in parameters) # 3481

# forward pass
emb = C[X] # (228146, 3, 2)
h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (228146, 100)
logits = h @ W2 + b2 # (228146, 27)

Here we can use PyTorch's cross-entropy loss function to replace the previous code:

# counts = logits.exp()
# prob = counts / counts.sum(1, keepdim=True) 
# loss = -prob[torch.arange(Y.shape[0]), Y].log().mean()
loss = F.cross_entropy(logits, Y)


The results are exactly the same.

In practice, the PyTorch implementation is preferred: it runs the whole computation in fused kernels, avoids allocating extra intermediate tensors, and its simpler expression makes both the forward and backward passes more efficient. In addition, in the hand-rolled implementation, a very large logit overflows to inf after exp, which turns the probabilities into NaN.

How does the PyTorch implementation solve this problem?
For example, for logits = torch.tensor([1,2,3,4]) and logits = torch.tensor([1,2,3,4]) - 4, although their absolute values differ, their relative differences remain unchanged. The softmax function is sensitive to the relative differences of the inputs, not their absolute values.

$$\text{softmax}(\text{logits})_i = \frac{e^{\text{logits}_i}}{\sum_{j} e^{\text{logits}_j}}$$

When you subtract a constant from logits, the properties of the exponential function cause the exponentials of each logit to decrease by the same factor. However, since this constant is subtracted from each logit, it cancels out in both the numerator and denominator, thus not affecting the final probability distribution. In other words, for any logits vector and constant C:

$$\frac{e^{\text{logits}_i - C}}{\sum_{j} e^{\text{logits}_j - C}} = \frac{e^{\text{logits}_i} / e^C}{\sum_{j} e^{\text{logits}_j} / e^C} = \frac{e^{\text{logits}_i}}{\sum_{j} e^{\text{logits}_j}}$$

PyTorch prevents numerical overflow when calculating e^logits by internally computing the maximum value in logits and then subtracting this value.
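
Here is a small self-contained sketch of the overflow problem and of the max-subtraction fix (the logit values are made up for illustration):

import torch
import torch.nn.functional as F

logits = torch.tensor([1.0, 2.0, 1000.0])   # one pathologically large logit

# naive softmax: exp(1000) overflows to inf, and inf / inf = nan
counts = logits.exp()
print(counts / counts.sum())                # tensor([0., 0., nan]) -- broken

# subtracting the max changes nothing mathematically but stays finite
stable = (logits - logits.max()).exp()
print(stable / stable.sum())                # tensor([0., 0., 1.])

# F.softmax / F.cross_entropy apply this kind of stabilization internally
print(F.softmax(logits, dim=0))             # tensor([0., 0., 1.])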

Training#

for p in parameters:
    p.requires_grad_()
    
for _ in range(10): # The dataset is large, testing if the optimization is successful
    # forward pass
    emb = C[X] # (228146, 3, 2)
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (228146, 100)
    logits = h @ W2 + b2 # (228146, 27)
    loss = F.cross_entropy(logits, Y)

    # backward pass
    for p in parameters:
        p.grad = None
    loss.backward()

    # update
    for p in parameters:
        p.data += -0.1 * p.grad 

print(loss.item())


Since training here runs on the entire dataset, the loss only comes down to a moderately small value, and you can see the outputs already bear some resemblance to the correct targets. If we trained on just a single batch, we could overfit it and make the predictions almost identical to the labels. Even then the loss would not get very close to 0, because, for example, the same starting context '...' is followed by many different first letters, so complete overfitting is not possible.

Mini-batch#

Since each iteration runs the forward and backward pass over roughly 228,000 examples, every step is slow and computationally expensive. In practice, it is standard to perform the forward pass, backward pass, and update on many small, randomly selected batches of data — mini-batches — and iterate over them.

for _ in range(1000):
    # mini-batch: sample 32 random example indices
    ix = torch.randint(0, X.shape[0], (32,))

    # forward pass (only on the sampled rows)
    emb = C[X[ix]] # (32, 3, 2)
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 100)
    logits = h @ W2 + b2 # (32, 27)
    loss = F.cross_entropy(logits, Y[ix])

    # backward pass and update: unchanged from the full-batch version above

Now the iteration speed has become very fast.

This means we are no longer following the exact gradient direction; instead, we take many steps along an approximate gradient, which turns out to be very effective in practice.


Learning Rate#

lre = torch.linspace(-3, 0, 1000)
lrs = 10**lre # 1000 candidate learning rates between 0.001 and 1

lri = []
lossi = []

for i in range(1000):
    '''
    mini-batch, forward and backward pass code 
    '''
    # update
    lr = lrs[i]
    for p in parameters:
        p.data += -lr * p.grad
    # track stats
    lri.append(lre[i])
    lossi.append(loss.item())

plt.plot(lri, lossi)

(Plot: loss versus learning-rate exponent lre)

As shown, the loss reaches a minimum near -1.0, and the learning rate at this point is $10^{-1}=0.1$.

Now we have confidence in selecting the learning rate.

lr = 0.1
for p in parameters:
    p.data += -lr * p.grad

After running several rounds of 10,000 steps, the loss stabilizes around 2.4. At this point, we can lower the learning rate (learning rate decay), for example by a factor of ten to 0.01, and train a few more rounds to obtain a roughly trained network.
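
Putting the pieces together, here is a minimal sketch of the training loop with this two-stage learning-rate decay (the step counts are illustrative, not the exact ones used above):

# continues from the setup above: X, Y, C, W1, b1, W2, b2, parameters
for i in range(30000):
    # mini-batch
    ix = torch.randint(0, X.shape[0], (32,))

    # forward pass
    emb = C[X[ix]]
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Y[ix])

    # backward pass
    for p in parameters:
        p.grad = None
    loss.backward()

    # update, decaying the learning rate for the last stretch of training
    lr = 0.1 if i < 20000 else 0.01
    for p in parameters:
        p.data += -lr * p.grad

print(loss.item())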


The loss is much lower than that of the previous bigram model, so can we say this model is better than the previous one?

Actually, this statement is not correct. If we increase the number of parameters, the loss of this model can even approach 0, but sampling from it would only yield examples identical to those in the training set, and the loss on unseen words could be very high, so this is not a good model.

This leads to the standard practice in this field: splitting the dataset into three parts, namely training split, validation (dev) split, and test split, which we are familiar with as training set, validation set, and test set.

  1. Training Split:

    • Purpose: Used to train the model's parameters, i.e., the weights and biases in the model.
    • Process: During training, the model attempts to learn the features and patterns of the data and adjusts its parameters through optimization algorithms like backpropagation and gradient descent to minimize the loss function.
  2. Validation Split:

    • Purpose: Used to train (tune) the model's hyperparameters, such as learning rate, number of layers, size of layers, etc.
    • Process: Throughout the training process, we continuously evaluate the model's performance on the validation set to adjust and select the best hyperparameters. The validation set helps us assess the model's generalization ability without touching the test set, avoiding overfitting.
  3. Test Split:

    • Purpose: Used to evaluate the final performance of the model, i.e., the model's potential performance in real applications.
    • Process: During the model development phase, the test set is completely untouched. Only after the model has been trained and all hyperparameters have been tuned on the validation set do we use the test set for testing. This provides an unbiased evaluation on unseen data, giving a true representation of the model's performance on new data.

This partitioning method helps researchers and developers avoid data leakage and overfitting, both of which can lead to models performing well on the training set but poorly on unseen new data. By doing so, we can more confidently predict the model's performance in the real world.

We will encapsulate the dataset building process into a function and then perform the three-part split:

def build_dataset(words):
	'''
	previous code
	'''
	return X, Y

import random

random.seed(42)
random.shuffle(words)
n1 = int(0.8 * len(words))
n2 = int(0.9 * len(words))

Xtr, Ytr = build_dataset(words[:n1])     # 80% train
Xdev, Ydev = build_dataset(words[n1:n2]) # 10% validation
Xte, Yte = build_dataset(words[n2:])     # 10% test

This gives us the number of examples in each of the three splits.
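
For example, a quick check of the split sizes (not from the original post):

print(Xtr.shape, Xdev.shape, Xte.shape)
# roughly 80% / 10% / 10% of the ~228,000 examples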

Modify the neural network training part:

ix = torch.randint(0, Xtr.shape[0], (32,))
emb = C[Xtr[ix]] # the forward pass now indexes the training split
loss = F.cross_entropy(logits, Ytr[ix])



(Output: loss evaluated on the training split and on the validation split)

You can see that the losses of the training set and validation set are close, so our model is not powerful enough to overfit. This state is called underfitting, which usually means our network has too few parameters.

The simplest way is to increase the number of neurons in the hidden layer:

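The screenshot showed the re-initialization; here is a minimal sketch assuming 300 hidden neurons (an assumption, chosen to land in the ballpark of the parameter count mentioned below):

g = torch.Generator().manual_seed(2147483647)
C = torch.randn((27, 2), generator=g)
W1 = torch.randn((6, 300), generator=g)   # 300 hidden neurons instead of 100
b1 = torch.randn(300, generator=g)
W2 = torch.randn((300, 27), generator=g)
b2 = torch.randn(27, generator=g)
parameters = [C, W1, b1, W2, b2]

sum(p.nelement() for p in parameters)     # 10281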

Now there are over 10,000 parameters, a significant increase from the original roughly 3,500.

(Plot: training loss over optimization steps)

You can see that the loss curve looks "thick": training on mini-batches introduces noise into each step.

The current network performance is still poor, and the bottlenecks affecting performance include:

  • Mini-batch size is too small, leading to excessive noise.
  • The embedding method is problematic, placing too many characters into a two-dimensional space, which the neural network cannot utilize effectively.

Visualizing the current embeddings:

plt.figure(figsize=(8, 8))
plt.scatter(C[:,0].data, C[:,1].data, s=200)
for i in range(C.shape[0]):
    plt.text(C[i,0].item(), C[i,1].item(), itos[i], ha="center", va="center", color="white")
plt.grid('minor')

(Plot: the learned 2-D character embeddings)

You can see that training has produced some structure: g, q, p, and . end up as special, isolated vectors, while characters like x, h, and b are clustered together and treated as similar, interchangeable vectors.

The 2-dimensional embedding is likely the network's bottleneck, so let's enlarge it (along with the hidden layer):

C = torch.randn((27,10), generator=g)
W1 = torch.randn((30,200), generator=g) # 3 x 10 = 30 inputs
b1 = torch.randn(200, generator=g)
W2 = torch.randn((200,27), generator=g)
b2 = torch.randn(27, generator=g)
# remember to change emb.view(-1, 6) to emb.view(-1, 30) in the training loop


The loss is smaller than before, indicating that the embedding vectors indeed have a significant impact.

Further optimization strategies include:

  • Embedding vector dimensions
  • Context length
  • Number of neurons in the hidden layer
  • Learning rate
  • Training practices
  • ......

Sampling#

Finally, let's sample to see the current performance of the model:

g = torch.Generator().manual_seed(2147483647 + 10)

for _ in range(20):
    out = []
    context = [0] * block_size # initialize with all ...
    while True:
        emb = C[torch.tensor([context])]  # (1, block_size, D)
        h = torch.tanh(emb.view(1, -1) @ W1 + b1)
        logits = h @ W2 + b2
        probs = F.softmax(logits, dim=1)
        ix = torch.multinomial(probs, num_samples=1, generator=g).item()
        context = context[1:] + [ix]
        out.append(ix)
        if ix == 0:
            break
    print(''.join(itos[i] for i in out))


The samples are starting to look somewhat name-like, though there is clearly still room for improvement.

Next, we will move on to more modern architectures such as CNNs, GRUs, and Transformers.
