This article is a summary of Andrej Karpathy's talk "State of GPT" at Microsoft Build in May 2023.
The presentation slides can be found at: https://karpathy.ai/stateofgpt.pdf
The talk covers GPT's training pipeline, its development to date, the current LLM ecosystem, and future prospects. Even a year later, it remains relevant and is worth comparing against current developments.
Overview#
The pretraining stage takes up the vast majority of the compute and training time; the three subsequent stages (supervised fine-tuning, reward modeling, and reinforcement learning) are all fine-tuning stages.
Data collection#
Training data mixture used in Meta's LLaMA model
Composition of the mixture:
- CommonCrawl: data from a general-purpose web crawl
- C4: a huge, cleaned-up version of Common Crawl
- Other high-quality datasets as shown
These knowledge sources are mixed together and then sampled according to a given ratio to form the training set for the GPT neural network.
Pre-training#
Before actual training, the following preprocessing steps are needed.
Tokenization#
Essentially, tokenization is a lossless transformation that converts the raw text crawled from the internet into a sequence of integers.
Methods like byte-pair encoding (BPE) can be used to iteratively merge small text blocks and group them into tokens.
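As a rough illustration of that merging process (a toy sketch, not GPT's actual tokenizer), the snippet below finds the most frequent adjacent pair of symbols in a byte sequence and merges it into a new token id; a real tokenizer repeats this for tens of thousands of merges:

# Toy illustration of one byte-pair-encoding (BPE) merge step.
from collections import Counter

def most_frequent_pair(ids):
    # Count how often each adjacent pair of symbols occurs.
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` with the new token id.
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "low lower lowest"
ids = list(text.encode("utf-8"))      # start from raw bytes
pair = most_frequent_pair(ids)        # e.g. the pair of bytes for "lo"
ids = merge(ids, pair, 256)           # 256 = first id beyond the byte range
print(pair, ids)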
Hyperparameter table for Transformer neural networks
- Vocabulary Size:
  - The vocabulary size is the total number of unique tokens (words, characters, or subword units) the model can recognize and generate. It defines the range of possible inputs and outputs.
  - In the Bigram example, the outputs are single letters, so the "character table" has a size of 27.
- Context Length:
  - The context length is the maximum number of units (usually tokens) the model can consider when processing text. It bounds how much information the model can draw on when generating or understanding text.
  - In the NPLM model, the context length is set to 3, meaning it predicts the next token based on the previous three tokens.
  - Nowadays, models like GPT-4 can accept context windows of over a hundred thousand tokens.
- Parameters:
  - The parameter count is the total number of parameters that make up the model; these are learned and adjusted during training to minimize prediction error. Parameters are the weights and biases the neural network uses to transform input data into output, and their number is a direct indicator of the model's size and complexity.
  - The code in NPLM counts the model's parameters and tries to improve the fit by, among other things, increasing the number of hidden-layer neurons:
import torch

g = torch.Generator().manual_seed(2147483647)
C = torch.randn((27, 2), generator=g)       # embedding table: 27 characters -> 2-dim vectors
W1 = torch.randn((6, 100), generator=g)     # hidden layer: 3 context chars x 2 dims = 6 inputs, 100 neurons
b1 = torch.randn(100, generator=g)
W2 = torch.randn((100, 27), generator=g)    # output layer: 100 hidden units -> logits over 27 characters
b2 = torch.randn(27, generator=g)
parameters = [C, W1, b1, W2, b2]
sum(p.nelement() for p in parameters)       # total parameter count: 3,481
- Training Tokens Total:
  - The total number of training tokens is the number of tokens (words, characters, or subword units) the model processes during training. It reflects how much data the model "sees" and learns from.
  - The names dataset contains only about 30k examples in total; by contrast, modern LLMs are trained on hundreds of billions to trillions of tokens (1B = one billion).
It can be seen that LLaMA's training volume is very large (roughly 1.4 trillion tokens): although it has far fewer parameters than GPT-3 (65B vs. 175B), it is trained for much longer and achieves better results, so performance cannot be judged solely by parameter count.
At the beginning, the model's weights are random, so the sampling results are completely random. As training time increases, the samples obtained from the model become more coherent and consistent. From the last part of the figure, it can be seen that this GPT model has learned knowledge about words, spaces, and punctuation.
Training curve#
As training progresses, the focus is on the change in the loss function; lower loss means that this Transformer model assigns a higher probability to the correct next integer in the sequence.
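Concretely, the loss is the cross-entropy of the model's next-token prediction. A minimal sketch (the shapes and random logits below are stand-ins for a real model's output):

import torch
import torch.nn.functional as F

vocab_size, batch, seq_len = 50257, 4, 8                # hypothetical sizes
logits = torch.randn(batch, seq_len, vocab_size)        # stand-in for model output
targets = torch.randint(vocab_size, (batch, seq_len))   # the true next tokens

# Cross-entropy over the vocabulary at every position; lower loss means the
# model puts more probability on the correct next token.
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
print(loss.item())   # for reference, a uniform prediction gives ln(50257) ≈ 10.8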
GPT has learned powerful general representations and can be fine-tuned very efficiently for any downstream task.
For example, take sentiment classification. In related research I previously participated in, we used a Baidu sentiment-analysis API to label the data we had collected as positive or negative, producing data usable for supervised learning, and then trained other NLP models on it. With GPT, however, you can largely skip that task-specific pipeline: do LLM pretraining, then efficiently fine-tune the Transformer for the task with only a small amount of labeled data.
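A rough sketch of that kind of fine-tuning (everything here is a stand-in: a frozen embedding layer plays the role of the pretrained Transformer backbone, and the labeled batch is random); only a small task-specific head is trained, which is why little labeled data is needed:

import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, hidden_dim, num_classes = 1000, 64, 2

# Stand-in for a pretrained backbone: an embedding bag that averages token
# embeddings into one feature vector per sequence.
backbone = nn.EmbeddingBag(vocab_size, hidden_dim)
for p in backbone.parameters():
    p.requires_grad_(False)                  # keep the "pretrained" weights frozen

head = nn.Linear(hidden_dim, num_classes)    # small task-specific head
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# A tiny fake labeled batch: token ids plus 0/1 sentiment labels.
tokens = torch.randint(vocab_size, (8, 16))
labels = torch.randint(num_classes, (8,))

for step in range(10):
    logits = head(backbone(tokens))          # (batch, num_classes)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(loss.item())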
This is because, in the language-modeling task, the Transformer is forced to multitask across an enormous range of problems: to predict the next token it must understand the structure of the text and all the different concepts within it. This was the GPT-1 era.
In the GPT-2 era, people noticed that prompting is often better and more effective than fine-tuning.
Base model ≠ Assistant#
The base model is not the ChatGPT we usually use. The GPT APIs we call, such as Davinci and GPT-3.5 Turbo, are assistants; a true base model is not an assistant, it just wants to complete documents:
You can turn a base model into something like an assistant by writing a specific few-shot prompt that makes the text look like a document of human-assistant interactions, with your query placed at the end; the base model then conditions itself into an assistant that answers questions. However, this is not very reliable and the results can be subpar, so creating true GPT assistants took a different path.
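For illustration only (not a real API call), such a few-shot prompt might look like this; the final "Assistant:" line invites the base model to continue the fake transcript with an answer:

# Purely illustrative few-shot prompt for a raw base model.
# The document "looks like" a human/assistant transcript, so the most likely
# continuation after the final "Assistant:" is an answer to the query.
prompt = """Human: What is the capital of France?
Assistant: The capital of France is Paris.

Human: How many legs does a spider have?
Assistant: A spider has eight legs.

Human: Why is the sky blue?
Assistant:"""
# completion = base_model.generate(prompt)   # hypothetical base-model call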
Supervised Finetuning#
In this stage, small but high-quality datasets of prompts and ideal responses are collected through crowdsourcing. Language modeling continues on this data with the algorithm unchanged; only the dataset is swapped. The result is the SFT (Supervised Fine-Tuning) model, a deployable, genuine assistant with a certain degree of effectiveness.
Diagram showing that responses written by crowd workers are required to be friendly, truthful, and harmless, following the instruction document.
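A minimal sketch of the objective (the token ids below are made up and the logits are random stand-ins for a real model's output): the loss is ordinary next-token cross-entropy over the prompt+response document, often masked so that only the response tokens are supervised:

import torch
import torch.nn.functional as F

# Hypothetical tokenized example: prompt tokens followed by the ideal response.
prompt_ids = torch.tensor([101, 7592, 2088, 102])
response_ids = torch.tensor([2023, 2003, 1996, 3437, 0])
input_ids = torch.cat([prompt_ids, response_ids])

vocab_size = 50257
logits = torch.randn(len(input_ids) - 1, vocab_size)   # stand-in for model output

# Standard next-token targets, but ignore positions inside the prompt so the
# loss is computed only on the response the human labeler wrote.
targets = input_ids[1:].clone()
targets[: len(prompt_ids) - 1] = -100                   # -100 = ignored by cross_entropy
loss = F.cross_entropy(logits, targets, ignore_index=-100)
print(loss.item())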
RLHF pipeline#
RLHF is an acronym for reinforcement learning from human feedback; it comprises the reward modeling and reinforcement learning stages below.
Reward Modeling#
In this stage, the collected data is converted into comparisons: the same prompt is given to the trained SFT model, which creates several versions of the response, and people then rank which version is better (this is very time-consuming).
These completions are then laid out, with a special reward token appended to the end of each one, and the Transformer is supervised essentially only at that (green) token: it predicts a scalar reward for how good the completion is for the given prompt. The human comparison labels are turned into a loss function, and the model is trained so that its reward predictions are consistent with the crowd-sourced human rankings.
Once the reward model is established, it is not very useful as an assistant by itself, but it is very useful for the subsequent reinforcement learning phase.
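A sketch of how the comparisons become a loss (the standard pairwise formulation; the scalar rewards here are stand-ins for values read out at the reward token):

import torch
import torch.nn.functional as F

# Stand-in scalar rewards for two completions of the same prompt; in reality
# these are produced by the Transformer at the special reward token.
reward_preferred = torch.tensor(1.3, requires_grad=True)
reward_rejected = torch.tensor(0.4, requires_grad=True)

# Pairwise ranking loss: push the preferred completion's reward above the
# rejected one's. Minimizing this makes the model agree with human rankings.
loss = -F.logsigmoid(reward_preferred - reward_rejected)
loss.backward()
print(loss.item())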
Reinforcement Learning#
Based on the reward model, we can score the quality of completions for any given prompt. This stage will conduct reinforcement learning based on the reward model.
The training objective now is the (yellow) completion tokens. Using the SFT model, completions are created for a set of prompts; a reward token is appended again, and the language-modeling objective is weighted by the rewards given by the now-fixed reward model.
As the figure shows, completions that the reward model scores highly have their tokens up-weighted, making them more likely in future sampling, while low-scoring completions are suppressed.
Iterating over many prompts and many batches yields a policy that produces completions scoring highly under the reward model.
This results in a deployable RLHF model.
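A heavily simplified sketch of that reward weighting (REINFORCE-style; the real pipeline uses PPO with extra machinery such as a KL penalty against the SFT model):

import torch

# Stand-in values: log-probabilities the current policy (SFT model) assigned
# to the tokens of one sampled completion, and the fixed reward model's score.
token_logprobs = torch.tensor([-2.1, -0.7, -1.5, -0.3], requires_grad=True)
reward = 0.8   # scalar from the frozen reward model

# Reward-weighted language modeling: high-reward completions get their tokens
# reinforced (made more likely), low- or negative-reward ones get suppressed.
loss = -(reward * token_logprobs.sum())
loss.backward()
print(token_logprobs.grad)   # gradient descent on this loss pushes these tokens' probabilities up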
Why RLHF?#
It performs better: humans prefer the responses generated by RLHF models.
One possible explanation: judging and generating are not equally difficult. For example, compared with composing an excellent haiku (a form of classical Japanese poetry), judging which of several haikus is best is clearly much easier. RLHF can potentially leverage this easier human judgment to build better models.
However, the RLHF model loses some entropy compared with the base model: its outputs vary less, whereas the base model has high entropy and produces many diverse outputs. So for tasks like "generate more Pokémon names," the base model may perform better.
Apply LLM Assistant#
Chain of thought#
a.k.a. CoT
When humans write, we have an independent process in our minds—constantly reviewing what we have written and judging whether it is good enough. We may delete parts or rewrite, and then be satisfied with the result.
From GPT's perspective, text is just a sequence of tokens: it looks at every token and spends the same amount of computation on each one, with none of that inner monologue. In other words, the Transformer is a token simulator; it does not know what it is or is not good at, it simply tries to fit the next token. It does not reflect on, sanity-check, or correct anything it produces.
GPT does store a vast amount of factual knowledge across many domains in its tens of billions of parameters, and its context window acts as working memory: anything within the context length is immediately accessible through the Transformer's self-attention mechanism, essentially without loss. It is like a limited amount of perfect memory.
Prompting is merely a way to bridge the cognitive gap between the human brain and the "LLM brain." In practice, if a task requires reasoning, you cannot expect the Transformer to do too much reasoning per token; you must spread the reasoning across more tokens.
Chain-of-thought prompts (e.g., "let's think step by step") make the LLM lay out this reasoning token by token.
Or sample multiple times and then vote on the answer (self-consistency), as sketched below.
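A sketch of that sample-and-vote idea; sample_answer() here is a stand-in for an LLM call at nonzero temperature that returns just the final answer:

import random
from collections import Counter

def sample_answer(prompt):
    # Stand-in for one LLM call with temperature > 0 that returns the final
    # answer extracted from a chain-of-thought completion.
    return random.choice(["42", "42", "42", "41"])   # noisy but mostly right

def self_consistency(prompt, n=10):
    answers = [sample_answer(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]     # majority vote

print(self_consistency("What is 6 * 7?"))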
Alternatively, we often have to tell the LLM that its answer is incorrect, at which point it recognizes the error; but if you do not prompt it, it will not check itself.
Tree of Thoughts#
Tree of Thoughts extends existing planning approaches by considering multiple potentially feasible plans simultaneously and using value feedback to make decisions. It also introduces a self-reflection mechanism that lets the language model evaluate the feasibility of its own candidates.
This is implemented with Python glue code driving many individual prompts, together with a tree-search algorithm, rather than a single prompt.
AlphaGo imitates human decision-making, evaluates the possible moves with Monte Carlo tree search, and keeps the good ones.
Can be likened to AlphaGo for text.
People are beginning to explore more general techniques of this kind: not a single prompt, but many prompts glued together with code.
Chains / Agents#
- ReAct: structures the response to a prompt as an interleaving of thoughts, actions, and observations, answering the question through a thinking process during which the model can also call external tools.
- AutoGPT: lets an LLM keep a task list and recursively break tasks down to complete a project; it does not work very well yet, but it is thought-provoking.
Condition on good performance#
Prompt Engineering#
LLMs have a "psychological quirk": they do not "want" to successfully complete your task; they only want to imitate their training data. You, however, want the task completed well, so you have to ask for it.
For example, for a math problem the training data may contain a student giving a wrong answer as well as a math expert giving a perfect answer. The Transformer cannot tell which answer you want it to imitate, so you must explicitly ask it for strong performance.
It may sound absurd, but asking means the Transformer does not have to spread its probability mass over low-quality solutions.
However, do not ask for something like "an IQ of 400," as that may push the request out of distribution and produce science-fiction-like answers.
Tools / Plugins#
LLMs do not inherently know what they are not good at.
Therefore, you can even tell it in the prompt, "you are not good at arithmetic," and instruct it to hand the calculation part off to a tool (for example by emitting a call to a calculator API), as sketched below.
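A toy sketch of that pattern; the CALC(...) marker and the glue code are hypothetical, not a real plugin API:

import re

def calculator(expression):
    # A stand-in "calculator API"; a real system would use a safe evaluator.
    return str(eval(expression, {"__builtins__": {}}))

def run_with_tools(model_output):
    # Replace every CALC(...) marker the model emitted with the exact result.
    return re.sub(r"CALC\(([^)]*)\)",
                  lambda m: calculator(m.group(1)),
                  model_output)

# Pretend the LLM, told "you are not good at arithmetic, use CALC(...)",
# produced this completion:
print(run_with_tools("The total cost is CALC(123 * 37) dollars."))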
Retrieval-Augmented LLMs#
Search engines are representative of the retrieval-only approach, which was mainstream before the LLM era.
LLMs by themselves are memory-only, and between the two extremes lies a large space of retrieval-augmented approaches.
Taking LlamaIndex as an example, it can connect different types of data, index this data, and open it to LLM.
The principle: split the given documents into chunks, embed each chunk to get an embedding vector, and store those vectors in a vector store. At query time, the vector store is searched for chunks likely to be relevant to the task, and the retrieved chunks are stuffed into the prompt before generating the answer. This works very well in practice.
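A minimal sketch of that pipeline; embed() here is a deliberately crude stand-in for a real embedding model, and the "vector store" is just a list:

import math
from collections import Counter

def embed(text):
    # Hypothetical embedding: a real system calls an embedding model.
    # Here, a crude bag-of-words vector just to make the sketch runnable.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1) Split documents into chunks and embed them into a "vector store".
chunks = ["LLaMA was trained on about 1.4 trillion tokens.",
          "RLHF fine-tunes a model using a learned reward model."]
store = [(chunk, embed(chunk)) for chunk in chunks]

# 2) At query time, retrieve the most relevant chunk and stuff it into the prompt.
query = "How many tokens was LLaMA trained on?"
best = max(store, key=lambda item: cosine(item[1], embed(query)))[0]
prompt = f"Context: {best}\n\nQuestion: {query}\nAnswer:"
print(prompt)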
It can be likened to a student with excellent memory still hoping to find the exact text in the textbook before the final exam.
Constraint prompting#
This essentially involves techniques that force LLMs to output specific templates.
In the example below, we force the LLM to output in JSON format, and then we can impose additional restrictions on the content placed in these blank spaces in the template.
Microsoft's "guidance" project
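A schematic sketch of the idea (not the actual guidance API; llm_fill() is a hypothetical call that generates a single short value): the glue code owns the JSON skeleton and the model only fills in the blanks, so the output is guaranteed to parse:

import json

def llm_fill(prompt, field):
    # Hypothetical: ask the model for just the value of one field,
    # e.g. with a stop sequence at the closing quote. Stubbed here.
    return {"name": "Ada Lovelace", "occupation": "mathematician"}[field]

def constrained_person(description):
    # The template is fixed by code; only the blanks are generated, so the
    # result is valid JSON by construction and each field can be validated.
    person = {field: llm_fill(description, field)
              for field in ("name", "occupation")}
    return json.dumps(person)

print(constrained_person("Write a short bio of a famous programmer."))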
Finetuning#
Model fine-tuning means changing the model's weights, and some techniques have made this much cheaper than before. For example, with LoRA you train only small, low-rank pieces of the model while the rest of the weights stay frozen (and can be kept in low precision, since they are only used for inference and never updated by gradient descent), achieving effective fine-tuning with good results at much lower cost.
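A small sketch of the LoRA idea (generic, not any particular library's implementation): the pretrained weight W stays frozen and only a low-rank update B·A is trained:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_dim, out_dim, rank=8):
        super().__init__()
        # Frozen "pretrained" weight W (random here for the sketch).
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim), requires_grad=False)
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)   # small trainable factor
        self.B = nn.Parameter(torch.zeros(out_dim, rank))         # starts at zero: no change initially

    def forward(self, x):
        # Frozen base projection plus the trainable low-rank correction B @ A.
        return x @ self.weight.T + x @ (self.B @ self.A).T

layer = LoRALinear(768, 768)
y = layer(torch.randn(2, 768))   # same interface as a normal linear layer
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(trainable, "of", total, "parameters are trained")   # only A and B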
Suggestions#
Suggestions are divided into two goals as follows, which also serve as a review.
Goal 1: Achieve optimal performance:
- Use GPT-4.
- Design prompts with detailed task context, relevant information, and instructions.
- Think about what you would tell a human task contractor who cannot email you back with questions, and put that into the prompt.
- Add any relevant context or information to the prompt.
- Try using the prompt engineering techniques mentioned in previous slides.
- Experiment with a few examples, ensuring these examples are 1) relevant to the test case, and 2) diverse (if appropriate).
- Use tools/plugins to offload tasks that are difficult for large language models (LLMs) to handle (such as calculations, code execution, etc.).
- Spend quality time optimizing your pipeline/"chain."
- If you are confident that you have maximized prompt utility, consider collecting SFT data and RLHF fine-tuning.
Goal 2: Optimize costs:
- Once optimal performance is achieved, try cost-saving measures (e.g., using shorter prompts, GPT-3.5, etc.).