On April 11, 2022, I wrote the following:

While current models may still have limited capacity, I’m optimistic that we will see better and better AI models in the future, and that one day machines will have enough intelligence to handle the tasks and jobs we do today.

Seven months later, OpenAI released ChatGPT, whose performance stunned the public. It can chat, write stories, debug code…there had never been a chatbot so intelligent and so fluent in dialogue with humans. Previously, AI applications were largely passive: think of face recognition, speech recognition, OCR, autonomous driving, recommendation systems…they run mostly behind the scenes and are not something people interact with directly. ChatGPT is different: it is like a person that anyone can talk to at any time. Its interactive nature quickly made it popular around the world.

As we will see, there is not much magic behind ChatGPT. Its building blocks are Attention [1] and the Transformer [2], proposed in 2014 and 2017 respectively, so ChatGPT did not employ any radically new technology. Here is a high-level summary of the approach: First, a large neural network is trained on Internet data to predict the next word. Then it is fine-tuned on a prompt-response dialogue dataset created by human labelers. Its generated responses to prompts are then evaluated by humans, and the model is further trained to maximize the human evaluation score. That’s basically it. However, an idea is one thing, and scaling an idea to make it work in the real world is another. What is truly marvelous about OpenAI is their exceptional engineering effort, from data collection to training, monitoring, and deployment. Never underestimate the investment required to make something like ChatGPT!

The Models

| Model   | Date          | Size        |
| ------- | ------------- | ----------- |
| GPT     | June 2018     | 117M        |
| GPT-2   | February 2019 | 1.5B        |
| GPT-3   | May 2020      | 175B        |
| ChatGPT | November 2022 | undisclosed |
| GPT-4   | March 2023    | undisclosed |

GPT

GPT [3] is not very complicated. It is a Transformer model. All it does is output a probability distribution over the next word given the input context. To generate the next word, one selects the word with the highest probability. Here is a snapshot from the paper:

GPT model

Given previous context $[u_{i-k},\ldots,u_{i-1}]$, the simplest way to generate the next word $u_i$ is to assume that it only depends on its left neighbor $u_{i-1}$, count the occurrences of all words that come after $u_{i-1}$ in the training corpus, and then select the word with the highest frequency. This is the 2-gram model we talked about earlier.
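To make the 2-gram idea concrete, here is a minimal sketch in Python (the toy corpus and whitespace tokenization are made up for illustration):

```python
from collections import Counter, defaultdict

# Toy corpus; in practice this would be a large tokenized text collection.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

# For every word, count which words follow it and how often.
next_word_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_word_counts[prev][nxt] += 1

def predict_next(word):
    # 2-gram prediction: the most frequent word that followed `word` in training.
    return next_word_counts[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" (it follows "the" twice in this toy corpus)
```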

To model context dependence, we need something more sophisticated than that. The Attention model [1] is the exact opposite of the Markov assumption: instead of depending only on the nearest word, the output depends on the entire previous context. It is a simple and efficient way to pay attention to previous context when generating the next word. An output $u_i$ is a weighted sum of all the words in its context $(u_{i-1},\ldots,u_{i-k})$:

$$ u_i = \alpha_1 v_{i-1} + \alpha_2 v_{i-2} + \cdots + \alpha_k v_{i-k}, $$

where the weight $\alpha_j$ represents how much the output should pay attention to $u_{i-j}$, and $v_{i-j}=Vu_{i-j}$ with a learnable parameter matrix $V$. See this post for details. This way,

  • the computation is easily parallelizable, so it is well suited to GPUs;
  • it scales easily to long contexts, for example 3,000 tokens or even more;
  • it scales easily to larger model sizes.

Not all model architectures in deep learning are easily scalable. Scalability is clearly one of the biggest advantages of the Attention model.
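Here is a minimal NumPy sketch of the weighted-sum idea above, using single-head dot-product attention with softmax weights; the dimensions and random parameter matrices are purely illustrative, not GPT’s actual parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)

k, d = 8, 16                 # context length and embedding dimension
U = rng.normal(size=(k, d))  # embeddings of the context words u_{i-k}, ..., u_{i-1}

# Learnable parameter matrices (random placeholders here).
Q = rng.normal(size=(d, d))  # maps the current position to a query
K = rng.normal(size=(d, d))  # maps context words to keys
V = rng.normal(size=(d, d))  # the matrix V from the text: v = V u

query = U[-1] @ Q                    # query derived from the most recent word
keys = U @ K                         # one key per context word
scores = keys @ query / np.sqrt(d)   # relevance of each context word
scores -= scores.max()               # numerical stability for the softmax
alphas = np.exp(scores) / np.exp(scores).sum()   # attention weights, sum to 1

values = U @ V                       # v_{i-j} = V u_{i-j}
output = alphas @ values             # the weighted sum from the formula above
print(output.shape)                  # (16,)
```

Every step is a matrix multiplication over the whole context at once, which is exactly why this computation parallelizes so well on GPUs.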

The Transformer model [2] is an attention layer plus residual connections, linear layers, and normalization.

Transformer model architecture

Those extra components are also indispensable:

  • Residual connections ensure that gradients can flow directly backward from the output to the input. Without them, convergence is often difficult.
  • Normalization keeps training robust and helps address over-fitting. Without it, the training loss can easily explode.
  • Linear layers add more parameters, increasing model capacity. Without them, the loss will be higher and performance much worse.

And there you have it: 12 layers of Transformer blocks. That’s basically GPT.
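For concreteness, here is a minimal PyTorch sketch of one such block (self-attention plus residual connections, linear layers, and normalization); the hyperparameters are chosen for illustration and details like dropout are omitted, so this is not GPT’s exact configuration:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One decoder block: self-attention and an MLP, each wrapped in a
    residual connection followed by layer normalization."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask: each position may only attend to earlier positions.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.ln1(x + attn_out)       # residual connection + normalization
        x = self.ln2(x + self.mlp(x))    # residual connection + normalization
        return x

# "12 layers of Transformer blocks" is then simply:
gpt_body = nn.Sequential(*[TransformerBlock() for _ in range(12)])
```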

In those days, fine-tuning on downstream tasks was standard practice. To apply GPT to a particular kind of problem, it had to be fine-tuned on a dataset for that problem. Here is a description from the paper [3]:

After training the model with the objective, we adapt the parameters to the supervised target task…we assume a labeled dataset $\mathcal{C}$, where each instance consists of a sequence of tokens, $x^1,\ldots,x^m$, along with a label $y$. The inputs are passed through our pre-trained model to obtain the final transformer block’s activation $h_l^m$, which is then fed into an added linear output layer with parameters $W_y$ to predict $y$:

$$P(y \mid x^1,\ldots,x^m) = \texttt{softmax}(h_l^mW_y).$$

This gives us the following objective to maximize: $$L_2(\mathcal{C}) = \sum_{(x,y)}\log P(y\mid x^1,\ldots,x^m).$$
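In code, that fine-tuning step roughly amounts to putting a linear head on top of the pre-trained network and maximizing the log likelihood of the labels. The sketch below assumes a hypothetical `pretrained_gpt` that returns the final transformer block’s activations:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_classes = 768, 2
W_y = nn.Linear(d_model, n_classes, bias=False)   # the added output layer W_y

def finetune_loss(pretrained_gpt, tokens, label):
    # tokens: (1, m) input sequence x^1 ... x^m; label: the class index y.
    h = pretrained_gpt(tokens)      # (1, m, d_model) final-block activations
    h_l_m = h[:, -1, :]             # h_l^m: activation at the last position
    logits = W_y(h_l_m)             # softmax(h_l^m W_y) is applied inside the loss
    # Negative log likelihood of y; minimizing it maximizes the L_2(C) term.
    return F.cross_entropy(logits, torch.tensor([label]))
```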

Since GPT only accepts token sequences of bounded length as inputs, training data has to be structured as sequences, with delimiters separating the different sections of the input. For example, for a question-answering task, the context and the answer choices are joined with delimiter tokens before being fed to the model.

To fine-tune GPT on downstream tasks, inputs in datasets have to be transformed into sequence-like structures.

If a model is trained on enough data, does it need fine-tuning at all? GPT-2 brought up the idea of “zero-shot” learning, which was a novel practice at the time.

GPT-2

The GPT-2 [4] model architecture is the same as GPT’s: Transformers. Aside from the differences below, it is basically a larger version of GPT:

  1. GPT-2 was trained on a much larger dataset. The GPT model was trained on the BooksCorpus [9] dataset. For GPT-2, they scraped the Internet and created a dataset called WebText, which contains 8 million documents totaling 40 GB of text.
  2. In GPT-2, layer normalization was moved to the beginning, rather than the end, of each sub-block (see the schematic sketch after this list).
  3. An additional layer normalization was added after the final self-attention block.
  4. Context size was increased from 512 to 1024 tokens.
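For point 2, the difference between the two normalization placements can be sketched schematically as follows (causal masking and other details are omitted; this only shows where the layer norm sits):

```python
import torch
import torch.nn as nn

d = 768
ln1, ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
attn = nn.MultiheadAttention(d, 12, batch_first=True)
mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

def post_norm_block(x):
    # GPT: layer norm applied at the end of each sub-block, after the residual add.
    x = ln1(x + attn(x, x, x)[0])
    return ln2(x + mlp(x))

def pre_norm_block(x):
    # GPT-2: layer norm moved to the beginning of each sub-block.
    x = x + attn(ln1(x), ln1(x), ln1(x))[0]
    return x + mlp(ln2(x))

x = torch.randn(1, 10, d)
print(post_norm_block(x).shape, pre_norm_block(x).shape)   # both (1, 10, 768)
```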

These changes played very important roles in improving model performance. But the most significant paradigm shift from GPT to GPT-2 is zero-shot. The authors realized that, with enough training data, the model can learn various tasks directly from the dataset, without any explicit supervision (hence the title “Language Models are Unsupervised Multitask Learners”).

Supervised learning and unsupervised (generative) learning are not a dichotomy. On the one hand, generative learning is disguised supervised learning: generative training on text sequences can be seen as repeatedly applying supervised learning to predict a label (the next word) given some text input (the context). On the other hand, some supervised learning tasks can be formulated as generative learning tasks. For example, as mentioned in [4], for translation, instead of training on (English, French) input-output pairs, a generative model can be trained on (translate to French, English text, French text) sequences. If the model possesses enough in-context learning ability, and if such tasks appear naturally and abundantly in the training corpus, which is the case, then we can expect the model to possess at least some language translation ability after training. The two are different ways to formulate the same thing: make predictions given inputs.

An example in the GPT-2 paper showing naturally occurring language translation pairs found in the training dataset

However, generative training on vast amounts of unlabeled data has advantages over supervised training on well-prepared labeled data $(x,y)$. Preparing such labeled data is expensive, labor-intensive and time-consuming, and hard to scale up. Consider a thought experiment: suppose supervised learning extracts 80% of the information in labeled data, while unsupervised learning extracts only 20% of the information in unlabeled data. Even though labeled data provides greater learning efficiency, if the amount of unlabeled data greatly surpasses the amount of labeled data, an unsupervised model will eventually become more powerful than a supervised one trained on far less data. What’s more, a generative model trained on vast and diverse sources of text can pick up multiple abilities from the data, while a supervised one can only perform the one specific task defined by its training dataset $\{(x_i,y_i)\}_{i=1}^{N}$.
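To make the thought experiment concrete (the numbers are purely illustrative), compare $10^5$ labeled examples against $10^9$ unlabeled ones:

$$ 0.8 \times 10^5 = 8 \times 10^4 \ \text{effective labeled examples}, \qquad 0.2 \times 10^9 = 2 \times 10^8 \ \text{effective unlabeled examples}, $$

a factor of 2,500 in favor of the unsupervised model, despite its lower per-example efficiency.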

GPT-2 is the starting point from which we would later see the amazing performance of generative AI. As the model size gets larger, performance keeps improving. There is no reason to stop training ever-larger models.

As model size gets larger, performance gets better.

GPT-3

GPT-3 [5] largely shares the same architecture as GPT-2, but is significantly larger, with 175B parameters. For reference, here is a short description from the paper:

We use the same model and architecture as GPT-2 …… with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer.

The most salient aspect of the GPT-3 paper is that an extensive series of experiments were conducted, and the GPT-3 model achieved strong performance across multiple tasks. The experiments strongly suggest that scaling up model size is the way to obtain better performance.

There is no end in scaling

GPT-3 can write stories and news articles. It was quite impressive at the time and caught some public attention. But it was still not something anyone could directly interact with, so many people remained unaware of it.

InstructGPT

GPT-3 is optimized to produce web-like documents. If you give the model an instruction like “explain … to me”, it will not know how to respond to that particular instruction; instead, it is likely to produce something that looks like the articles in the training data containing similar sentences. In OpenAI’s terms, the model is misaligned.

InstructGPT [6] addresses this problem with supervised fine-tuning followed by reinforcement learning. Below is an overview of the approach from the paper. Step 1 is to fine-tune on a high-quality, human-crafted instruction-response dataset. Step 2 is to train a reward model, and step 3 is to train the model to maximize the reward model’s output. The last two steps are called Reinforcement Learning from Human Feedback (RLHF), which proved to be an effective way of aligning the model with human instructions. The setup resembles a GAN, in which a generative model is trained to maximize the score given by a discriminative model. In a GAN the two models are trained concurrently; here in InstructGPT the two models are trained separately.

InstructGPT. (1) A pre-trained LM is fine-tuned using supervised learning on a human-written dataset. (2) Collect human rankings of model outputs and train a discriminative model on this dataset. (3) Train the fine-tuned LM so that its generated responses to prompts maximize the discriminative model’s output.

Here are the methods in detail:

  1. First, a pre-trained GPT-3 model is fine-tuned on a prompt-response dataset created by 40 labelers. The dataset contains about 13k training prompts. The fine-tuned model is called SFT (which stands for “Supervised Fine-Tuning”).

  2. Second, given a prompt, sample several outputs (from $K=4$ to $K=9$) from the model. A human labeler then ranks the outputs. Collect this human-ranking data into a dataset, and train a reward model (RM) on it. In the InstructGPT paper [6], the RM is a GPT-3 model with 6B parameters. The training objective is the pairwise ranking loss

$$ \text{loss}(\theta) = -\frac{1}{\binom{K}{2}} E_{(x, y_w, y_l)\sim D}\left[\log\left(\sigma\left(r_\theta(x,y_w) - r_\theta(x, y_l)\right)\right)\right] $$

where $r_\theta(x, y)$ is the scalar output of the reward model for prompt $x$ and completion $y$ with parameters $\theta$, $y_w$ is the preferred completion out of the pair of $y_w$ and $y_l$, and $D$ is the dataset of human comparisons. The RM dataset has 33k training prompts.
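A minimal PyTorch sketch of this pairwise ranking loss, assuming a hypothetical `reward_model(prompt, completion)` that returns a scalar reward per example:

```python
import torch.nn.functional as F

def rm_pairwise_loss(reward_model, prompt, better, worse):
    # r_theta(x, y_w) and r_theta(x, y_l): scalar rewards for the two completions.
    r_w = reward_model(prompt, better)
    r_l = reward_model(prompt, worse)
    # -log(sigmoid(r_w - r_l)); averaging over all pairs drawn from the same
    # prompt's K completions accounts for the 1 / C(K, 2) factor.
    return -F.logsigmoid(r_w - r_l).mean()
```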

  3. Finally, fine-tune the SFT model again, such that its outputs receive high reward scores from the RM. This uses Proximal Policy Optimization (PPO) in the InstructGPT paper [6], which amounts to maximizing the following objective:

$$ \text{objective}(\phi) = E_{(x,y)\sim D_{\pi_\phi^{RL}}}[r_\theta(x,y) - \beta\log\left(\pi_\phi^{RL}(y\mid x) / \pi^{SFT}(y\mid x)\right)] + \gamma E_{x\sim D_{\text{pretrain}}}[\log(\pi_\phi^{RL}(x))]. $$

Here, $\pi^{SFT}$ is the fine-tuned model from step 1, which is kept fixed, and $\pi_{\phi}^{RL}$ (called the “RL policy”) is a copy of $\pi^{SFT}$ whose parameters are updated during training. $x$ is the prompt, and $y$ is the response generated by the model.

The first term is the main objective we want to maximize. Note that sampling comes from the model being trained, rather than from a fixed dataset; after each gradient update, the sampling distribution changes.

The second term (negative KL divergence between distribution $\pi_\phi^{RL}(y\mid x)$ and $\pi^{SFT}(y\mid x)$) is a regularization term that penalizes deviation from the fine-tuned model from step 1. This is added to avoid losing all the information stored in the weights of the fine-tuned model.

The third term is also a regularization term. It is the language-modeling objective (maximizing the log likelihood of the next word). It is added to prevent the model from focusing too much on increasing $r_\theta(x,y)$ and forgetting about next-word generation.
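Putting the three terms together, the quantity being maximized looks roughly like the sketch below. The helpers `reward_model`, `log_prob` (log probability of a response given a prompt), and `seq_log_prob` (log probability of a whole sequence) are hypothetical stand-ins, and real PPO training adds clipping, value estimates, and per-token bookkeeping not shown here:

```python
def rl_objective(policy, sft_model, reward_model, x, y,
                 pretrain_seq, log_prob, seq_log_prob, beta, gamma):
    # Term 1: RM score r_theta(x, y) for a response y sampled from the policy.
    reward = reward_model(x, y)

    # Term 2: log(pi_RL(y|x) / pi_SFT(y|x)), penalizing drift from the SFT model.
    kl_penalty = log_prob(policy, x, y) - log_prob(sft_model, x, y)

    # Term 3: the plain language-modeling objective on a pretraining sequence.
    lm_term = seq_log_prob(policy, pretrain_seq)

    # Maximize this (in practice, minimize its negative with gradient descent).
    return reward - beta * kl_penalty + gamma * lm_term
```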

ChatGPT

ChatGPT was released on November 30, 2022, and soon caused a technology revolution around the world. Numerous applications like copy.ai and typeface.ai quickly emerged. I was impressed by how LLMs can dramatically boost human productivity. The point is not whether ChatGPT never makes mistakes and always gives accurate answers. Rather, the point is that it showed us how huge an impact AI can have on society.

Regarding model details, at this point OpenAI is no longer open. It did not publish a research paper on ChatGPT; we can only infer information about the model from its blog:

Methods

We trained this model using Reinforcement Learning from Human Feedback (RLHF), using the same methods as InstructGPT, but with slight differences in the data collection setup. We trained an initial model using supervised fine-tuning: human AI trainers provided conversations in which they played both sides—the user and an AI assistant. We gave the trainers access to model-written suggestions to help them compose their responses. We mixed this new dialogue dataset with the InstructGPT dataset, which we transformed into a dialogue format.

To create a reward model for reinforcement learning, we needed to collect comparison data, which consisted of two or more model responses ranked by quality. To collect this data, we took conversations that AI trainers had with the chatbot. We randomly selected a model-written message, sampled several alternative completions, and had AI trainers rank them. Using these reward models, we can fine-tune the model using Proximal Policy Optimization. We performed several iterations of this process.

— OpenAI Blog

From this description, we can see that ChatGPT is very similar to InstructGPT. The difference is that long dialogues are used as the training dataset, instead of just prompt-and-response pairs. When humans write the dialogues, they can use the model’s responses as a starting point. Trained on this dataset, the model can better understand the context of long dialogues and produce answers relevant to that context. This is part of the reason why ChatGPT seems to have an impressive ability to remember your previous conversations.

GPT-4

GPT-4 was released on March 14, 2023. GPT-4 differs from previous GPT models in that it is multi-modal: it accepts both images and text as inputs and produces text as outputs. Little is known about its full architecture, but it is probably not much different from GPT-3 and InstructGPT. Here is the only relevant description from the GPT-4 technical report [7]:

GPT-4 is a Transformer-style model pre-trained to predict the next token in a document, using both publicly available data (such as internet data) and data licensed from third-party providers. The model was then fine-tuned using Reinforcement Learning from Human Feedback (RLHF). Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.

— GPT-4 Technical Report, OpenAI

However, one marvelous thing we are able to see is that they developed infrastructure to reliably predict model performance. They could predict the loss of a very large model from experiments at much smaller scales, allowing them to quickly try many designs and find the best one. A major criticism of deep learning is its black-box nature; now OpenAI has mastered the alchemy of training deep neural networks. To reach this point, they must have run far more experiments than any other organization or institution. No doubt they are the biggest winner so far!

A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4’s performance based on models trained with no more than 1/1,000th the compute of GPT-4.

— GPT-4 Technical Report, OpenAI

Resources

🌀 Andrej Karpathy has a very good video tutorial on building GPT from scratch; I highly recommend it:

  • Colab of this tutorial: link🔗
  • nanoGPT repository: nanoGPT🔗

🌀 LLaMA [8] is an open-source alternative to GPT, proposed by Facebook. Lit-LLaMA🔗 is an implementation of LLaMA by Lightning-AI.

🌀 Huggingface Transformers is a Python library for downloading and using pre-trained models. GPT-2 is available there, but not GPT-3 or beyond.

🌀 Replicate provides APIs to use many vision models.

Intuitions behind LLMs

Why does deep learning work so well this time? I’d like to share some of my intuitions about LLMs.

What if the data covers 95% of everything

If there is only one example of something (e.g. a question-answer pair) in the training data, it could be drowned out by other data, so either the model won’t remember this example, or it could overfit on it. However, if there are ten, a hundred or even more answers to the same question $q$ in the training data, all with similar phrasings, similar terminology, and pointing to similar things, then the model will develop a strong “memory” of it when trained over and over to maximize log likelihood. When the model is asked this question, it has a good idea of how to output an answer with a high likelihood score: it can generate an answer from the “average” of all the answers in the training data, and the result will not be far from correct.

I suspect that most questions submitted to ChatGPT are common ones, with relevant answers already on the Internet. It’s as if, for a given question, Google would return 100 relevant pages, while ChatGPT returns one answer that is like the average of those 100 pages. And for uncommon questions, or questions deemed harmful, it can simply return “I don’t know” or “I can’t answer that”. So the magic is not all about AI, but more about the data.

It is not about finding needles in the ocean

Let’s think about next-word prediction. Locally, given a small scope of context, it is often not difficult to infer what the next word is. A large portion of the vocabulary can be eliminated from consideration. The probability distribution concentrates on a small set of words, and the model only needs to learn these simple distributions. From this perspective, it is not too difficult for a model to learn grammar and produce normal-looking sentences. Words in sentences are not totally independent, but locally they can often be broken up into independent pieces.

And the Transformer on GPUs is REALLY good at scaling up this simple idea of guessing the next word from the previous words. Inferring the next word from a 4,000-word context is no different from inferring it from a 5-word context: the architecture is the same; we just need larger matrices and more matrix multiplications, which GPUs handle efficiently. So one is trading compute power for search time. The problem of finding needles in the ocean is converted into massive matrix multiplications on the GPU. The Transformer model efficiently addresses the curse-of-dimensionality problem.
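A small NumPy illustration of that point: the attention computation is the same code whether the context has 5 words or 4,000; only the matrix sizes change (the dimensions here are arbitrary):

```python
import numpy as np

def attention(U, W_q, W_k, W_v):
    # Identical computation regardless of context length: just bigger matrices.
    q, k, v = U @ W_q, U @ W_k, U @ W_v
    scores = q @ k.T / np.sqrt(U.shape[1])
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ v

d = 64
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
short_ctx = np.random.randn(5, d)       # 5-word context
long_ctx = np.random.randn(4000, d)     # 4,000-word context
print(attention(short_ctx, W_q, W_k, W_v).shape)   # (5, 64)
print(attention(long_ctx, W_q, W_k, W_v).shape)    # (4000, 64)
```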

The success of ChatGPT is a combination of several factors:

  1. Unique features of human languages.
  2. Progress in model architectures [10], regularization [11], and optimization algorithms [12], allowing for stable training and better generalization.
  3. The Transformer is efficient and can be scaled up, trading compute for time.
  4. Millions of dollars invested in data collection.

Of course, ChatGPT is not perfect and can still make simple mistakes. Since logical reasoning is not part of the optimization objective, it does not possess sound logic. Trained only on text data, it also lacks perception of the 3D world, but I speculate that progress will soon be made. The variety of tasks it can do is already impressive, and its boost to human productivity is undeniable. I’m sure AI models like ChatGPT will become an essential part of society in the future, just like our cars and phones today.

Predicting the future: impacts on society

It looks like OpenAI is very ambitious. Its goal is not limited to developing large NLP models for dialogue. Rather, it aims to build an entire ecosystem around ChatGPT, where developers and companies build products on top of its services. Regardless of whether ChatGPT becomes the one super app we all use in the future, it is clear that the dynamics of the tech industry have shifted. The way we work and live will be different. It will take time, but eventually we will get there.

Many white-collar jobs will disappear. Many jobs today are “bullshit jobs”: highly repetitive, adding little value to society. People are forced to sit in front of computer screens for 8+ hours a day and produce tons of garbage.

One such job is desktop research, which involves collecting information from various sources on the Internet and producing reports to sell to clients. ChatGPT is much better at collecting information from the Internet. With plugins, companies can feed their internal documents to an AI model, which can then retrieve information, answer questions, and give advice. In the future, such intelligence-gathering services will become far more affordable.

Another job that is probably in danger is data analyst. Basically, what a typical data analyst does is write SQL queries to extract data from databases and perform data visualization and analytics, e.g. with Python. This job is highly repetitive. Now AI can handle all of that: it can write SQL queries, and it can plot graphs. In the future, if you want insights from data, you might just ask the AI in plain English, and it will retrieve the data, analyze it, and generate a report for you. The job may be fully automated. There will be no need to hire someone who remembers all the matplotlib/pandas commands and tricks anymore.

There are more. Every job with “analyst” in its title should be examined against the widespread adoption of AI. One should be cautious about jobs that are not creative but merely repetitive. I think entry-level jobs in the following categories are going to be threatened by AI. Even if job replacement does not happen very soon, the presence of a productive AI assistant means the difficulty and the skill set required for these jobs are greatly reduced, so wages will fall lower and lower until the jobs disappear altogether.

  • Media jobs (advertising, news editing, journalism)
  • Consultants
  • Lawyers
  • Graphic designers
  • Data analysts
  • Financial analysts
  • Policy analysts
  • Accountants and auditors

Replacing repetitive jobs with machines is beneficial for the whole society, because we can stop wasting resources, focus on innovation, produce more, and ultimately improve everyone’s welfare.

Developers now have more power than ever before. Some argue that programming jobs are going to be replaced by AI. That is not accurate. Repetitive work has no future in any profession, but entrepreneurship will never cease to prosper. AI has made building applications easier, which means developers can ship products faster and at lower cost. Currently, building a serious full-stack application and delivering it to the market is difficult and time-consuming. It is team work that requires talent in design, frontend, backend, marketing, legal, and more. Now, with AI that can assist with writing code, designing web pages and logos, drafting marketing campaigns, and composing terms of service, you can focus on realizing your ideas and building great products rather than on those business hassles. In the future, it may be possible for a single individual to ship complicated applications entirely alone. Developers will no longer need to rely on big companies for a living. They can become CEOs of their own. Isn’t that great?

ChatGPT’s success will further spur AI research. One research direction is improving model efficiency. Current models are bulky and expensive to train, much like computers in the 50s and 60s. Maybe we can figure out the exact relationship between data, architecture, and generalization performance; then we could optimize the desired objectives with minimal data and minimal compute. Another research direction is video generation. Current AI models can generate impressive text and images, but models that generate videos have not been widely adopted. Video data is much higher-dimensional, and understanding 3D dynamics is challenging. But I expect significant progress in the near future. Like ChatGPT, the next breakthrough is likely to come from industry rather than academia.

Conclusion

In this blog post we discussed the technology behind OpenAI’s ChatGPT, as well as its potential impact on society. GPT is a Transformer model trained to predict the next word given a long context. Transformers express context information as matrix multiplications that can be computed efficiently on GPUs. ChatGPT is a pre-trained GPT model fine-tuned on human-labeled dialogue datasets. It can greatly boost human productivity, and it is thus a major technological advance that is going to bring huge changes to society.

References

[1] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” arXiv preprint arXiv:1409.0473 (2014).

[2] Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017).

[3] Radford, Alec, et al. “Improving language understanding by generative pre-training.” OpenAI (2018).

[4] Radford, Alec, et al. “Language models are unsupervised multitask learners.” OpenAI blog 1.8 (2019): 9.

[5] Brown, Tom, et al. “Language models are few-shot learners.” Advances in neural information processing systems 33 (2020): 1877-1901.

[6] Ouyang, Long, et al. “Training language models to follow instructions with human feedback.” arXiv preprint arXiv:2203.02155 (2022).

[7] OpenAI. “GPT-4 Technical Report.” arXiv preprint arXiv:2303.08774 (2023).

[8] Touvron, Hugo, et al. “Llama: Open and efficient foundation language models.” arXiv preprint arXiv:2302.13971 (2023).

[9] Zhu, Yukun, et al. “Aligning books and movies: Towards story-like visual explanations by watching movies and reading books.” Proceedings of the IEEE international conference on computer vision (2015): 19-27.

[10] He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition (2016).

[11] Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. “Layer normalization.” arXiv preprint arXiv:1607.06450 (2016).

[12] Kingma, Diederik P., and Jimmy Ba. “Adam: A method for stochastic optimization.” arXiv preprint arXiv:1412.6980 (2014).