Exclusive | How ChatGPT Works: The Models Behind the Bot


by Molly Ruby
Translated by: Zhang Ruiyi
Proofreading: Yan Xiaoyu


This article is about 3,000 words and takes roughly 5 minutes to read. It briefly introduces the intuition and methodology behind the chatbot you can't stop hearing about.

This article gives a brief introduction to the machine learning models behind ChatGPT. It starts with large language models (LLMs), then discusses the revolutionary self-attention mechanism that made it possible to train GPT-3, and finally examines Reinforcement Learning from Human Feedback (RLHF), the innovative technique that sets ChatGPT apart.


Large language models

ChatGPT is an extrapolation of a class of machine learning natural language processing models known as Large Language Models (LLMs). LLMs digest huge quantities of text data and infer relationships between the words in that text. Significant advances in computing power over the past few years have enabled the development of these models, whose performance increases as the size of their input datasets and parameter spaces grows.

The most basic training of a language model involves predicting a word in a sequence of words. The most common approaches to this are next-token prediction and masked language modeling (MLM).


Figure 1: Author’s example of next token prediction and masked language modeling (MLM)
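To make the two training objectives concrete, here is a minimal, purely illustrative Python sketch (a toy example, not actual GPT training code) contrasting next-token prediction with masked language modeling on a short token sequence:

```python
import random

# Toy token sequence; real models operate on sub-word tokens over huge corpora.
tokens = ["Jacob", "hates", "to", "read"]

# Next-token prediction: the model sees a prefix and must predict the next token.
for i in range(1, len(tokens)):
    context, target = tokens[:i], tokens[i]
    print(f"context = {context} -> predict {target!r}")

# Masked language modeling: one token is hidden and predicted from both sides.
masked_index = random.randrange(len(tokens))
masked = tokens.copy()
masked[masked_index] = "[MASK]"
print(f"input = {masked} -> predict {tokens[masked_index]!r}")
```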

Sequence modeling of this kind is often done with long short-term memory (LSTM) models, which fill in the blank with the statistically most probable word given the surrounding context. This sequential modeling structure has two main limitations:

1. The model cannot weight some of the surrounding words more heavily than others. In the example above, while “read” may most commonly be associated with “hate”, the data might show that “Jacob” is an avid reader; in that case the model should give more weight to “Jacob” than to “read” and choose “love” instead of “hate”.

2. The input data is processed sequentially, one step at a time, rather than analyzed as a whole. This means that when an LSTM is trained, the window of context is fixed and extends only a limited number of steps beyond each individual input. This limits the complexity of the relationships between words, and the meanings, that can be derived.

In response to these problems, a team at Google Brain introduced the Transformer model in 2017. Unlike LSTMs, a Transformer can process all of the input data simultaneously. Using a self-attention mechanism, it can assign different attention scores to different parts of the input, no matter where they occur in the sentence. This feature is what breathed life into modern LLMs, enabling them to capture richer meaning and to be trained on much larger datasets.

GPT and Self-Attention

The generative pre-trained Transformer (GPT) model was first introduced by OpenAI in 2018 as GPT-1. The family then evolved into GPT-2 in 2019, GPT-3 in 2020, and most recently InstructGPT and ChatGPT in 2022. Before human feedback was integrated into the system, the biggest advances in the evolution of the GPT models were driven by gains in computational efficiency, which allowed GPT-3 to be trained on far more data than GPT-2, giving it a more diverse knowledge base and the ability to perform a wider range of tasks.

f08c9d1794fa1048000362a82c5595e8.png

Figure 2: The author’s comparison of GPT-2 (left) and GPT-3 (right)

All GPT models build on the Transformer architecture. In its original form, the Transformer has an encoder to process the input sequence and a decoder to generate the output sequence (GPT itself uses a decoder-style stack). Both the encoder and the decoder contain a multi-head self-attention mechanism that allows the model to weight different parts of the sequence differently in order to infer meaning and context. In addition, the encoder uses masked language modeling (MLM) to learn the relationships between words and produce more comprehensible responses.

The self-attention mechanism that powers GPT works by converting tokens (pieces of text, which can be a word, a sentence, or another grouping of text) into vectors that represent the importance of each token in the input sequence. The model does this in four steps (a toy sketch follows the list below):

1. Create three vectors for each token in the input sequence: “query”, “key”, and “value”.

2. Calculate a similarity score between the “query” vector from step 1 and the “key” vector of every other token by taking the dot product of the two vectors.

3. Generate normalized weights by passing the output of step 2 into the softmax function.

4. Multiply the weights generated in step 3 by each token’s “value” vector to produce a final vector representing the token’s importance within the sequence.
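The following is a minimal single-head self-attention sketch in NumPy that follows the four steps above. The dimensions and random weights are toy values chosen for illustration, not the ones GPT actually uses:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8        # 4 tokens, 8-dimensional embeddings
X = rng.normal(size=(seq_len, d_model))   # token embeddings of the input sequence

# Step 1: project each token into "query", "key" and "value" vectors.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Step 2: similarity of each query with every key (scaled dot product).
scores = Q @ K.T / np.sqrt(d_head)

# Step 3: softmax turns the scores into normalized attention weights.
weights = softmax(scores, axis=-1)

# Step 4: weighted sum of the value vectors gives each token's output vector.
output = weights @ V
print(output.shape)   # (4, 8): one context-aware vector per token
```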

The “multi-head” attention used by GPT is an evolution of this self-attention mechanism. Rather than performing steps 1 through 4 once, the model runs the mechanism several times in parallel, each time generating new linear projections of the “query”, “key”, and “value” vectors. By expanding self-attention in this way, the model is able to grasp more complex relationships and sub-meanings within the input data.
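Continuing the toy example, a multi-head version simply runs the same computation several times with independent projections and concatenates the results (again an illustrative sketch, not GPT's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    # Steps 1-4 from above, for a single head.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(W_q.shape[1]), axis=-1)
    return weights @ V

rng = np.random.default_rng(1)
seq_len, d_model, n_heads, d_head = 4, 8, 2, 4   # d_model = n_heads * d_head
X = rng.normal(size=(seq_len, d_model))

# Each head has its own query/key/value projections, so it can attend to a
# different kind of relationship between tokens.
heads = []
for _ in range(n_heads):
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention_head(X, W_q, W_k, W_v))

# Concatenating the per-head outputs restores the model dimension.
multi_head_output = np.concatenate(heads, axis=-1)
print(multi_head_output.shape)   # (4, 8)
```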


Figure 3: Screenshot generated by the author from ChatGPT.

Although GPT-3 made remarkable progress in natural language processing, it is limited in its ability to align with user intentions. For example, GPT-3 may produce outputs that:

  • Lack helpfulness, meaning they do not follow the user’s explicit instructions.

  • Contain hallucinations, reflecting facts that do not exist or are incorrect.

  • Lack interpretability, making it difficult to understand how the model arrived at a particular decision or prediction.

  • Are toxic or biased, containing harmful or offensive content and spreading misinformation.

ChatGPT introduces a novel training methodology to address some of these inherent problems of standard LLMs.

ChatGPT

ChatGPT, a derivative of InstructGPT, introduces a new way of incorporating human feedback into the training process to better align model outputs with user intent. Reinforcement Learning from Human Feedback (RLHF) is described in depth in OpenAI’s 2022 paper Training language models to follow instructions with human feedback, and is summarized below.

Step 1: Supervised fine-tuning (SFT) model

The first step of development was fine-tuning the GPT-3 model by hiring 40 contractors to create a supervised training dataset in which every input has a known output for the model to learn from. The inputs, or prompts, were collected from actual user entries to the OpenAI API. The labelers then wrote an appropriate response to each prompt, creating a known output for every input. The GPT-3 model was then fine-tuned on this new supervised dataset to create GPT-3.5, also known as the SFT model.

To maximize the diversity of the prompt dataset, only 200 prompts could come from any given user ID, and prompts that shared long common prefixes were removed. Finally, all prompts containing personally identifiable information (PII) were removed.
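As an illustration only, the filtering rules described above might look roughly like the following sketch (the helper names and thresholds here are hypothetical, not OpenAI's actual pipeline):

```python
from os.path import commonprefix

def filter_prompts(prompts_by_user, max_per_user=200, prefix_limit=30, contains_pii=None):
    """prompts_by_user: dict mapping user ID -> list of prompt strings.
    contains_pii: optional callable that flags prompts with personal data."""
    kept = []
    for prompts in prompts_by_user.values():
        selected = []
        for prompt in prompts[:max_per_user]:           # cap of 200 prompts per user ID
            if contains_pii and contains_pii(prompt):   # drop prompts containing PII
                continue
            # Drop prompts sharing a long common prefix with one already kept.
            if any(len(commonprefix([prompt, p])) > prefix_limit for p in selected):
                continue
            selected.append(prompt)
        kept.extend(selected)
    return kept
```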

After aggregating the prompts from the OpenAI API, the labelers were also asked to write sample prompts by hand for the categories with very little real example data, in order to enrich the prompt dataset. The categories included:

  • Plain prompts: any arbitrary questions.

  • Few-shot prompts: instructions that contain multiple query/response pairs. (Note: equivalent to writing several sample questions for a given question type.)

  • User-based prompts: prompts in which the user provides examples or instructions to guide the model toward a specific kind of output.

When generating responses, the labelers were asked to do their best to infer what the user’s instruction was. The paper describes the three main ways in which prompts request information:

1. Direct: “Tell me about…”

2. Few-shot: given two examples of stories on a topic, write another story about the same topic.

3. Continuation: Give the beginning of a story and then finish it.

Compiling the prompts from the OpenAI API with the handwritten prompts from the labelers yielded 13,000 input/output samples for training the supervised model.


Figure 4: The picture (left) is from the paper Training language models to follow instructions with human feedback published by OpenAI in 2022. Red lettering (right) is additional content added by the author.

Step 2: Reward Model

After the SFT model is trained in step 1, it produces responses that are better aligned with user prompts. The next refinement comes from training a reward model, whose input is a sequence consisting of a prompt and a response, and whose output is a scalar value called the “reward”. The reward model is what makes the reinforcement learning step possible: it gives the model a signal it can learn to maximize when producing outputs (see step 3).

To train the reward model, labelers are shown 4 to 9 outputs from the SFT model for a single input prompt. They are asked to rank these outputs from best to worst, creating combinations of output rankings as shown below.


Figure 5: Examples of authors’ ranking combinations of responses.

Including every combination in the model as a separate data point led to overfitting (a failure to generalize beyond the seen data). To address this, the model was built to treat each group of rankings as a single batch of data points.
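A minimal sketch of what this batched comparison training implies is the pairwise ranking loss below. The function names and numbers are illustrative assumptions, not OpenAI's code; the idea is that for every pair of responses ranked by a labeler, the preferred response should receive a higher scalar reward:

```python
import itertools
import math

def pairwise_ranking_loss(rewards_best_to_worst):
    """rewards_best_to_worst: scalar rewards the model assigned to K responses
    to the same prompt, ordered by the labeler from best to worst."""
    pairs = list(itertools.combinations(range(len(rewards_best_to_worst)), 2))
    loss = 0.0
    for better, worse in pairs:
        margin = rewards_best_to_worst[better] - rewards_best_to_worst[worse]
        # -log(sigmoid(margin)) is small when the preferred response already
        # receives a clearly higher reward than the less-preferred one.
        loss += -math.log(1.0 / (1.0 + math.exp(-margin)))
    return loss / len(pairs)   # averaged over all C(K, 2) comparisons in the batch

print(pairwise_ranking_loss([2.0, 0.5, -1.0]))   # K = 3 responses -> 3 pairs
```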


Figure 6: The picture (left) is from the paper Training language models to follow instructions with human feedback published by OpenAI in 2022. Red lettering (right) is additional content added by the author.

Step 3: Reinforcement Learning Model

In the final stage, the model is presented with a random prompt and returns a response. The response is generated using the “policy” the model has learned, which represents the strategy the machine uses to achieve its goal: maximizing its reward. Based on the reward model developed in step 2, a scalar reward value is calculated for the prompt-and-response pair, and that reward is fed back into the model to update the policy.

In 2017, Schulman et al. introduced Proximal Policy Optimization (PPO), the method used to update the model’s policy as each response is generated. PPO incorporates a per-token Kullback-Leibler (KL) penalty relative to the SFT model. KL divergence measures how similar two distributions are and penalizes large differences. Here, the KL penalty limits how far the responses of the RL policy can drift from the outputs of the SFT model trained in step 1, to avoid over-optimizing against the reward model and deviating too far from the human-intention dataset.
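The effect of the KL penalty can be sketched as follows. The names and numbers are hypothetical, but the idea matches the description above: the reward the policy optimizes combines the reward model's score with a penalty for drifting away from the SFT model's distribution:

```python
def penalized_reward(rm_score, policy_logprobs, sft_logprobs, beta=0.02):
    """rm_score: scalar output of the reward model for a (prompt, response) pair.
    policy_logprobs / sft_logprobs: per-token log-probabilities assigned to the
    generated response by the RL policy and the frozen SFT model."""
    # Per-token KL estimate: log pi_policy(token) - log pi_sft(token), summed.
    kl = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return rm_score - beta * kl   # a larger drift from SFT lowers the reward

# Toy example: the policy makes its tokens slightly more likely than SFT did.
print(penalized_reward(1.3, [-0.9, -1.1, -0.4], [-1.0, -1.2, -0.6]))
```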


Figure 7: The picture (left) is from the paper Training language models to follow instructions with human feedback published by OpenAI in 2022. Red lettering (right) is additional content added by the author.

The second and third steps of the process can be iterated repeatedly, but this has not been done extensively in practice.


Figure 8: Screenshot generated by the author from ChatGPT.

Model Evaluation

Evaluation of the model is performed on a test set that the model never saw during training. A series of evaluations on this test set determine whether the model is better aligned than its predecessor, GPT-3, at producing appropriate responses.

Helpfulness: the model’s ability to infer and follow user instructions. Labelers preferred InstructGPT outputs over GPT-3 outputs 85 ± 3% of the time.

Truthfulness: the model’s tendency to hallucinate. When evaluated on the TruthfulQA dataset, the outputs of the PPO model showed small increases in truthfulness and informativeness.

Harmlessness: the model’s ability to avoid inappropriate, demeaning, and derogatory content, tested using the RealToxicityPrompts dataset. The test was performed under three conditions:

1. When instructed to provide friendly and respectful responses, the model produced significantly fewer toxic responses.

2. When instructed to provide responses with no guidance about respectfulness, there was no significant change in toxicity.

3. When instructed to provide toxic responses, the responses were in fact significantly more toxic than those of the GPT-3 model.

For more information on the methods used to create ChatGPT and InstructGPT, please read OpenAI’s original 2022 paper, Training language models to follow instructions with human feedback: https://arxiv.org/pdf/2203.02155.pdf.


Figure 9: Screenshot generated by the author from ChatGPT.

Happy learning!

Sources

1. https://openai.com/blog/chatgpt/

2. https://arxiv.org/pdf/2203.02155.pdf

3. https://deepai.org/machine-learning-glossary-and-terms/softmax-layer

4. https://www.assemblyai.com/blog/how-chatgpt-actually-works/

5. https://towardsdatascience.com/proximal-policy-optimization-ppo-explained-abed1952457b

Original Title:

How ChatGPT Works: The Model Behind The Bot

Original link:

https://towardsdatascience.com/how-chatgpt-works-the-models-behind-the-bot-1ce5fca96286

A brief introduction to the intuition and methodology behind the chat bot you can’t stop hearing about.

Editor: Yu Tengkai

Proofreading: Lin Yilin

Translator Profile


Hello everyone, I am Felix Zhang Ruiyi. You may have seen my translated articles before, and I hope that, with some innovations of my own, more and more readers will come to enjoy them. From a learning perspective, I also hope everyone can see the effect on English translation of the “Jespersen grammatical system” introduced by teachers Guan Weidong and Tang Tang. I can only speak for students who have taken up this grammar, but I am ready.

Translation Team Recruitment Information

Job description: translating selected foreign-language articles into fluent Chinese takes a careful mind. If you are an international student in data science, statistics, or computer science, are working in a related field overseas, or are simply confident in your foreign-language skills, you are welcome to join the translation team.

What you get: regular translation training to improve volunteers’ translation skills and their awareness of the data science frontier; overseas members can stay connected with the development of technology applications in China; and THU Data Science’s background in industry, education, and research brings good development opportunities for volunteers.

Other benefits: data science practitioners from well-known companies and students from top universities such as Peking University, Tsinghua University, and universities overseas will become your partners in the translation team.

Click “Read the original text” at the end of the article to join the Datapai team~

Reprint Notice

If you need to reprint this article, please indicate the author and source prominently at the beginning (Reprinted from: Datapai THU, ID: DatapiTHU) and place an eye-catching Datapai QR code at the end of the article. For articles carrying the original-content tag, please send [article title – name and ID of the official account to be authorized] to the contact email to apply for whitelist authorization, and edit as required.

After publication, please send the article link back to the contact email (see below). We will pursue legal responsibility for unauthorized reprints and adaptations.


Click “Read the original article” to embrace the organization
