Use GPT-3 Fine-tunes to train a dedicated language model

Article directory

- What is model fine-tuning?
- Why is model fine-tuning needed?
  - Fine-tuning vs retraining
  - Fine-tuning vs prompt design
- Train your own model
  - Data preparation
  - Clean data
  - Build model
  - Fine-tuning the model
  - Evaluation model
  - Deployment model
- Summary

What is model fine-tuning?

ChatGPT is pre-trained on massive amounts of public data from the Internet and can give general-purpose answers to almost any input. To make its answers more targeted, we can include examples in the input; through “few-shot learning”, ChatGPT infers the task we want it to complete and produces similar, reasonable outputs.

However, few-shot learning requires supplying examples with every request, which is inconvenient. Fine-tuning improves on few-shot learning by training on many more examples than fit in a prompt, so a fine-tuned model no longer needs examples in its input. This saves cost and lowers request latency.
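To make the overhead of few-shot prompting concrete, here is a minimal sketch of how such a prompt might be assembled. The `Q:`/`A:` layout, the function name `build_few_shot_prompt`, and the sample customer-service pairs are illustrative assumptions, not a format prescribed by OpenAI.

```python
def build_few_shot_prompt(examples, question):
    """Prepend worked examples to the question (hypothetical Q:/A: layout)."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Made-up example pairs, for illustration only.
examples = [
    ("How do I reset my password?", "Open Settings > Account and choose Reset password."),
    ("Where can I find my invoice?", "Invoices are listed under Billing > History."),
]

few_shot = build_few_shot_prompt(examples, "How do I change my email address?")
bare = build_few_shot_prompt([], "How do I change my email address?")

# Every request repeats all of the examples, so the prompt grows with each one.
print(len(few_shot), "characters with examples vs", len(bare), "without")
```

A fine-tuned model would ideally be called with something closer to the bare prompt, which is where the savings in tokens and latency come from.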

More importantly, in some professional scenarios the pre-trained model may not produce the desired output. In that case, we need to provide more specific, domain-relevant data to strengthen the model so that it answers questions in that field better.

In short, fine-tuning lets us adapt a large language model (LLM) to a custom dataset so that the model performs well in our specific task scenarios.

Why is model fine-tuning needed?

Fine-tuning vs retraining

It is necessary to distinguish between the concepts of fine-tuning and re-training.

Simply put, retraining is training a model from scratch with new data, while fine-tuning is adjusting the parameters of a previously trained model with new data.

For specific task scenarios, fine-tuning is faster and cheaper than retraining.

Retraining GPT-3 or ChatGPT from scratch is prohibitively expensive. A single GPT-3 training run is estimated to cost approximately US$1.4 million, while the larger ChatGPT model is estimated to cost approximately US$12 million per run. This does not include the cost of tens of thousands of A100 GPUs. At an assumed price of 50,000 per Nvidia A100 80 GB card, 10,000 cards alone require an initial outlay of 500 million. Few companies can afford such expenditures on software and hardware.

Fine-tuning GPT-3 also has a cost. Taking the most powerful base model, Davinci, as an example, training costs US$0.03 per 1,000 tokens, which is orders of magnitude less than retraining. The figure below shows the training and usage costs of fine-tuned models:

Therefore, for most companies today, fine-tuning GPT-3 is the only practical option. Apart from a few giants, most companies have neither the resources nor the capability to retrain.
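The gap between the two costs can be sketched with a little arithmetic. The snippet below uses the US$0.03 per 1,000 tokens rate and the US$1.4 million estimate quoted above; the dataset size, the default of 4 epochs, and the assumption that training is billed as tokens multiplied by epochs are illustrative and should be checked against current pricing.

```python
def fine_tune_training_cost(n_tokens, n_epochs=4, usd_per_1k_tokens=0.03):
    """Estimate training cost in USD, assuming billing = tokens x epochs."""
    return n_tokens * n_epochs / 1000 * usd_per_1k_tokens


# Hypothetical dataset: 1,000 examples of about 500 tokens each.
dataset_tokens = 1_000 * 500
cost = fine_tune_training_cost(dataset_tokens)
print(f"Estimated fine-tuning cost: ${cost:,.2f}")

# Estimate quoted in the text for one GPT-3 training run from scratch.
gpt3_training_run_usd = 1_400_000
print(f"One GPT-3 training run costs about {gpt3_training_run_usd / cost:,.0f}x as much")
```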

Fine-tuning vs prompt design

GPT-3 supports “few-shot learning”. We can improve the model’s output by including examples in the prompt, but the improvement is far smaller than what fine-tuning achieves. The following is a comparison of the effects of fine-tuning and prompt design:

Compared with prompt design, fine-tuning the model offers the following advantages:

- Higher-quality output
- Training on far more examples than can fit in a prompt
- Lower token consumption, which saves cost
- Lower request latency

Train your own model

There are six main steps to creating a fine-tuned model. To make them easier to follow, I will demonstrate the fine-tuning process with the Python code from our customer service bot customized on GPT-3.

Data preparation

The data format required for GPT-3 fine-tuning is JSONL, which has the following form:

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}

The format is easy to understand: each line contains a prompt and a completion, where the completion is the ideal text to generate for that prompt.
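As a minimal sketch of producing this format from data already held in memory, the snippet below serializes prompt/completion pairs one JSON object per line and reads them back. The helper names `to_jsonl` and `from_jsonl` and the sample pairs are hypothetical.

```python
import json


def to_jsonl(pairs):
    """Serialize (prompt, completion) pairs as one JSON object per line."""
    lines = [
        json.dumps({"prompt": p, "completion": c}, ensure_ascii=False)
        for p, c in pairs
    ]
    return "\n".join(lines) + "\n"


def from_jsonl(text):
    """Parse JSONL text back into a list of dictionaries."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# Made-up example pairs, for illustration only.
pairs = [
    ("How do I reset my password?", "Open Settings > Account and choose Reset password."),
    ("Where can I find my invoice?", "Invoices are listed under Billing > History."),
]
jsonl_text = to_jsonl(pairs)
print(jsonl_text)
records = from_jsonl(jsonl_text)
```

Using `json.dumps` rather than string formatting means quotes and newlines inside a prompt or completion are escaped, so each record stays on a single line.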

Data in everyday systems is generally not stored in JSONL format, so it must first be converted. OpenAI provides a command-line tool that converts common data formats into JSONL. The usage is as follows:

openai tools fine_tunes.prepare_data -f <LOCAL_FILE>

Here, <LOCAL_FILE> is the path to a local data file. CSV, TSV, XLSX, JSON, and JSONL formats are supported, as long as the file contains prompt and completion columns or keys.
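For readers who prefer to do the conversion themselves, the following is a rough sketch of turning CSV text with prompt and completion columns into JSONL. It is not the implementation of the OpenAI command-line tool, which also analyzes the data and suggests fixes; the function name `csv_to_jsonl` and the sample rows are hypothetical.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text):
    """Convert CSV text with prompt/completion columns to JSONL text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {"prompt", "completion"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    lines = [
        json.dumps({"prompt": row["prompt"], "completion": row["completion"]})
        for row in reader
    ]
    return "\n".join(lines) + "\n"


# Made-up sample data; the second row has a quoted field containing a comma.
sample_csv = (
    "prompt,completion\n"
    "How do I reset my password?,Open Settings and choose Reset password.\n"
    '"Where is my invoice, please?",Look under Billing > History.\n'
)
jsonl_text = csv_to_jsonl(sample_csv)
print(jsonl_text)
```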

A question that people often ask during the data preparation stage is “How much data do I need to prepare for fine-tuning?” Generally speaking, the more the better, but because fine-tuning incurs training costs, we need to strike a balance. OpenAI recommends providing at least 150-200 fine-tuning examples, but I have found that 150-200 examples are often not enough in real projects. I recommend starting with a few hundred to a thousand examples as an initial test, and then deciding whether to add more training data based on how the fine-tuned model performs.

GPT-3 supports continuous fine-tuning of custom models, so you can further fine-tune a previously fine-tuned model with new data at any time.

Clean data

Data quality is more critical than data quantity.

GPT-3 is essentially a large neural network that is a black box to us, so it is a typical case of “garbage in, garbage out”: the quality of the model’s output depends directly on the quality of the training data.

The higher the quality and diversity of the data, the better the model will work. A diverse set of examples is usually required to ensure that the model generalizes well to new examples. It’s a good idea to provide both positive and negative examples so that the model can handle a variety of inputs.

In order to verify the quality of the fine-tuned model, we usually split the data into a training set and a validation set, typically in a ratio of 80%/20%.
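A simple way to produce such a split is to shuffle the records and cut the list, as in this sketch. The function name `train_validation_split`, the fixed seed, and the generated records are assumptions made for illustration; a stratified split may be preferable for classification data.

```python
import random


def train_validation_split(records, validation_fraction=0.2, seed=42):
    """Shuffle a copy of the records and split it into two disjoint lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Placeholder records standing in for real prompt/completion pairs.
records = [{"prompt": f"question {i}", "completion": f"answer {i}"} for i in range(10)]
train, validation = train_validation_split(records)
print(len(train), "training examples,", len(validation), "validation examples")
```

Because both lists are slices of the same shuffled copy, no record appears in both sets as long as the input itself contains no duplicates.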

Note: In addition to being in JSONL format, training and validation data files must be UTF-8 encoded and contain a Byte Order Mark (BOM), and the file size cannot exceed 200MB.
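These requirements can be checked locally before uploading. The sketch below writes a file with Python's `utf-8-sig` codec, which prepends the BOM, and then verifies the BOM and the size. The helper names are hypothetical, and treating 200MB as 200 × 1024 × 1024 bytes is an assumption.

```python
import codecs
import os
import tempfile

MAX_BYTES = 200 * 1024 * 1024  # assumed interpretation of the 200MB limit


def write_with_bom(path, jsonl_text):
    """Write text as UTF-8 with a leading byte order mark."""
    with open(path, "w", encoding="utf-8-sig", newline="\n") as f:
        f.write(jsonl_text)


def check_file(path):
    """Return a list of problems found (empty if the file passes)."""
    problems = []
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file is larger than 200 MB")
    with open(path, "rb") as f:
        if f.read(3) != codecs.BOM_UTF8:
            problems.append("file does not start with a UTF-8 BOM")
    return problems


# Write a one-line sample file to a temporary directory and check it.
path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_with_bom(path, '{"prompt": "Hi", "completion": " Hello!"}\n')
print(check_file(path))
```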

Build model

Once the fine-tuning data is ready, we can fine-tune the model. Before training starts, we need to choose the base model.

Every fine-tuning job starts from a base model, which defaults to curie. The choice of base model affects both the performance of the fine-tuned model and the cost of fine-tuning. The base models that currently support fine-tuning are ada, babbage, curie, and davinci.

The following Python code creates the fine-tuning task:

import openai

openai.api_key = "YOUR_API_KEY"

def upload(path):
    # FineTune.create expects the ID of an uploaded file rather than a
    # local path, so upload the JSONL file first and return its ID
    with open(path, "rb") as f:
        return openai.File.create(file=f, purpose="fine-tune")["id"]

resp = openai.FineTune.create(training_file=upload("training_file_path"),
                              validation_file=upload("validation_file_path"),
                              model="davinci")
job_id = resp["id"]
status = resp["status"]
print(f'Fine-tuning task ID: {job_id}, status: {status}\n')

The above code selects davinci as the base model, uploads the local training set and validation set files, and passes the resulting file IDs to create a fine-tuning task. If creation succeeds, the ID and status of the fine-tuning task are returned.

Creating a fine-tuning task also supports other parameters, which are described below:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| training_file | string | | Training set file, which must be in JSONL format. |
| validation_file | string | null | Validation set file, which must be in JSONL format. If provided, validation metrics are generated periodically during fine-tuning and can be viewed in the fine-tuning results file. The training set data and validation set data must be mutually exclusive. |
| check_if_files_exist | boolean | true | Whether to check if the file already exists. |
| model | string | curie | The name of the base model to be fine-tuned. You can choose ada, babbage, curie, davinci, or a fine-tuned model created after 2022-04-21. |
| n_epochs | int | 4 | The number of epochs to train for. |
| batch_size | int | null | The batch size for training. By default, the batch size is dynamically configured to approximately 0.2% of the number of samples in the training set, with an upper limit of 256. Generally speaking, larger batch sizes are more efficient for larger data sets. |
| learning_rate_multiplier | float | null | Learning rate multiplier. The fine-tuning learning rate is the original pre-training learning rate multiplied by this value. By default, the multiplier is 0.05, 0.1, or 0.2, depending on the final batch size (larger learning rates tend to perform better with larger batch sizes). We recommend experimenting with values in the range 0.02 to 0.2 to see what produces the best results. |
| prompt_loss_weight | float | 0.01 | Weight of the loss on prompt tokens. Controls how much the model learns to generate the prompt (the completion always has a weight of 1) and can make training more stable when completions are short. If the prompt is very long relative to the completion, reducing this weight can avoid over-prioritizing learning the prompt. |
| compute_classification_metrics | boolean | false | If true, the validation set is used to calculate classification metrics such as accuracy and F-1 score at the end of each training epoch. These metrics can be viewed in the results file. |
| classification_n_classes | int | null | The number of classes in the classification task. This parameter is required for multi-class classification. |
| classification_positive_class | string | null | The positive class in a binary classification task. This parameter is needed to generate precision, recall, and F-1 score for binary classification. |
| classification_betas | array | null | If provided, the F-beta score is calculated for each specified beta value. F-beta is a generalization of F-1 and is only used for binary classification tasks. When beta is 1 (i.e. the F-1 score), precision and recall have the same weight. The larger the beta, the greater the weight of recall and the smaller the weight of precision. The smaller the beta, the greater the weight of precision and the smaller the weight of recall. |
| suffix | string | null | A string of up to 40 characters that will be added to the fine-tuned model name. |
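Two of the descriptions above are easier to follow with numbers. The sketch below approximates the default batch size rule (about 0.2% of the training set, capped at 256) and computes the standard F-beta score. The function names are hypothetical, and the exact rounding OpenAI applies to the batch size is not specified here, so treat that part as an approximation.

```python
def default_batch_size(n_training_examples):
    """Roughly 0.2% of the training set, at least 1 and capped at 256."""
    return max(1, min(256, n_training_examples * 2 // 1000))


def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


print(default_batch_size(5_000))
# With beta = 2, a high-recall classifier scores higher than a high-precision one.
print(f_beta(0.5, 1.0, beta=2.0), f_beta(1.0, 0.5, beta=2.0))
```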

Fine-tuning the model

A fine-tuning task is usually in the pending state right after it is created, because other tasks are typically queued ahead of ours in the OpenAI system and ours must wait its turn. Once training starts, it can take minutes or hours, depending on the base model chosen and the size of the data set. We can use the fine-tuning task ID to query its status:

import time

# Poll until the fine-tuning task reaches a final state
while status not in ["succeeded", "failed"]:
    time.sleep(2)
    # Get the status of the fine-tuning task
    status = openai.FineTune.retrieve(id=job_id)["status"]
    print(f'Fine-tuning task ID: {job_id}, status: {status}')

print(f'Fine-tuning task ID: {job_id} completed, end status: {status}\n')
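Since training can take hours, a fixed two-second polling interval generates many requests. One possible refinement, sketched below, is a helper with exponential backoff and a timeout. The name `wait_for_job`, its parameters, and the inclusion of `"cancelled"` as a final status are assumptions; `fetch_status` is a hypothetical callable that in practice might wrap `openai.FineTune.retrieve(id=job_id)["status"]`.

```python
import time


def wait_for_job(fetch_status, poll_seconds=2.0, max_poll_seconds=60.0,
                 timeout_seconds=3600.0, sleep=time.sleep):
    """Poll fetch_status() until it returns a final status or time runs out."""
    terminal = {"succeeded", "failed", "cancelled"}  # "cancelled" is assumed
    waited = 0.0
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {status!r} after {waited:.0f} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        # Back off so a long-running job is not polled every two seconds.
        poll_seconds = min(poll_seconds * 2, max_poll_seconds)


# Demonstration with a fake status source instead of a real API call.
fake_statuses = iter(["pending", "pending", "running", "succeeded"])
final_status = wait_for_job(lambda: next(fake_statuses), sleep=lambda s: None)
print(final_status)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.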

Evaluation model

After fine-tuning succeeds, a results file is produced. You can obtain the evaluation results with the following code:

fine_tune = openai.FineTune.retrieve(id=job_id)
result_files = fine_tune.get("result_files", [])
if result_files:
    result_file = result_files[0]
    resp = openai.File.download(id=result_file["id"])
    print(resp.decode("utf-8"))

The results file contains detailed evaluation data that we can use to judge the quality of the fine-tuned model.
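Assuming the downloaded results are CSV text with one row per training step, the latest value of each metric could be pulled out as in this sketch. The column names in the sample (`step`, `training_loss`, `validation_loss`) and the helper name `last_reported_metrics` are illustrative assumptions; inspect the header of your own results file for the actual columns.

```python
import csv
import io


def last_reported_metrics(results_csv_text):
    """Return the most recent non-empty value of every column in the CSV."""
    latest = {}
    for row in csv.DictReader(io.StringIO(results_csv_text)):
        for name, value in row.items():
            if name is not None and value not in (None, ""):
                latest[name] = float(value)
    return latest


# Hypothetical excerpt; real files may have other columns and many more rows.
sample = (
    "step,training_loss,validation_loss\n"
    "1,0.92,\n"
    "2,0.75,0.81\n"
    "3,0.61,\n"
)
metrics = last_reported_metrics(sample)
print(metrics)
```

With the download code above, the same helper could be called as `last_reported_metrics(resp.decode("utf-8"))`.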

Deployment model

If the model results are satisfactory, we can put the model into production. The fine_tuned_model in the data structure returned by the openai.FineTune.retrieve() method is the name of the fine-tuned model. You can directly use this model name in the API.

model_name = openai.FineTune.retrieve(id=job_id)["fine_tuned_model"]

response = openai.Completion.create(
  model=model_name,
  prompt="What should I eat tonight?\n",
  temperature=0.7,
  max_tokens=256,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0,
  stop=["END"]
)
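The generated text then has to be read out of the response. A minimal sketch, assuming the completion response exposes its output under `choices[0]["text"]`, is shown below with a stand-in dictionary rather than a real API call. The helper name `completion_text` is hypothetical, and trimming at the stop marker is only a defensive extra step.

```python
def completion_text(response, stop="END"):
    """Return the first choice's text, trimmed at the stop marker if present."""
    text = response["choices"][0]["text"]
    return text.split(stop, 1)[0].strip()


# Stand-in for the object returned by openai.Completion.create.
fake_response = {
    "choices": [{"text": " How about a bowl of noodles? END", "finish_reason": "stop"}]
}
print(completion_text(fake_response))
```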

Summary

The most amazing thing about ChatGPT is that it can talk like a human. However, the powerful natural language understanding and expression behind this smooth human-machine dialogue currently show only in general domains. Once it enters a specific professional field, ChatGPT will often “talk nonsense with a straight face”. In that situation, fine-tuning the model with domain-specific knowledge is the most time- and cost-efficient solution. It turns out that even a small amount of training data can lead to significant performance improvements.

As LLMs become larger, more accessible, and increasingly open source, I believe fine-tuning will become ubiquitous in natural language processing in the near future. At the same time, I am also very much looking forward to breakthroughs in edge learning, which could reduce the cost of training large models. At that point, we can look at how to retrain a large model from scratch.