NLP large model fine-tuning Q&A

When should you use a BERT model, and when should you use a large model such as LLaMA or ChatGLM? How do you choose?

Answer: BERT-base consists of a multi-layer bidirectional Transformer encoder with 12 layers, 768 hidden units, and 12 attention heads, for a total of roughly 110 million parameters. It works well on NLU (Natural Language Understanding) tasks, can be deployed on a single GPU, and is fast: a V100 GPU can process more than 2,000 items per second.

ChatGLM-6B and LLaMA-7B are large models with 6 billion and 7 billion parameters respectively. They can handle essentially all NLP tasks with good results, but the deployment cost is high: a GPU with large memory is required and inference is slow, on the order of 1 second per request on a V100.

So the suggestions are:
1) NLU-related tasks such as entity recognition, information extraction, and text classification are handled very well by the BERT model; there is no need for a large model;
2) For NLG tasks: use ChatGLM-6B for purely Chinese tasks; if you need to handle both Chinese and English, use chinese-llama-plus-7b or chinese-alpaca-plus-7b-hf.

What is the difference between ChatGLM-6B and LLaMA-7B?

Answer: ChatGLM-6B uses the GLM architecture, a prefix LM: in its attention mask, tokens in the prefix part can see each other (bidirectional attention within the prefix). The model was designed from the start with both NLU and NLG tasks in mind.
GLM model structure: (diagram omitted)

LLaMA-7B uses the GPT architecture, a causal LM: in its attention mask, a token can only see the tokens before it, one-way from left to right, decoder-only.

Among current mainstream large models, T5 uses a bidirectional (encoder-decoder) structure and GLM is a prefix LM; almost all the others are causal LMs. The sketch below illustrates the difference between the two mask types.
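
To make the difference concrete, here is a minimal sketch in plain PyTorch (illustrative sequence lengths only, not any model's actual implementation) that builds a causal mask and a prefix-LM mask; 1 means the position may be attended to.

```python
import torch

seq_len, prefix_len = 6, 3  # illustrative lengths

# Causal LM (GPT / LLaMA style): each token attends only to itself and earlier tokens.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# Prefix LM (GLM style): tokens inside the prefix attend to each other bidirectionally;
# tokens after the prefix still attend causally.
prefix_mask = torch.tril(torch.ones(seq_len, seq_len))
prefix_mask[:prefix_len, :prefix_len] = 1  # full attention within the prefix

print(causal_mask)
print(prefix_mask)
```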

What is the function of instruction fine-tuning?

Answer: There are the following benefits:

  1. Alignment with human intent: the model can understand natural-language instructions and conversations (more human).
  2. After fine-tuning, customized versions of GPT-3 show clear improvements across different applications. OpenAI reported that accuracy in specific applications can rise from 83% to 95%, the error rate can be cut by 50%, and accuracy on elementary-school math problems can increase 2-4x (more accurate).
  3. Standing on the shoulders of giants: by fine-tuning directly on the 175-billion-parameter model, many developers no longer need to train their own AI models from scratch (more efficient).

To understand the role of instruction fine-tuning, consider the following situation.

Suppose we have a pretrained model that is very capable, but the problem is that it does not necessarily know what humans want it to do. For example:

If we ask GPT a question: Which is the highest mountain in the world?
The answer we want is: Mount Everest.
But the pre-training data may contain text like: “Which mountain is the highest in the world? Whichever child knows, tell the teacher and get a little red flower.”
So GPT is likely to reply “Whichever child knows, tell the teacher and get a little red flower,” because the pre-training objective is simply next-word prediction.

This is where the necessity of fine-tuning shows up.

What is the fine-tuning method? How to fine-tune?

Answer: The current mainstream fine-tuning methods fall into two categories: fine-tune (full-parameter fine-tuning) and prompt-tune (parameter-efficient tuning).

  1. fine-tune, also called full-parameter fine-tuning, is the approach BERT fine-tuning has always used: all parameter weights are updated to adapt to the domain data, and the results are good.
  2. prompt-tune covers p-tuning, LoRA, prompt tuning, AdaLoRA and other delta-tuning methods in which only part of the model parameters participate in fine-tuning. Training is fast and uses less GPU memory; the results may be slightly worse than full fine-tuning (FT), but are often comparable.

Lianjia's experiments in the BELLE technical report “A Comparative Study between Full-Parameter and LoRA-based Fine-Tuning on Chinese Instruction Data for Instruction Following Large Language Model” show that FT is slightly better than LoRA.

The parameter-efficient fine-tuning (PEFT) paper “ADAPTIVE BUDGET ALLOCATION FOR PARAMETER-EFFICIENT FINE-TUNING” reports that AdaLoRA is slightly better than FT.

A few more points about LoRA fine-tuning:

  1. The choice of base model has a significant impact on the effectiveness of LoRA-based fine-tuning.
  2. More training data generally gives better results;
  3. The LoRA fine-tuning method has more obvious advantages when the model is larger.

This conclusion refers to the technical report “A Comparative Study between Full-Parameter and LoRA-based Fine-Tuning on Chinese Instruction Data for Instruction Following Large Language Model”.
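
For reference, here is a minimal LoRA fine-tuning setup sketch using the Hugging Face peft library; the model path, target modules, and hyperparameters are placeholders/assumptions, not values taken from the report.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("path/to/base-llama-7b")  # placeholder path

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # LoRA rank
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # typical choice for LLaMA-style attention
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the small LoRA matrices are trainable
```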

Why can the LoRA fine-tuning method speed up training?

Answer: There are several reasons:

  1. Only some parameters are updated: for example, the original LoRA paper chooses to adapt only the self-attention weights, and in practice we can also choose to update the parameters of only some layers;
  2. Reduced communication time: since fewer parameters are updated, the amount of data to be transmitted (especially in multi-GPU training) is smaller, which shortens transmission time;
  3. Various low-precision acceleration techniques can be used, such as FP16, FP8, or INT8 quantization.

These three factors do speed up training, but they are not unique to LoRA; almost all parameter-efficient methods share them. LoRA's particular advantage is that its low-rank decomposition is very intuitive, it matches full fine-tuning in many scenarios, and it adds no extra cost at inference time (see the sketch below).
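
As an illustration of why there is no extra inference cost, here is a tiny sketch (hypothetical shapes, scaling factor omitted) of the LoRA update being merged back into the frozen weight:

```python
import torch

d, k, r = 1024, 1024, 8        # hypothetical layer shape and LoRA rank
W = torch.randn(d, k)          # frozen pretrained weight
A = torch.randn(r, k) * 0.01   # trainable low-rank factor
B = torch.zeros(d, r)          # initialized to zero so the update starts at 0

# During training the effective weight is W + B @ A, and only A and B receive gradients.
# For inference, the update is merged once, so there is no extra computation per forward pass:
W_merged = W + B @ A
```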

How to train your own large model?

Answer: If I were building a SOTA Chinese GPT-style large model now, I would split the work into two steps: 1) secondary pre-training of LLaMA-65B on Chinese text data; 2) adding CoT and instruction data and running SFT with FT + LoRA.

The training recipe is generally divided into two stages:
Stage 1: expand the domain vocabulary (for example, a financial-domain vocabulary) and continue pre-training the LLaMA model on a large corpus of domain documents;
Stage 2: build an instruction fine-tuning dataset and perform instruction fine-tuning on top of the Stage-1 pre-trained model. It is also possible to convert the instruction fine-tuning data into plain document format and mix it into the Stage-1 incremental pre-training, so that the model is exposed to downstream task information early.

There is also a low-cost option, because we have the powerful LoRA toolkit: LoRA can be used for training in both Stage 1 and Stage 2, and a single RTX 3090 is enough.

Stage-1 data format:

[
    {
      "content":"The Supreme People's Court of the People's Republic of China's Notice of Rejection of the Appeal (2022) Supreme Court Xingshen No. 122 Yuan Mouyin and Yuan Moucai: You are suing Lishui District, Nanjing City, Jiangsu Province for the case of intentional injury of the defendant Yuan Deyin in the original trial The People's Court (2014) Li Xing Chu Zi No. 268 Criminal Judgment and the Nanjing Intermediate People's Court (2015) Ning Xing Zhong Zi No. 433 Criminal Ruling were not satisfied, because the victim Zhu Kuanrong's CT (No. 136678) report during hospitalization did not show that The 4th, 5th, 6th, 7th, and 8th ribs on the left side were fractured, and the discharge records and CT reports No. 137470 and No. 143006 were all forged. On the grounds that the formation time was wrongly determined, an appeal was lodged with this court, requesting to revoke the original judgment and retry the case according to law. After careful review, this court formed a collegial panel according to law and concluded that the original trial found that the defendant Yuan Deyin beat the victim Zhu Kuanrong to his left due to a neighborhood dispute. There are multiple rib fractures in the chest, which constitutes a minor injury of the second grade, and his behavior constitutes the crime of intentional injury, which is not inappropriate. Regarding the grounds for your appeal that the evidence that the victim Zhu Kuanrong was found to have a minor injury of the second grade in the original trial was forged. First, according to the No. 136678 you provided According to the CT report, Zhu Kuanrong was diagnosed with multiple left rib fractures by CT examination on February 12, 2015 when he was admitted to the hospital. Although the CT report did not clarify the specific location of the fractures, it was consistent with the "multiple left rib fractures" recorded in the discharge record ( 4, 5, 6, 7, 8)" are not contradictory. According to the "Appraisal Standards for the Degree of Human Injury", two or more rib fractures constitute a minor injury of the second grade, so the appraisal opinion determined that Zhu Kuanrong's injury constituted a minor injury of the second grade It is not inappropriate. Secondly, you claim that the discharge records and CT reports No. 137470 and No. 143006 are forged, but you have not provided evidence to prove it. Even if these two CT reports are excluded, it will not affect the determination of the facts of this case. Thirdly, our court The object of the review of the appeal is the judgment or ruling that has already taken legal effect. The effective judgment of this case is the (2014) Li Xing Chu Zi No. 268 Criminal Judgment and the (2015) Ning Xing Zhong Zi No. 433 Criminal Ruling, the senior people of Jiangsu Province The notice of dismissal of the appeal issued by the court is not within the scope of our court’s review. In summary, this court believes that your appeal does not meet the circumstances that should be retried as stipulated in Article 253 of the Criminal Procedure Law of the People’s Republic of China. Rejected. I hope you will respect the effective judgment of the People's Court, consciously accept the judgment and suspend the lawsuit. Hereby notice. December 14, 2022"
    }
]

Stage-2 data sample:

[
    {
      "instruction": "How long is the sentence for drunk driving and bumping into someone?",
      "input": "",
      "output": "Article 133 of the "Criminal Law" stipulates: Whoever violates the regulations on traffic and transportation management, thereby causing a major accident, causing serious injury, death, or heavy loss of public or private property, shall be sentenced to fixed-term imprisonment of not more than three years or criminal detention; whoever escapes after causing a traffic accident and causes serious injury shall be sentenced to fixed-term imprisonment of not less than three years but not more than seven years. Whoever escapes after causing a traffic accident and causes death shall be sentenced to fixed-term imprisonment of not less than seven years."
    },
    {
      "instruction": "What will happen if you are caught by the police for stealing a dog?",
      "input": "",
      "output": "Depending on the value of the stolen dog, it may be treated as theft. If the value reaches the criminal threshold, it constitutes the crime of theft and criminal responsibility applies; if it does not, it is handled as a public-security violation with a fine or administrative detention. Roughly, if the value is under one thousand yuan it does not constitute a crime; if it exceeds one thousand yuan it may constitute a crime."
    }
]
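
For illustration, here is one common way to turn such an instruction sample into the prompt string used during SFT. The Alpaca-style template below is an assumption for the sketch, not necessarily the exact template used by any particular project; samples with a non-empty "input" field would use a variant that also includes that field.

```python
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_example(sample: dict) -> str:
    """Concatenate prompt and answer; during SFT the loss is usually computed only on the answer part."""
    prompt = ALPACA_TEMPLATE.format(instruction=sample["instruction"])
    return prompt + sample["output"]

sample = {
    "instruction": "How long is the sentence for drunk driving and bumping into someone?",
    "input": "",
    "output": "Article 133 of the Criminal Law stipulates ...",
}
print(build_example(sample))
```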

A reference example for legal model training: https://github.com/pengxiao-song/LaWGPT

What experience do you have in training large Chinese models?

Answer: The Lianjia technical report “Towards Better Instruction Following Language Models for Chinese: Investigating the Impact of Training Data and Evaluation” introduces the training and evaluation methods of their open-source models and compares ablation experiments for each factor.

Ablation experiment conclusion:

  1. Expanding the Chinese vocabulary improves the model's ability to understand Chinese and gives better results.
  2. The higher the data quality, the better: improving dataset quality improves model performance.
  3. Data language distribution: adding Chinese data works better than not adding it.
  4. The larger the data size and the higher its quality, the better the results; a large amount of high-quality fine-tuning data gives the most obvious improvement. On data volume: increasing the amount of training data has been shown to significantly improve performance, although part of this large gain may come from the similar distribution between belle-3.5 and the evaluation data; the categories, topics, and complexity of the evaluation data strongly influence the evaluation results.
  5. After vocabulary expansion, the evaluation performance of LLaMA-7B-EXT reaches 0.762/0.824 ≈ 92% of the reference score.

Their technical report shows that training Chinese large models is feasible, although a gap with ChatGPT remains. It should also be pointed out that the subsequent RLHF stage is very important; I list it here to inspire others.

How many pieces of data are needed for fine-tuning?

Answer: It depends on whether the data distribution of the fine-tuning task is consistent with that of the pre-training data. If the distributions are consistent, 100 examples are enough; if they are very different, more data is needed, preferably more than 1,000 examples.

If your task is complex, or the downstream domain is relatively niche (for example, drug-name recognition), you need more supervised data. Also, a large model cannot memorize the task in one pass when fine-tuning: with 100 fine-tuning examples, around epochs = 20 is enough to fit the task stably, as in the sketch below.
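
A minimal sketch of the corresponding trainer settings with the Hugging Face Trainer; the values simply mirror the numbers mentioned in this document (epochs = 20, a conservative learning rate), and the output path and batch sizes are placeholders.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./sft-output",         # placeholder path
    num_train_epochs=20,               # ~100 examples usually need multiple epochs
    learning_rate=2e-5,                # conservative LR, see the catastrophic-forgetting section
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
)
```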

What is the reason for emergent ability?

Answer: According to previous analysis and paper summary, there are roughly two conjectures:

  1. The evaluation metric of the task is not smooth enough;
  2. Complex tasks vs. subtasks. This is actually easy to understand: suppose a task T consists of 5 subtasks, and each subtask's metric grows smoothly with model size from 40% to 60%, while the metric of the overall task only rises from 1.1% to 7%. Macroscopically this looks like emergence, even though the subtask performance is in fact growing smoothly (see the quick check below).
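
A quick sanity check of the numbers above, assuming (as a simplification) that the overall task succeeds only when all 5 subtasks succeed independently:

```python
p_before, p_after, n_subtasks = 0.40, 0.60, 5

# Smooth per-subtask growth produces an apparently abrupt jump in the composite task.
print(p_before ** n_subtasks)  # ~0.0102 -> about 1%
print(p_after ** n_subtasks)   # ~0.0778 -> about 7.8%
```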

How to continue training on the existing LoRA model?

Answer: My understanding of the question is: an existing LoRA model has been trained on only part of the data. If you now want to train on another part, should you continue training on this LoRA, merge it into the base model and then apply a new LoRA on top, or train a LoRA from scratch?

I think merging the previous LoRA into the base model and then continuing training is enough. To retain the previous knowledge and abilities, some of the earlier training data should be mixed in when training the new LoRA. Besides, starting over from scratch every time is costly.
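
A minimal sketch of the merge step with the peft library; the model and adapter paths are placeholders.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("path/to/base-model")      # placeholder
model = PeftModel.from_pretrained(base, "path/to/old-lora-adapter")    # placeholder

# Merge the old LoRA weights into the base weights, then train a new LoRA on top of the result.
merged = model.merge_and_unload()
merged.save_pretrained("path/to/merged-base-for-next-lora")
```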

How to evaluate the large model?

A: The existing SuperGLUE, GLUE, and Chinese CLUE benchmarks are not well suited to evaluating large models. Evaluation can instead take reasoning ability and multi-turn dialogue ability as the core.

The paper “C-EVAL: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models” proposes a Chinese large language model evaluation benchmark: C-EVAL, which has certain reference significance.
GitHub: https://github.com/SJTU-LIT/ceval
Paper: https://arxiv.org/pdf/2305.08322v1.pdf
Leaderboard: https://cevalbenchmark.com/static/leaderboard.html

How is the “honest” principle of large models realized? How does the model judge whether an answer is based on knowledge it was trained on, and how is this ability trained?

A: Large models should follow the helpful, honest, and harmless principles.
The following kinds of training samples can be deliberately constructed to strengthen the model's honesty; this can be regarded as a trick (a small illustrative sketch follows the list):

  1. During fine-tuning, build a knowledge Q&A training set in which the model answers "I don't know" when it does not know, reinforcing the honesty principle;
  2. For reading-comprehension questions, answer only what appears in the given material and refuse to answer what does not, rather than making things up.
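
A small illustrative sketch of what such samples might look like; these examples are hypothetical and not taken from any real dataset.

```python
honesty_samples = [
    {
        "instruction": "Which team won the 2087 World Cup?",
        "input": "",
        "output": "I don't know. This question is beyond the knowledge I was trained on.",
    },
    {
        "instruction": "According to the passage, what color is the car?",
        "input": "The passage only mentions that the car is parked outside.",
        "output": "The passage does not say what color the car is, so I cannot answer.",
    },
]
```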

Does each professional field need its own large model?

A: Yes, domain-specific large models for every industry are the trend.

Is the large model capable of reasoning?

Answer: Large models have reasoning capabilities. There are two manifestations:

  1. ChatGPT has in-context correction ability: if it makes a mistake and you point it out, ChatGPT can “understand” where the mistake is and correct it in the right direction. In-context correction is much harder than in-context learning. Moreover, the more detailed and precise the description, the better ChatGPT answers, even though a more detailed description is harder to match directly in the pre-training text.
  2. When you ask ChatGPT about content that does not exist on the Internet, it can still give good answers (for example, using ChatGPT to learn modeling); ChatGPT can infer your intent from limited information; and you can define a new game rule for ChatGPT to play with you, and it can understand and follow it.

Which stage injects knowledge: pre-training or fine-tuning?

Answer: Knowledge is injected in the pre-training stage; fine-tuning then trains on specific tasks so that the general knowledge from pre-training is combined with the requirements of the specific task, letting the model perform better on that task.

If I want the model to learn knowledge of a certain field or industry, should I pre-train or fine-tune?

Answer: You can combine pre-training and fine-tuning: first continue pre-training on document (chapter) data to give the model broad domain knowledge, then fine-tune on question-answer data so that it better learns the knowledge of the specific field.
Note that for a GPT-style, decoder-only large model, pre-training and fine-tuning are implemented the same way: both are language-model training that updates the parameters. If the sample set is small and you do not have a large amount of document data, knowledge can be injected through fine-tuning alone, so there is no need to worry too much about pre-training. Moreover, if the target domain's distribution does not differ much from the pre-training corpus, secondary pre-training is unnecessary.

How to fine-tune the model for multi-turn dialogue tasks?

Answer: It is the same as normal instruction fine-tuning, except that when organizing the data, the historical chat turns are concatenated and fed to the model as context, as in the samples below.

{"prompt": "Great Wall h3 fan does not turn. The relay is good. The fuse is good and the sensor is new. The fan is also new. This is why. The relay lacks a signal line", "response": "It can be read with a computer Data flow? What is the water temperature", "history": []}
{"prompt": "95", "response": "How is the temperature difference between the upper and lower water pipes? Is the air exhausted?", "history": [["The Great Wall h3 fan does not turn. The relay is OK. The fuse A good sensor and a new fan are also new. This is why. The relay lacks a signal line", "Can I use a computer to read the data flow? What is the water temperature"]]}

{"prompt": "Yes. Both the upper and lower water pipes are fine", "response": "Then you need to check the circuit. Generally, the fan relay is controlled by the computer. If there is an open circuit in the circuit or the computer is broken, it will The relay does not pull in!", "history": [["Great Wall h3 fan does not turn. The relay is good. The fuse is good, the sensor is new and the fan is new. This is why. The relay is missing a signal line", " Can I use a computer to read the data stream? What is the water temperature"], ["95", "How is the temperature difference between the upper and lower water pipes? Has the air been drained?"]]}

The ability of a fine-tuned model deteriorates: what is catastrophic forgetting?

Answer: Catastrophic forgetting means that after learning new knowledge, the model almost completely forgets what it had learned before. Some students raised this question when fine-tuning ChatGLM-6B: the original ChatGLM-6B answers knowledge questions such as “what to do about insomnia” correctly, but after it is fine-tuned on a specific-task dataset (such as Chinese spelling correction, CSC), its answer to “what to do about insomnia” becomes irrelevant.

My understanding is that ChatGLM-6B is a model trained through the “pre-training - SFT - RLHF” process, and thousands of instruction-tuning tasks were already covered in the SFT stage. Now we are merely adding one new type of instruction data; the amount of fine-tuning data is small and the task type is single, so it should not have a large impact on the model's original capabilities, and I do not think it truly causes catastrophic forgetting.

There are several tricks to improve this phenomenon:

  1. Adjust the fine-tuning hyperparameters: do not set the initial fine-tuning learning rate too high; lr = 2e-5 or smaller avoids this problem, and it should not exceed the learning rate used during pre-training.
  2. Fine-tuning training-set construction: select multiple representative tasks, and do not make the number of instances per task too large (e.g., a few hundred), otherwise it may cause overfitting and hurt model performance.
  3. Balance the proportions of the different tasks and cap the size of the whole dataset (usually thousands to tens of thousands of examples) so that no single large dataset overwhelms the overall distribution (see the data-mixing sketch below).
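
A minimal data-mixing sketch reflecting points 2 and 3: cap each task at a few hundred instances and mix them. The task names and pool sizes are placeholders.

```python
import random

random.seed(0)

# Hypothetical task pools; in practice each is a list of instruction-tuning samples.
task_datasets = {
    "spelling_correction": [{"instruction": f"csc example {i}"} for i in range(5000)],
    "knowledge_qa": [{"instruction": f"qa example {i}"} for i in range(800)],
    "summarization": [{"instruction": f"summ example {i}"} for i in range(2000)],
}

MAX_PER_TASK = 300   # cap each task at a few hundred instances
mixed = []
for name, samples in task_datasets.items():
    random.shuffle(samples)
    mixed.extend(samples[:MAX_PER_TASK])

random.shuffle(mixed)  # final mixed SFT set stays at a few thousand examples at most
print(len(mixed))
```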

How much GPU memory is needed to fine-tune the model?

Answer:

What are the training sets for large models?

Answer: The pre-training dataset togethercomputer/RedPajama-Data-1T comes from the “RedPajama” open-source project, which consists of three parts:

  • A high-quality, large-scale, high-coverage pre-training dataset;
  • Base models trained on this pre-training dataset;
  • Instruction-tuning datasets and models that are safer and more reliable than the base models.

The pre-training dataset RedPajama-Data-1T has been open-sourced and includes seven subsets. The token counts obtained after preprocessing roughly match the numbers Meta reported in the original LLaMA paper, and the data-preprocessing scripts are also open source.
The full RedPajama-Data-1T dataset requires about 3 TB of storage compressed and 5 TB decompressed.
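
For reference, the dataset can be streamed from the Hugging Face Hub instead of downloading the full ~5 TB; this is a sketch, and the subset name ("arxiv") and field name are assumptions to be checked against the dataset card.

```python
from datasets import load_dataset

ds = load_dataset("togethercomputer/RedPajama-Data-1T", "arxiv",
                  split="train", streaming=True)

for i, example in enumerate(ds):
    print(example["text"][:200])   # preview the raw text field
    if i >= 2:
        break
```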

CoT fine-tuning dataset: Alpaca-CoT includes commonly used alpaca, CoT and other datasets, including Chinese.

How should the generation parameters be set?

Answer: Suggestions for generation-time prediction and parameter tuning:

  • It is recommended to tune top_p, num_beams, repetition_penalty, temperature, and to set do_sample=True;
  • If the generated text is repetitive, increase repetition_penalty;
  • If the task's outputs are fairly uniform and there are not many samples, the temperature can be lowered appropriately so that generations look more like the training set; to closely reproduce the training set, temperature=0.01 is enough.

The above are empirical values; the exact settings depend on the task and are not fixed.

Parameter explanation:

top_p=0.9,
#Moderately increase the probability threshold of nucleus sampling to increase the quantity of candidate tokens and increase generation diversity.

temperature=1.0,
#The previous low temperature parameter could lead to a severe polarization in the probability distribution of generated words, which degenerates the generation strategy into greedy decoding.

do_sample=True,
#do_sample parameter is set to False by default. After setting to True, the generation methods turn into beam-search multinomial sampling decoding strategy.

no_repeat_ngram_size=6,
#Set the probability of generating any n-gram (here, any 6-gram) that has already appeared to 0, ensuring no 6-gram occurs twice. This value is an empirical first attempt.

repetition_penalty=1.8,
#For tokens that have already appeared, reduce the probability of generating them again in subsequent predictions via the repetition_penalty parameter. This value is an empirical first attempt.
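
Putting the parameters above together, a minimal generation sketch with Hugging Face transformers; the model name and prompt are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/your-model")      # placeholder
model = AutoModelForCausalLM.from_pretrained("path/to/your-model")   # placeholder

inputs = tokenizer("What is the penalty for drunk driving?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,
    temperature=1.0,
    no_repeat_ngram_size=6,
    repetition_penalty=1.8,
    max_new_tokens=256,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```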

Why does ChatGPT use RL, and what is the role of PPO?

A: This is an open question, currently under discussion.

First, to explain that the model's performance does improve after RL:
To quote Yoav Goldberg's recent opinions from his GitHub gist (a lot of substantive content); if you are interested, you can read the original: https://gist.github.com/yoavg/6bff0fecd65950898eba1bb321cfbd81

  1. Diversity perspective: RL can produce more diverse answers. With supervised learning (SL), any slight deviation from the training sample during training is penalized, yet human language expresses the same meaning in many different ways, so SL can confuse models in this respect, especially models that already generalize well.
  2. Negative-feedback perspective: supervised learning provides only positive feedback, while RL can also provide negative feedback signals. From the viewpoint of formal learning theory, negative feedback signals are more powerful (to be honest, I don't fully understand this point).
  3. Self-knowledge perspective: the most convincing one. Among ChatGPT's main application scenarios today, a large category is "knowledge acquisition" questions, and this has a lot to do with RL training.

For such questions, we want the model to give a truthful and convincing answer, and to refuse to answer rather than make things up when it does not know.

Ideally, the model answers based on its own internal knowledge, that is, it knows what it knows and what it does not know. This requires RL training, because supervised learning tends to teach the model to lie.

Why? Because two situations arise in SL:
The model itself has the relevant knowledge and knows the answer. During training it connects the answer with the question, and afterwards it can answer similar questions from its own knowledge; generalization looks good. This is the good case.
The model itself has no relevant knowledge; it is learning something completely new. Because the amount of data is small, it may simply memorize the few examples. When it later encounters similar questions, it still has no relevant knowledge, yet it still tends to answer, so it can only make things up. This is why SL teaches the model to lie.

Here is also John Schulman's talk at UC Berkeley: https://www.youtube.com/watch?v=hhiLw5Q_UFg; interested readers can watch it.