NLP (70) Fine-tuning Multiple-Choice MRC with the LLAMA 2 Model

This article introduces how to fine-tune the LLAMA-2 7B model on the multiple-choice reading comprehension data set RACE middle under the Firefly large model training framework, with a significant improvement in the final results.

MRC

Machine Reading Comprehension (MRC) belongs to Question Answering (QA) among NLP tasks and is a basic and important task in the field. In machine reading comprehension, the machine is given an article and a question about that article, and must answer the question after reading the article.

According to the task form, MRC can be divided into:

  • Cloze Tests: Hide certain words in the article and let the model infer from the context which words are most likely to be the hidden ones.

  • Multiple Choice: Given an article and a question, let the model select the most likely correct answer from several candidate answers.

  • Span Extraction: Given an article and a question, let the model extract a continuous span of words from the article so that the span answers the question as well as possible.

  • Free Answering: Given an article and a question, let the model generate a word sequence that answers the question as well as possible. Unlike span extraction, the sequence is no longer restricted to sentences in the article.

Among them, the cloze task corresponds to MLM, one of the pre-training tasks of the BERT model, so BERT-style models naturally support cloze.

Span extraction is extractive reading comprehension. It is the most common form of reading comprehension in NLP tasks, and "reading comprehension" in general usually refers to this form. The task can also be handled well by the BERT family of models. Classic data sets for this task include SQuAD and SQuAD 2.0.

In free answering, the answer is not restricted to text from the original article, so it is the most difficult form. Generative models are generally used to solve it, and earlier NLP models did not perform well here. The emergence of large language models (LLMs) has greatly improved this task and made the answers far more readable, which is a revolutionary change.

Multiple choice is the familiar reading comprehension from English exams: read an article, look at a question, and choose the correct answer from fixed options. Classic multiple-choice reading comprehension data sets include RACE, SWAG, etc. This article will introduce the RACE data set in detail and the fine-tuning results of LLAMA-2 on it.

RACE data set

The RACE data set is a classic data set for multiple-choice MRC. The introduction on the RACE official website reads:

RACE is a large-scale reading comprehension dataset with more than 28,000 passages and nearly 100,000 questions. The dataset is collected from English examinations in China, which are designed for middle school and high school students. The dataset can be served as the training and test sets for machine comprehension.

The RACE data set is divided into two parts: middle (junior high school) and high (high school), as follows:

Dataset (question count) | train | dev  | test
middle                   | 25421 | 1436 | 1436
high                     | 62445 | 3451 | 3498
total                    | 87866 | 4887 | 4934

Due to data size and training time, this article only fine-tunes on the middle data set. A randomly selected sample from the middle training set looks as follows:

{'example_id': 'middle4558.txt',
 'article': '"I planted a seed. Finally grow fruits. Today is a great day. Pick off the star for you. Pick off the moon for you. Let it rise for you every day. Become candles burning myself. Just light you up, hey!... You are my little little apple. How much I love you, still no enough."\n
This words are from the popular song You Are My Little Dear Apple. Bae Seul-Ki acted as the leading dancer in the MV of the song. She loves dancing. She became crazy about hip-hop when she was a school girl.\n
Bai Seul-Ki was born on September 27, 1986. She is a South Korean singer and dancer. She is 168cm tall. She loves cooking. Her favorite food is spicy and salty. She like pink and red most. There are five members in her family---father, mother, two younger brothers and herself. She isn't married.\n
After her father and mother broke up, she lived with her mother and new daddy. She enjoys being alone.',
 'answer': 'B',
 'question': 'Bae Seul-Ki _ in the MV of the song according to the passage.',
 'options': ['sang', 'danced', 'cried', 'laughed']}
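For readers who want to reproduce this locally, the data can be loaded from the Hugging Face Hub. The sketch below assumes the public "race" dataset with the "middle" configuration; adjust the name if you use a local copy.

# A minimal sketch: load the RACE middle split with the Hugging Face datasets library.
# The dataset name "race" and the config "middle" are assumptions; point them at your own copy if needed.
from datasets import load_dataset

race_middle = load_dataset("race", "middle")   # splits: train / validation / test
print(race_middle)                             # the split sizes should match the table above

sample = race_middle["train"][0]               # fields: example_id, article, question, options, answer
print(sample["question"], sample["options"], sample["answer"])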

Build Prompt

In this article, the Prompt constructed for the RACE data set is as follows:

Read the following passage and questions, then choose the right answer from options, the answer should be one of A, B, C, D.

<passage>:
"I planted a seed. Finally grow fruits. Today is a great day. Pick off the star for you. Pick off the moon for you. Let it rise for you every day. Become candles burning myself. Just light you up, hey !... You are my little little apple. How much I love you, still no enough."
This words are from the popular song You Are My Little Dear Apple. Bae Seul-Ki acted as the leading dancer in the MV of the song. She loves dancing. She became crazy about hip-hop when she was a school girl.
Bai Seul-Ki was born on September 27, 1986. She is a South Korean singer and dancer. She is 168cm tall. She loves cooking. Her favorite food is spicy and salty. She like pink and red most. There are five members in her family---father, mother, two younger brothers and herself. She isn't married.
After her father and mother broke up, she lived with her mother and new daddy. She enjoys being alone.

<question>:
Bae Seul-Ki _ in the MV of the song according to the passage.

<options>:
A sang
B danced
C cried
D laughed

<answer>:

Trick: you can use large models such as GPT-4 to help construct good prompts, which can improve the fine-tuning result.
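For reference, the sketch below shows one way to turn each RACE sample into the prompt above and write the result into the race_train.jsonl file used by the training configuration in the next section. The jsonl record schema here (a "conversation" list with "human"/"assistant" turns) follows the common Firefly instruction-tuning format, but it is an assumption and should be checked against the Firefly version you actually use.

# A minimal sketch of converting RACE samples into a Firefly-style instruction-tuning jsonl file.
# The record schema (conversation_id / category / conversation) is an assumption; adjust it to your Firefly version.
import json
from datasets import load_dataset

def build_prompt(sample):
    # Assemble the passage, question and lettered options into the prompt template shown above.
    options = "\n".join(f"{letter} {text}" for letter, text in zip("ABCD", sample["options"]))
    return ("Read the following passage and questions, then choose the right answer from options, "
            "the answer should be one of A, B, C, D.\n\n"
            f"<passage>:\n{sample['article']}\n\n"
            f"<question>:\n{sample['question']}\n\n"
            f"<options>:\n{options}\n\n"
            "<answer>:")

race_middle = load_dataset("race", "middle")
with open("data/race_train.jsonl", "w", encoding="utf-8") as f:
    for idx, sample in enumerate(race_middle["train"]):
        record = {
            "conversation_id": idx,
            "category": "race_middle",
            # One single-turn conversation per sample; the target is just the gold option letter.
            "conversation": [{"human": build_prompt(sample), "assistant": sample["answer"]}],
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")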

LLAMA-2 model fine-tuning

In the field of large models, the LLAMA series is famous: the family and its derivative models number in the dozens, a dazzling lineup with very strong results, making them true heavyweights in the field. We will introduce the LLAMA series in detail in a future article, so this article will not go into it further.

This article uses the recently open-sourced LLAMA 2 model, 7B version, to fine-tune on the RACE middle data set under the Firefly framework. The fine-tuning parameters are as follows:

{
    "output_dir": "output/firefly-llama2-7b-race-middle",
    "model_name_or_path": "/home/jclian91/Llama-2-7b-hf",
    "train_file": "./data/race_train.jsonl",
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "learning_rate": 1e-4,
    "max_seq_length": 384,
    "logging_steps": 100,
    "save_steps": 100,
    "save_total_limit": 1,
    "lr_scheduler_type": "constant_with_warmup",
    "warmup_steps": 100,
    "lora_rank": 64,
    "lora_alpha": 16,
    "lora_dropout": 0.05,

    "gradient_checkpointing": true,
    "disable_tqdm": false,
    "optim": "paged_adamw_32bit",
    "seed": 42,
    "fp16": true,
    "report_to": "tensorboard",
    "dataloader_num_workers": 10,
    "save_strategy": "steps",
    "weight_decay": 0,
    "max_grad_norm": 0.3,
    "remove_unused_columns": false
}
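Firefly applies the LoRA setup internally, but as a reference, here is a minimal framework-agnostic sketch of how the LoRA-related parameters above (lora_rank, lora_alpha, lora_dropout) map onto the Hugging Face peft library; the target_modules list for LLaMA-style attention layers is an assumption.

# A minimal sketch (not Firefly's own code) of wiring the LoRA hyper-parameters above into peft.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "/home/jclian91/Llama-2-7b-hf"           # same path as model_name_or_path above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)  # optional 4-bit (QLoRA-style) loading, pairing with paged_adamw_32bit

lora_config = LoraConfig(
    r=64,                                             # lora_rank
    lora_alpha=16,                                    # lora_alpha
    lora_dropout=0.05,                                # lora_dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed LLaMA attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()                    # only the LoRA adapters are trainable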

In the article NLP (63) Fine-tuning a person-relation classification task with the Baichuan-7B model, we described in detail the steps of fine-tuning, evaluation, web service deployment, etc. with Firefly, so they are not repeated here. You can also refer to the GitHub project llama-2-multiple-choice-mrc.
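As a reference for the evaluation step, accuracy on the test split can be obtained with a simple generate-and-compare loop such as the sketch below; the model directory is an assumption, and build_prompt is the helper defined in the Build Prompt section above.

# A minimal evaluation sketch: generate an answer letter for each test sample and compare it with the gold answer.
# The model directory is assumed; build_prompt comes from the data-conversion sketch in the Build Prompt section.
import re
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "output/firefly-llama2-7b-race-middle"    # assumed path of the merged fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16, device_map="auto")

test_set = load_dataset("race", "middle", split="test")
correct = 0
for sample in test_set:
    inputs = tokenizer(build_prompt(sample), return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=4, do_sample=False)
    completion = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    match = re.search(r"[ABCD]", completion)          # take the first option letter the model emits
    if match and match.group() == sample["answer"]:
        correct += 1
print(f"accuracy: {correct / len(test_set):.4f}")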

The accuracy on the RACE middle test set is 86.91%! The result is impressive, and it comes from LLAMA 2 without any hyper-parameter tuning. The author previously fine-tuned BERT models on the RACE middle data set: the evaluation results were generally only about 72%, and BERT Large reached only about 75%, even when trained on middle + high combined.

The RACE data set rankings are as follows:

[Figure: RACE data set leaderboard]

Model effectiveness evaluation

Finally, we evaluate the model on a new article. We randomly selected a junior high school English reading comprehension passage from the Internet, as follows:

Edward rose early on the New-year morning. He looked in every room and wished a Happy New Year to his family. Then he ran into the street to repeat that to those he might meet.

When he came back, his father gave him two bright, new silver dollars.

His face lighted up as he took them. He had wished for a long time to buy some pretty books that he had seen at the bookstore.

He left the house with a light heart, expecting to buy the books. As he ran down the street, he saw a poor family.

"I wish you a Happy New Year." said Edward, as he was passing on. The man shook his head.

"You are not from this country." said Edward. The man again shook his head, for he could not understand or speak his language. But he pointed to his mouth and to the children shaking with cold, as if (seems) to say , “These little ones have had nothing to eat for a long time.”

Edward quickly understood that these poor people were in trouble. He took out his dollars and gave one to the man, and the other to his wife.

They were excited and said something in their language, which doubtless meant, “We thank you so much that we will remember you all the time.”

When Edward came home, his father asked what books he had bought. He hung his head for a moment, but quickly looked up.

"I have bought no books", said he. "I gave my money to some poor people, who seemed to be very hungry then." He went on, "I think I can wait for my books till next New Year."

“My dear boy,” said his father, “here are some books for you, more as a prize for your goodness of heart than as a New-year gift”

"I saw you give the money cheerfully to the poor German family. It was nice for a little boy to do so. Be always ready to help others and every year of your life will be to you a Happy New Year."

The four questions are as follows:

48. Edward expected to _________ with the money he got from his father.

A. help the poor family B. buy something to eat

C. buy some pretty books D. learn another language

49. Why did the poor man shake his head when Edward spoke to him?

A. He couldn’t understand the boy B. He wouldn’t accept the money

C. He didn’t like the boy’s language D. He was too cold to say anything

50. How much did Edward give to the poor family?

A. One dollar B. Two dollars C. Three dollars D. Four dollars

51. We know that Edward____________ from the passage?

A. got a prize for his kind heart B. had to buy his books next year

C. bought the books at the bookstore D. got more money from his father

The answer given by the fine-tuned model is CABA, which matches the reference answers!

Summary

Multiple-choice reading comprehension has been a focus of the author's over the past few years. In the BERT era, the BERT model's results were not very good; Megatron-BERT pushed the metric to about 90%, but that took a great deal of effort and the result is not easy to reproduce. The author kept working in this direction but, limited by machine resources and model size, never quite succeeded. The emergence of the LLAMA series of models makes everything easy and comfortable. Although the result here is not SOTA, it is undoubtedly satisfactory. Large models are undoubtedly the future trend of artificial intelligence and the trendsetters of our era!

This article mainly introduced MRC and the RACE data set, and showed how to fine-tune the LLAMA 2 model under the Firefly training framework to achieve a satisfactory result.

Welcome to follow my public account NLP Fantasy Journey, where original technical articles are published first.



Welcome to follow my Knowledge Planet “Natural Language Processing Fantasy Journey”. The author is working hard to build his own technical community.