Langchain-Chatchat project: 4.1-P-Tuning v2 implementation process

What are the common parameter-efficient fine-tuning (Parameter-Efficient Fine-Tuning, PEFT) methods? Mainly the Prompt series and the LoRA series. This article focuses on the P-Tuning v2 fine-tuning method. The methods are as follows:

  • Prompt series, for example, Prefix Tuning (2021.01, Stanford), Prompt Tuning (2021.09, Google), P-Tuning (2021.03, Tsinghua), P-Tuning v2 (2022.03, Tsinghua);
  • LoRA series, for example, LoRA (2021.11, Microsoft), AdaLoRA (2023.03, Microsoft), QLoRA (2023.05, University of Washington);
  • There are also methods that are harder to classify, such as BitFit, Adapter Tuning and its variants, MAM Adapter, UniPELT, etc.

1. Working principle of P-Tuning v2
1. How to design Hard/Soft Prompt Tuning
Prompt engineering has evolved from hard prompts, designed manually or semi-automatically in a discrete space, to soft prompts, designed in a continuous differentiable space. The advantage of soft prompts is that the prompt parameters for different tasks can be learned through end-to-end optimization.
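As a toy illustration of the difference (hypothetical code, not from the repository): a hard prompt is a hand-written discrete text template, while a soft prompt is a trainable embedding tensor optimized end to end:

# Toy illustration (hypothetical, not from the repository): hard prompt vs soft prompt.
import torch

# Hard prompt: a hand-written discrete template; it can only be tuned by rewording it.
hard_prompt = "The sentiment of the review '{text}' is [MASK]."

# Soft prompt: pre_seq_len trainable vectors in the model's embedding space,
# optimized end to end by gradient descent while the backbone stays frozen.
pre_seq_len, hidden_size = 128, 768
soft_prompt = torch.nn.Parameter(torch.randn(pre_seq_len, hidden_size))
print(soft_prompt.requires_grad)  # True: learned, not hand-written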
2. Working principle and shortcomings of P-Tuning
P-Tuning applies continuous prompts only at the input layer of the pre-trained model; the layers after the input layer do not incorporate continuous prompts.

3. How P-Tuning v2 solves the shortcomings of P-Tuning
P-Tuning v2 applies continuous prompts to each layer of the pre-trained model, not just the input layer.
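A rough calculation of why this is still parameter-efficient (assuming BERT-base sizes and the prefix length used later in this article): the trainable prefix parameters amount to pre_seq_len * 2 * num_layers * hidden_size.

# Rough count of trainable prefix parameters for P-Tuning v2 on BERT-base
# (pre_seq_len=128, 12 layers, hidden size 768), without the prefix-projection MLP.
pre_seq_len, num_layers, hidden_size = 128, 12, 768
prefix_params = pre_seq_len * num_layers * 2 * hidden_size
bert_base_params = 110_000_000  # approximate parameter count of BERT-base
print(prefix_params)                      # 2,359,296
print(prefix_params / bert_base_params)   # ~0.02, i.e. about 2% of the backbone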

2. P-Tuning v2 implementation process
1. Overall project structure
The source code is from reference [4]; the project structure is as follows:

The parameters are explained as follows:
(1) --model_name_or_path L:/20230713_HuggingFaceModel/20231004_BERT/bert-base-chinese: BERT model path
(2) --task_name qa: task name
(3) --dataset_name squad: dataset name
(4) --do_train: run training
(5) --do_eval: run evaluation
(6) --max_seq_length 128: maximum sequence length
(7) --per_device_train_batch_size 2: training batch size per device
(8) --learning_rate 5e-3: learning rate
(9) --num_train_epochs 10: number of training epochs
(10) --pre_seq_len 128: prefix sequence length
(11) --output_dir checkpoints/SQuAD-bert: checkpoint output directory
(12) --overwrite_output_dir: overwrite the output directory
(13) --hidden_dropout_prob 0.1: hidden-layer dropout probability
(14) --seed 11: random seed
(15) --save_strategy no: save strategy
(16) --evaluation_strategy epoch: evaluation strategy
(17) --prefix: use the P-Tuning v2 (prefix) method
The execution code is as follows:

python3 run.py --model_name_or_path L:/20230713_HuggingFaceModel/20231004_BERT/bert-base-chinese --task_name qa --dataset_name squad --do_train --do_eval --max_seq_length 128 --per_device_train_batch_size 2 --learning_rate 5e-3 --num_train_epochs 10 --pre_seq_len 128 --output_dir checkpoints/SQuAD-bert --overwrite_output_dir --hidden_dropout_prob 0.1 --seed 11 --save_strategy no --evaluation_strategy epoch --prefix

2. Code execution process
(1) P-tuning-v2/run.py

  • Select tasks.qa.get_trainer based on task_name == "qa"
  • Obtain the trainer from get_trainer, then run training, evaluation and prediction (see the sketch below)
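The control flow of run.py is roughly as follows (a simplified sketch, not the verbatim source; the argument grouping and the exact return value of get_trainer may differ in the repository):

# Simplified sketch of the control flow in run.py (paraphrased, not the verbatim source).
from transformers import set_seed

from arguments import get_args

def main():
    args = get_args()                                  # parse command-line arguments
    model_args, data_args, training_args, qa_args = args
    set_seed(training_args.seed)                       # --seed 11

    if data_args.task_name.lower() == "qa":            # --task_name qa
        from tasks.qa.get_trainer import get_trainer   # dispatch to the QA task
    else:
        raise NotImplementedError(f"Task not handled here: {data_args.task_name}")

    trainer = get_trainer(args)                        # build the QuestionAnsweringTrainer

    if training_args.do_train:                         # --do_train
        trainer.train()
    if training_args.do_eval:                          # --do_eval
        metrics = trainer.evaluate()
        print(metrics)

if __name__ == "__main__":
    main()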

(2) P-tuning-v2/tasks/qa/get_trainer.py

  • Obtain the config, tokenizer, model, the SQuAD dataset, and the QuestionAnsweringTrainer object trainer
  • Focus on how the model is obtained:
# fix_bert=True means the BERT parameters are not updated; the returned model type is BertPrefixForQuestionAnswering
model = get_model(model_args, TaskType.QUESTION_ANSWERING, config, fix_bert=True)
  • Focus on the concrete construction of QuestionAnsweringTrainer
trainer = QuestionAnsweringTrainer( # Build the trainer
    model=model, # model
    args=training_args, # training arguments
    train_dataset=dataset.train_dataset if training_args.do_train else None, # training set
    eval_dataset=dataset.eval_dataset if training_args.do_eval else None, # validation set (tokenized features)
    eval_examples=dataset.eval_examples if training_args.do_eval else None, # validation examples (raw, used for post-processing)
    tokenizer=tokenizer, # tokenizer
    data_collator=dataset.data_collator, # collates samples into batches
    post_process_function=dataset.post_processing_function, # converts raw predictions into final answers
    compute_metrics=dataset.compute_metrics, # computes the evaluation metrics
)

(3) P-tuning-v2/model/utils.py
When the P-Tuning v2 fine-tuning method is selected, get_model returns a BertPrefixForQuestionAnswering model, as shown below:

def get_model(model_args, task_type: TaskType, config: AutoConfig, fix_bert: bool = False):
    if model_args.prefix: # Training method 1: P-Tuning v2 (prefix=True)
        config.hidden_dropout_prob = model_args.hidden_dropout_prob # 0.1
        config.pre_seq_len = model_args.pre_seq_len # 128
        config.prefix_projection = model_args.prefix_projection # False
        config.prefix_hidden_size = model_args.prefix_hidden_size # 512
        # task_type is TaskType.QUESTION_ANSWERING, config.model_type is bert, so model_class is BertPrefixForQuestionAnswering
        model_class = PREFIX_MODELS[config.model_type][task_type]
        # model_args.model_name_or_path is bert-base-chinese, config is a BertConfig, revision is main
        model = model_class.from_pretrained(model_args.model_name_or_path, config=config, revision=model_args.model_revision,)
        # ... (the remaining fine-tuning branches are omitted; the model is returned at the end of the function)
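The fix_bert comment above means that the BERT backbone stays frozen and only the prefix parameters (plus the small QA head) are trained. Conceptually the freezing looks like the following generic sketch; it is not the repository's exact code, which may place this logic inside the model class or inside get_model:

# Generic sketch of freezing the BERT backbone so that only the prefix encoder
# (and the QA output head) receive gradient updates. Not the repository's exact code.
def freeze_bert_backbone(model):
    for name, param in model.named_parameters():
        if name.startswith("bert."):
            param.requires_grad = False   # freeze all pre-trained BERT weights

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable parameters: {trainable} / {total}")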

(4) P-tuning-v2/model/question_answering.py (key)
Mainly the BertPrefixForQuestionAnswering(BertPreTrainedModel) model structure, including the constructor, forward propagation, and obtaining the prefix information.
(5) P-tuning-v2/model/prefix_encoder.py (key)
The prefix encoder PrefixEncoder(config) is instantiated in the BertPrefixForQuestionAnswering(BertPreTrainedModel) constructor.
(6) P-tuning-v2/training/trainer_qa.py
The inheritance chain is QuestionAnsweringTrainer(ExponentialTrainer) -> ExponentialTrainer(BaseTrainer) -> BaseTrainer(Trainer) -> Trainer. The core training/evaluation flow is sketched below:
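A simplified sketch of how a QA trainer typically overrides evaluate() — run the prediction loop, post-process the start/end logits into text answers, then compute the SQuAD metrics. This follows the common HuggingFace question-answering trainer pattern and inherits directly from Trainer for brevity; the repository's class goes through ExponentialTrainer/BaseTrainer and may differ in details:

# Simplified sketch of a QuestionAnsweringTrainer.evaluate() override (not the repo's exact code).
from transformers import Trainer

class QuestionAnsweringTrainerSketch(Trainer):
    def __init__(self, *args, eval_examples=None, post_process_function=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.eval_examples = eval_examples                # raw examples, needed to rebuild text answers
        self.post_process_function = post_process_function

    def evaluate(self, eval_dataset=None, eval_examples=None, ignore_keys=None):
        eval_dataset = self.eval_dataset if eval_dataset is None else eval_dataset
        eval_examples = self.eval_examples if eval_examples is None else eval_examples
        eval_dataloader = self.get_eval_dataloader(eval_dataset)

        # Run the standard prediction loop without metrics first, because the raw
        # outputs are per-feature (start_logits, end_logits), not final answers.
        compute_metrics = self.compute_metrics
        self.compute_metrics = None
        try:
            output = self.evaluation_loop(eval_dataloader, description="Evaluation",
                                          ignore_keys=ignore_keys)
        finally:
            self.compute_metrics = compute_metrics

        # Map feature-level logits back to example-level text answers, then score them.
        eval_preds = self.post_process_function(eval_examples, eval_dataset, output.predictions)
        metrics = self.compute_metrics(eval_preds)
        self.log(metrics)
        return metrics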

3. P-tuning-v2/model/prefix_encoder.py implementation
The main function of this class is to encode the prefix. Ignoring the batch size, the encoded shape is (prefix_length, 2*layers*hidden). With prefix_length=128, layers=12 and hidden=768, the encoded shape is (128, 2*12*768), i.e. (128, 18432).

class PrefixEncoder(torch.nn.Module):
    def __init__(self, config):
        super().__init__()
        self.prefix_projection = config.prefix_projection # Whether to use MLP to project prefix
        if self.prefix_projection: # Use two layers of MLP to project prefix
            self.embedding = torch.nn.Embedding(config.pre_seq_len, config.hidden_size)
            self.trans = torch.nn.Sequential(
                torch.nn.Linear(config.hidden_size, config.prefix_hidden_size),
                torch.nn.Tanh(),
                torch.nn.Linear(config.prefix_hidden_size, config.num_hidden_layers * 2 * config.hidden_size)
            )
        else: # Use Embedding directly for encoding
            self.embedding = torch.nn.Embedding(config.pre_seq_len, config.num_hidden_layers * 2 * config.hidden_size)

    def forward(self, prefix: torch.Tensor):
        if self.prefix_projection: # Use MLP to project prefix
            prefix_tokens = self.embedding(prefix)
            past_key_values = self.trans(prefix_tokens)
        else: # Do not use MLP to project prefix
            past_key_values = self.embedding(prefix)
        return past_key_values
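A quick way to verify the shapes described above, using a hypothetical config object filled with the values from the command line (this is a sketch, not part of the repository):

# Minimal shape check for PrefixEncoder with BERT-base sizes and pre_seq_len=128.
import torch
from types import SimpleNamespace

config = SimpleNamespace(
    prefix_projection=False,      # use the plain Embedding branch
    pre_seq_len=128,              # --pre_seq_len 128
    hidden_size=768,              # BERT-base hidden size
    prefix_hidden_size=512,       # --prefix_hidden_size 512 (only used when projecting)
    num_hidden_layers=12,         # BERT-base layer count
)

encoder = PrefixEncoder(config)
prefix_tokens = torch.arange(config.pre_seq_len).long().unsqueeze(0)  # (1, 128)
past_key_values = encoder(prefix_tokens)
print(past_key_values.shape)      # torch.Size([1, 128, 18432]) == (batch, 128, 2*12*768)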

A question may arise here: why multiply by 2? Because the first half of past_key_values is concatenated with key_layer and the second half with value_layer, as shown below:

key_layer = torch.cat([past_key_value[0], key_layer], dim=2)
value_layer = torch.cat([past_key_value[1], value_layer], dim=2)

Note: this code is located in the forward() function of class BertSelfAttention(nn.Module) in transformers/models/bert/modeling_bert.py.

4. P-tuning-v2/model/question_answering.py
In simple terms, BertPrefixForQuestionAnswering adds a PrefixEncoder to BERT. The get_prompt function generates past_key_values, the encoded representation of the prefix, which is fed into the BERT model together with the input text sequence so that the model can better understand the question and produce the answer. The chosen SQuAD dataset is an extractive QA dataset, i.e. the start and end positions of the answer are found in the context based on the question.

class BertPrefixForQuestionAnswering(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.bert = BertModel(config, add_pooling_layer=False) # BERT backbone
        self.qa_outputs = torch.nn.Linear(config.hidden_size, config.num_labels) # Linear layer producing start/end logits
        self.pre_seq_len = config.pre_seq_len # prefix length, e.g. 128
        self.n_layer = config.num_hidden_layers # number of BERT layers, e.g. 12
        self.n_head = config.num_attention_heads # number of attention heads, e.g. 12
        self.n_embd = config.hidden_size // config.num_attention_heads # dimension per head, e.g. 64
        self.prefix_tokens = torch.arange(self.pre_seq_len).long() # prefix token indices [0, ..., pre_seq_len-1]
        self.prefix_encoder = PrefixEncoder(config) # prefix encoder
        self.dropout = torch.nn.Dropout(config.hidden_dropout_prob)

    def get_prompt(self, batch_size): # Generate the prefix encoding, i.e. the key and value values, from the prefix tokens
        prefix_tokens = self.prefix_tokens.unsqueeze(0).expand(batch_size, -1).to(self.bert.device)
        past_key_values = self.prefix_encoder(prefix_tokens)
        past_key_values = past_key_values.view(
            batch_size, # batch_size
            self.pre_seq_len, # pre_seq_len
            self.n_layer * 2, # n_layer is the number of BERT layers
            self.n_head, # n_head is the number of attention heads
            self.n_embd # n_embd is the dimension of each head
        )
        past_key_values = self.dropout(past_key_values)
        # Rearrange to (n_layer*2, batch, n_head, pre_seq_len, n_embd) and split into
        # one (2, batch, n_head, pre_seq_len, n_embd) tensor per layer
        past_key_values = past_key_values.permute([2, 0, 3, 1, 4]).split(2)
        return past_key_values

    def forward(self, ..., return_dict=None):
        past_key_values = self.get_prompt(batch_size=batch_size) # Get the prefix key/value pairs
        prefix_attention_mask = torch.ones(batch_size, self.pre_seq_len).to(self.bert.device)
        attention_mask = torch.cat((prefix_attention_mask, attention_mask), dim=1)
        outputs = self.bert(
            ...
            past_key_values=past_key_values,
        )
        return QuestionAnsweringModelOutput( # Return the model output: loss, start/end logits, hidden states and attentions
            loss=total_loss,
            start_logits=start_logits,
            end_logits=end_logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

The key point is outputs = self.bert(past_key_values=past_key_values), which passes past_key_values into the BERT model; it takes effect in the forward() function of class BertSelfAttention(nn.Module) in transformers/models/bert/modeling_bert.py. Next, look at the past_key_values data structure, as shown below:
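Based on the permute/split step in get_prompt above, past_key_values is a tuple with one entry per BERT layer. A small shape check, assuming `model` is an instance of BertPrefixForQuestionAnswering built with the configuration used in this article (a sketch, not repository code):

# Shape check for the past_key_values returned by get_prompt
# (batch_size=2, pre_seq_len=128, 12 layers, 12 heads, head dim 64).
past_key_values = model.get_prompt(batch_size=2)
print(len(past_key_values))        # 12, one entry per BERT layer
print(past_key_values[0].shape)    # torch.Size([2, 2, 12, 128, 64])
                                   # (key/value, batch, n_head, pre_seq_len, n_embd)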

5. BertSelfAttention implementation
The BERT network structure is given in Attachment 1. past_key_values is mainly concatenated with the key and value in the BertSelfAttention part, as shown below:

(self): BertSelfAttention(
  (query): Linear(in_features=768, out_features=768, bias=True)
  (key): Linear(in_features=768, out_features=768, bias=True)
  (value): Linear(in_features=768, out_features=768, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
)

The concatenation of past_key_values with key and value is implemented roughly as follows:
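A simplified sketch of the relevant logic in BertSelfAttention.forward() (paraphrased from the transformers source; head reshaping, attention dropout, head_mask, caching and other details are omitted):

# Simplified sketch of how BertSelfAttention combines the prefix with the regular keys/values.
import math
import torch

def attention_with_prefix(query_layer, key_layer, value_layer, past_key_value, attention_mask):
    # query/key/value_layer: (batch, n_head, seq_len, head_dim)
    # past_key_value:        (2, batch, n_head, pre_seq_len, head_dim)
    # attention_mask:        additive mask of shape (batch, 1, 1, pre_seq_len + seq_len)
    key_layer = torch.cat([past_key_value[0], key_layer], dim=2)      # prepend prefix keys
    value_layer = torch.cat([past_key_value[1], value_layer], dim=2)  # prepend prefix values

    head_dim = query_layer.size(-1)
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2)) / math.sqrt(head_dim)
    attention_scores = attention_scores + attention_mask              # mask already covers the prefix positions
    attention_probs = torch.softmax(attention_scores, dim=-1)

    # (batch, n_head, seq_len, head_dim): same sequence length as the input queries,
    # so the prefix does not appear in the output sequence.
    return torch.matmul(attention_probs, value_layer)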

After passing through the BertSelfAttention part, the output has the same shape as the original input, i.e. the output sequence does not contain the prefix positions; the prefix only extends the keys and values that each token attends to.

Attachment 1: BERT network structure
Print out the BERT model structure, as shown below:

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(21128, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) # LayerNorm of the embeddings layer
    (dropout): Dropout(p=0.1, inplace=False) # Dropout of the embeddings layer
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer( #BertLayer includes BertAttention, BertIntermediate and BertOutput
        (attention): BertAttention( #BertAttention includes BertSelfAttention and BertSelfOutput
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)

The BERT model's related class structure is in the file D:\Python310\Lib\site-packages\transformers\models\bert\modeling_bert.py, as shown below:

Attachment 2: SQuAD Dataset
SQuAD is a machine reading comprehension question answering dataset released by Stanford University. The answer to each question is a span of text from the corresponding reading passage, i.e. the data takes the form (question, passage, answer). SQuAD 1.1 contains 107,785 questions over 536 articles. In addition to SQuAD 1.1, a harder version, SQuAD 2.0 ("Know What You Don't Know: Unanswerable Questions for SQuAD", ACL 2018), has also been released.
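For reference, the dataset can also be loaded directly with the HuggingFace datasets library (see [7]); a minimal example:

# Minimal example of loading SQuAD 1.1 via the HuggingFace datasets library.
from datasets import load_dataset

squad = load_dataset("squad")
print(squad)                       # DatasetDict with 'train' and 'validation' splits
print(squad["train"][0].keys())    # dict_keys(['id', 'title', 'context', 'question', 'answers'])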
(1) Training set data

(2) Validation set data

(3) Loading the SQuAD dataset

"""
Execute the script: python3 dataset_test.py --model_name_or_path L:/20230713_HuggingFaceModel/20231004_BERT/bert-base-chinese --task_name qa --dataset_name squad --do_train --do_eval --max_seq_length 128 --per_device_train_batch_size 2 --learning_rate 5e-3 --num_train_epochs 10 --pre_seq_len 128 --output_dir checkpoints/SQuAD-bert --overwrite_output_dir --hidden_dropout_prob 0.1 --seed 11 --save_strategy no --evaluation_strategy epoch --prefix
"""
from transformers import AutoTokenizer, HfArgumentParser, TrainingArguments

from arguments import get_args, ModelArguments, DataTrainingArguments, QuestionAnwseringArguments
from tasks.qa.dataset import SQuAD

if __name__ == '__main__':
    args = get_args() # Get arguments from the command line

    model_args, data_args, training_args, qa_args = args # model-related, data-related, training-related and QA-related parameters respectively
    tokenizer = AutoTokenizer.from_pretrained( # Read tokenizer
            model_args.model_name_or_path, # model name
            revision=model_args.model_revision, # Model version
            use_fast=True, # Whether to use fast tokenizer
        )
    dataset = SQuAD(tokenizer, data_args, training_args, qa_args)
    print(dataset)

Set a breakpoint and inspect the dataset data structure, which contains the following fields:

  • input_ids: the indices of the subword tokens produced by the tokenizer (see the example after this list)
  • attention_mask: marks which positions are real tokens and which are padding during self-attention; padding positions are set to 0
  • token_type_ids: marks which sentence each subword belongs to (first sentence / second sentence / padding)
  • position_ids: the position index of each token within the sequence
  • head_mask: used to disable selected attention heads in selected layers
  • inputs_embeds: if provided, input_ids are not needed; the embeddings are fed into the encoder directly, bypassing the embedding lookup
  • encoder_hidden_states: used when BertModel is configured as a decoder; cross-attention is performed instead of self-attention
  • encoder_attention_mask: likewise used in cross-attention, to mark padding on the encoder side
  • past_key_values: used by P-Tuning v2 to concatenate the prefix encoding with the key and value of each layer of the pre-trained model
  • use_cache: cache the key/value states and return them to speed up decoding
  • output_attentions: whether to return the attention weights of each intermediate layer
  • output_hidden_states: whether to return the hidden states of each intermediate layer
  • return_dict: whether to return the output as a key-value (ModelOutput) object; defaults to True
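A small example of what the tokenizer produces for a question-context pair (a sketch; the checkpoint name stands in for the local path used above, and the English sentences are placeholders):

# Sketch: tokenize a (question, context) pair the way the QA dataset class does,
# producing input_ids, token_type_ids and attention_mask.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese", use_fast=True)
encoded = tokenizer(
    "Who released SQuAD?",                        # question (first sentence)
    "SQuAD was released by Stanford University.", # context (second sentence)
    max_length=128,
    truncation="only_second",   # truncate only the context if the pair is too long
    padding="max_length",
)
print(list(encoded.keys()))     # ['input_ids', 'token_type_ids', 'attention_mask']
print(encoded["input_ids"][:10])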

I feel that there are still many knowledge points in P-Tuning v2 that have not been explained clearly here; they will have to be covered one by one later. Even a single P-Tuning v2 repository touches a lot of prerequisites: you need to be very familiar with the standard Transformer and BERT network structures, with the various tasks and their datasets, and with BERT variants; familiar with the training, validation and testing workflows of PyTorch and the Transformers library; and familiar with the Prompt series of fine-tuning methods. In short, you need to know all kinds of modified Transformers and BERT models.

References:
[1] P-Tuning paper: https://arxiv.org/pdf/2103.10385.pdf
[2] P-Tuning code: https://github.com/THUDM/P-tuning
[3] P-Tuning v2 paper: https://arxiv.org/pdf/2110.07602.pdf
[4] P-Tuning v2 code: https://github.com/THUDM/P-tuning-v2
[5] Detailed explanation of BertLayer and Self-Attention: https://zhuanlan.zhihu.com/p/552062991
[6] SQuAD explorer: https://rajpurkar.github.io/SQuAD-explorer/
[7] SQuAD on HuggingFace: https://huggingface.co/datasets/squad