Frontier Heavy Weapon [33] | Trying out a simple prompt

Frontier Heavy Weapon

This column mainly shares papers and talks from the big companies and top conferences, distils the key points from them, and keeps track of cutting-edge technology together with you. Specific introduction: the Cangjie special project, "planes and cannons I can handle, and sharp tools and techniques I still have". (Counting it up, the special project launched back in 2020!)

The 2022 collection of articles comes to a cumulative 600,000 words, available here: CS's Shabby Room, 600,000 words of original algorithm experience sharing, 2022 edition.

Previous articles in this series

  • Frontier Heavy Weapon [28] | How is cutting-edge vector recall done?

  • Frontier Heavy Weapon [29] | ERNIE-Search: a representative work on representation-based semantic matching that learns from interaction models

  • Frontier Heavy Weapon [30] | A chat about a survey: applications of pre-trained models in information retrieval

  • Frontier Heavy Weapon [31] | A rational look at ChatGPT

  • Frontier Heavy Weapon [32] | Out-of-domain intent detection: tackling the "unseen" problem

Prompting can hardly be called new at this point, and most people's attention has already moved on to follow-up research. A while ago, though, I wanted to expand my toolbox, so I decided to try out what prompting can actually do.

Lazy person's table of contents:

  • Let's talk about the principle first

  • Code

  • Code details

  • Some meaningful conclusions from the experiment

  • Summary

  • Reference articles

Let’s talk about the principle first

The so-called prompt, to put it simply and loosely, reframes a traditional NLP problem as the kind of "cloze" exercise we all did at school, and then uses the MLM task to predict the answer. Take text classification as an example, say binary classification of positive versus negative reviews. The conventional approach feeds the sentence into the model and has it predict positive or negative. With a prompt, we instead add some extra content to the sentence and have the model fill in a blank; what it fills in tells us whether the review is positive or negative.

For example, take the sentence "I think this product is the best one I have bought". We add a supplement and turn it into "Sentence: I think this product is the best one I have bought. This is a [MASK] review". Now we only need to compare the probability of "good" against the probability of "poor" at the [MASK] position to read off the final result.
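To make the idea concrete, here is a minimal zero-shot sketch of that cloze comparison. It is my own illustration rather than the code used in this article (the real code below uses bert-base-chinese); it runs bert-base-uncased with the English label words "good" and "poor" purely for demonstration.

import torch
from transformers import BertTokenizer, BertForMaskedLM

tok = BertTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")

text = "I think this product is the best one I have bought. This is a [MASK] review."
enc = tok(text, return_tensors="pt")
mask_idx = (enc["input_ids"][0] == tok.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = mlm(**enc).logits[0, mask_idx]  # scores over the whole vocabulary at [MASK]

good_id = tok.convert_tokens_to_ids("good")
poor_id = tok.convert_tokens_to_ids("poor")
print("positive" if logits[good_id] > logits[poor_id] else "negative")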

Doesn't the principle sound super simple? The full code for my experiment follows below.

Code

First come some fairly conventional preparations: a few hyperparameters and loading the pre-trained model.

# imports needed by the snippets below
import numpy as np
import torch
import torch.nn as nn
from torch.optim import AdamW
from transformers import BertTokenizer, BertConfig, get_cosine_schedule_with_warmup

# hyperparameters
hidden_dropout_prob = 0.3
num_labels = 2
learning_rate = 1e-5
weight_decay = 1e-2
epochs = 15
batch_size = 16
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
prefix = "This is too [MASK]," # prompt template; to keep things simple I use a prefix form, but a suffix would work just as well
maskpos = 4 # index of the [MASK] token in the tokenized input (position 0 is [CLS])

# pre-trained model path
ptm_path = "./data/ptms/bert-base-chinese/"
vocab_file = ptm_path + "vocab.txt" # vocabulary
tokenizer = BertTokenizer(vocab_file)
config = BertConfig.from_pretrained(ptm_path + "config.json")
model = Bert_Model(bert_path=ptm_path + "pytorch_model.bin", config_file=config).to(device)

# Look up the ids of the label words that stand for the positive and negative classes in advance,
# so their scores can be read off the [MASK] position later.
pos_id = tokenizer.convert_tokens_to_ids('stick')
neg_id = tokenizer.convert_tokens_to_ids('poor')

# It's just an experiment, so the training configuration is deliberately plain.
loss_func = nn.CrossEntropyLoss(ignore_index=-1)
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=1e-4) # AdamW optimizer
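One extra sanity check is worth doing here (my own addition, not part of the original script): prompt_dataset below overwrites position maskpos by index rather than searching for the mask token, so the prefix must place [MASK] at exactly that index once [CLS] is prepended.

# hypothetical check: the printed index should equal maskpos
ids = tokenizer.encode(prefix + "an example sentence", add_special_tokens=True)
print(ids.index(tokenizer.mask_token_id))  # position 0 is [CLS]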

The model uses the structure below. It wraps BertForMaskedLM, the standard MLM head, which I expect everyone is familiar with. The transformers library ships ready-made heads for many task types, and the basic structures are very convenient to use.

from transformers import BertForMaskedLM

class Bert_Model(nn.Module):
    def __init__(self, bert_path, config_file):
        super(Bert_Model, self).__init__()
        print(bert_path)
        self.bert = BertForMaskedLM.from_pretrained(bert_path, config=config_file) # load the pre-trained MLM weights

    def forward(self, input_ids, attention_mask, token_type_ids):
        outputs = self.bert(input_ids, attention_mask, token_type_ids)
        logit = outputs.logits # MLM logits over the vocabulary: [bs, seq_len, vocab_size]

        return logit

Building the dataset is actually a fair chunk of the work. I do the prompt splicing in this step; the concrete logic sits in the prompt_dataset function shown a bit further down.

# build the training arrays
Inputid, Labelid, sid, atid = prompt_dataset(x_train, y_train, prefix, tokenizer, maskpos)
Inputid = np.array(Inputid)
Labelid = np.array(Labelid)
sid = np.array(sid)
atid = np.array(atid)
# being lazy here: the validation set is simply the training set
input_ids_train, input_ids_valid = Inputid, Inputid
input_masks_train, input_masks_valid = atid, atid
input_types_train, input_types_valid = sid, sid
label_train, y_valid = Labelid, Labelid
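The article does not show how these arrays get batched. As a minimal sketch (my own assumption about the glue code), they can be wrapped into a DataLoader that yields the (ids, att, tpe, y) tuples the train function below iterates over:

from torch.utils.data import TensorDataset, DataLoader

def to_tensor(arr):
    # numpy arrays of ids / masks / labels -> long tensors
    return torch.from_numpy(np.array(arr)).long()

train_loader = DataLoader(
    TensorDataset(to_tensor(input_ids_train), to_tensor(input_masks_train),
                  to_tensor(input_types_train), to_tensor(label_train)),
    batch_size=batch_size, shuffle=True)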

Let's look at how prompt_dataset is written. The only special part is really text_ = prefix + x[i], which splices the prefix and the sentence together; everything after that is a fairly conventional conversion.

def prompt_dataset(x, y, prefix, tokenizer, maskpos):
    # x: raw input sentences, y: labels (0/1), prefix: prompt template, tokenizer: BERT tokenizer, maskpos: index of [MASK]
    Inputid = []
    Labelid = []
    sid = []
    atid = []
    for i in range(len(x)):
        text_ = prefix + x[i] # splice the prompt prefix and the sentence together
        encode_dict = tokenizer.encode_plus(text_, max_length=200, padding='max_length', truncation=True, add_special_tokens=True)

        id = encode_dict["input_ids"]
        segmentid = encode_dict["token_type_ids"]
        attid = encode_dict["attention_mask"]
        labelid, inputid = id[:], id[:]
        # the label sequence keeps the label word only at maskpos and is -1 everywhere else,
        # so CrossEntropyLoss(ignore_index=-1) scores nothing but the [MASK] position;
        # the input sequence gets the actual [MASK] token there
        if y[i] == 0:
            labelid[maskpos] = neg_id
        else:
            labelid[maskpos] = pos_id
        labelid[: maskpos] = [-1] * len(labelid[: maskpos])
        labelid[maskpos + 1:] = [-1] * len(labelid[maskpos + 1:])
        inputid[maskpos] = tokenizer.mask_token_id
        Labelid.append(labelid)
        Inputid.append(inputid)
        sid.append(segmentid)
        atid.append(attid)

    return Inputid, Labelid, sid, atid
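To see what the function produces, here is a tiny illustration on a made-up example (hypothetical input, my own addition):

Inputid_demo, Labelid_demo, _, _ = prompt_dataset(["great product, works perfectly"], [1], prefix, tokenizer, maskpos)
# Inputid_demo[0][maskpos] == tokenizer.mask_token_id  -> the blank the model has to fill
# Labelid_demo[0][maskpos] == pos_id                   -> the positive label word
# every other position of Labelid_demo[0] is -1 and is ignored by the loss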

For convenience I also wrote a function that predicts a single case, which is very handy for quick self-tests. It reads the scores at the [MASK] position and compares the positive and negative label words:

def pred_single(model, data_info, maskpos, pos_id, neg_id):
    ids, att, tpe = list2cuda(data_info["Inputid"]), list2cuda(data_info["atid"]), list2cuda(data_info["sid"])
    out = model(ids, att, tpe)
    tout_train_mask = out[:, maskpos, :] # logits over the whole vocabulary at the [MASK] position
    pos_score = tout_train_mask[:, pos_id].cpu().detach().numpy().tolist() # score of the positive label word
    neg_score = tout_train_mask[:, neg_id].cpu().detach().numpy().tolist() # score of the negative label word
    pred = cal_pred(pos_score, neg_score)
    return pred

def list2cuda(data):
    return torch.from_numpy(np.array(data)).long().to(device)

def cal_pred(pos_score, neg_score):
    # compare the positive and negative scores per example and take the higher one
    pred = []
    for idx in range(len(pos_score)):
        if pos_score[idx] >= neg_score[idx]:
            pred.append(1)
        else:
            pred.append(0)
    return pred
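A quick way to exercise it (my sketch, with a hypothetical sentence): build the inputs with prompt_dataset and pass them straight in.

x_demo, y_demo = ["this one broke after two days"], [0]  # the label only shapes the tensors, it is not used for prediction
Inputid_d, _, sid_d, atid_d = prompt_dataset(x_demo, y_demo, prefix, tokenizer, maskpos)
data_info = {"Inputid": Inputid_d, "sid": sid_d, "atid": atid_d}
model.eval()
with torch.no_grad():
    print(pred_single(model, data_info, maskpos, pos_id, neg_id))  # [1] = positive, [0] = negative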

Then comes the more critical training code. Before starting, let me say that training is not something to rush into: you can try the effect directly, with no supervision at all. In fact, as long as you design a decent prompt, my experiments can reach around 80% this way. That is not a high number in absolute terms, but with no supervision and little data it is already a very strong baseline, well worth absorbing.

Now for the highlight: training. The training part is actually quite simple. The loss function is cross entropy (defined above), and we want the label word corresponding to the correct category to get as high a probability as possible at the [MASK] position of the sentence. That is the whole training signal; the rest should be clear from the code:

def train(model, epoch, optimizer, dataset, device, loss_func):
    starttime_train = datetime.now()
    start = time.time()
    correct = 0
    train_loss_sum = 0.0
    model.train()
    schedule = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=len(dataset), num_training_steps=epoch*len(dataset))
    logger.info("***** Running training epoch {} *****".format(epoch + 1))
    for idx, (ids, att, tpe, y) in enumerate(tqdm(dataset)):
        ids, att, tpe, y = ids.to(device), att.to(device), tpe.to(device), y.to(device)
        out_train = model(ids, att, tpe)
        # flatten logits to [batch*seq_len, vocab_size] and labels to [batch*seq_len];
        # 21128 is the vocab size of bert-base-chinese
        loss = loss_func(out_train.view(-1, 21128), y.view(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        schedule.step()
        train_loss_sum += loss.item()

        if (idx + 1) % 100 == 0:
            logger.info("Epoch {:04d} | Step {:06d}/{:06d} | Loss {:.4f} | Time {:.0f}".format(
                epoch + 1, idx + 1, len(dataset), train_loss_sum / (idx + 1), time.time() - start))

        # accuracy is measured only at the [MASK] position
        truelabel = y[:, maskpos]
        out_train_mask = out_train[:, maskpos, :]

        predicted = torch.max(out_train_mask.data, 1)[1]
        correct += (predicted == truelabel).sum()
        correct = float(correct)
    acc = float(correct / len(label_train))

If you look at the code, you will find it is very routine: it is just the standard MLM training process, pushing the token predicted at the [MASK] position as close as possible to the label word.
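For completeness, the outer loop that drives this function is not shown in the article; with the names used above (and the train_loader sketch from earlier), it would presumably look something like this:

for epoch in range(epochs):
    train(model, epoch, optimizer, train_loader, device, loss_func)
# or skip the loop entirely and go zero-shot with pred_single, as discussed above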

Code details

Writing this code actually involved quite a few twists and turns, and there are several details worth paying attention to.

  • The common model structures that transformers encapsulates (classification, sentence-pair matching, etc.) are all worth being familiar with, including the BertForMaskedLM used here; the documentation is the quickest way to learn them.

  • I am still not as fluent as I would like with conversions between data types and devices, i.e. chains like tout_train_mask[:, pos_id].cpu().detach().numpy().tolist().

  • Also note that this kind of script is just for getting something running, a toy; the coding style is not worth imitating, so treat it purely as a reference for the technical approach.

  • Training is not strictly necessary. You can predict directly without any training, and a good prompt already gives a good baseline.

Some meaningful conclusions from the experiment

I will not paste the raw experimental results here; instead, here are the meaningful findings, listed directly as a reference for your own experiments:

  • Without any training, trying a few different prompts can already yield a decent result, which means we can get a reasonable baseline even in a difficult setting. (The upper limit in my experiments was an F1 of around 80%.)

  • Without training, different prompts produce wildly different results; the lower end can drop to 55%. So if you are not going to train, it pays to spend time on prompt design.

  • With training, perhaps because my data is relatively simple, the gap between the prompt approach and other models such as bert-cls is not large; there was no significant improvement, and this clearly depends on the data.

  • With training, different prompts still affect the final result, but not by much; it is fine to tweak the prompt when wrapping up, so don't sink too much time into it early on.

  • Surprisingly, when the training set is forcibly compressed (I cut it down to 100 examples), the prompt scheme can still reach a level close to training on the full data (of course, with less data the number of epochs goes up a lot, though convergence is actually quite fast), roughly at the level of a poorly chosen prompt.

  • Reaching this level in the few-shot setting of the compressed dataset is something classic methods such as bert-cls and textcnn cannot do, so this scheme is well worth trying on small datasets.

  • The effect above relies on a large pre-trained model; relatively small models such as CNNs do not seem to show it, which is worth keeping in mind.

Summary

This attempt has refreshed my toolbox. It is a relatively new solution with a new adaptation scenario: in the few-shot setting it delivers very surprising results, which is arguably its biggest selling point, and I recommend everyone give it a try. Beyond classification, tasks such as NER can also be attempted this way.

Reference articles

  • Su Jianlin: The once-disliked pre-training task NSP delivers excellent zero-shot results, https://spaces.ac.cn/archives/8671

  • Xie Liyang: A preliminary exploration of prompts, https://zhuanlan.zhihu.com/p/464562384

  • Xie Liyang: An engineering attempt at a Chinese classification task with prompts, https://zhuanlan.zhihu.com/p/464684532
