Introduction to Transformer – Bert

Bert

BERT, short for “Bidirectional Encoder Representations from Transformers”, is a method of pre-training language representations: a general “language understanding” model is trained on a large text corpus (such as Wikipedia), and that model is then used for the downstream NLP tasks we care about (such as question answering). BERT outperforms previous traditional NLP methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP.

BERT is pre-trained on unlabeled text. To see the difference between unlabeled and labeled data in NLP corpora, consider a text classification task:

Unlabeled data: large collections of text from Wikipedia that have not been labeled or annotated by humans. Such data can be used to pre-train language models such as BERT and GPT so that they learn general language representations.

Labeled data: for example, the IMDB movie review dataset, where each review is labeled with positive or negative sentiment. Such data can be used to train a text classifier that automatically classifies the sentiment of new reviews.

BERT’s unsupervised pre-training method is important because it lets us train on large-scale text corpora without collecting large amounts of labeled data for each specific NLP task. Since there is a vast amount of unlabeled text on the web, BERT can use this data to train the model and improve its performance on a variety of tasks, while also improving the model’s ability to understand and represent language. The resulting general “language understanding” model can then be applied to many NLP tasks, such as question answering, text classification, and entity recognition, without training a new model from scratch for each task.

BERT’s main pre-training task is to predict missing words in a text sequence, so the model only needs to encode the input sequence and never decodes one. For this reason, BERT uses only the Encoder part of the Transformer.

Pre-training BERT

Pre-training BERT (Bidirectional Encoder Representations from Transformers) is based on large-scale unlabeled text data. The goal is to train a general language model that understands the semantic information of words in context.

In BERT’s pre-training process, two training objectives are used: Masked LM and Next Sentence Prediction.

(1) Masked LM (MLM)

Masked Language Modeling (MLM) is one of BERT’s pre-training tasks. Certain words in the input text are randomly replaced with the special [MASK] token, and the model must infer the replaced words from the surrounding context, thereby learning the contextual relationships between words.

Specifically, each word in the input text is selected with a certain probability (such as 15%) and replaced with the [MASK] token (in the original BERT, a selected token is usually replaced with [MASK], but is sometimes replaced with a random word or left unchanged), and the masked text is then fed into the model. During prediction, the model must guess the words hidden behind the [MASK] tokens from the context. This prediction task pushes the model to learn contextual information about words, which improves performance on downstream tasks.

For example, given the input text “I went to the [MASK] to buy some apples”, the model must guess from the context that the masked word could be “store”, “market”, “shop”, and so on. Through this kind of prediction, the model learns the meaning and usage of these words in different contexts, which improves performance on downstream tasks (such as text classification, sentiment analysis, and question answering). By contrast, modeling in a context-free manner encodes each word independently into a fixed vector, without considering the word’s position and surrounding context in the sentence; MLM is therefore crucial because it forces the model to learn contextual information.
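
As a rough sketch of this masking step, the snippet below replaces roughly 15% of the tokens in a tokenized sentence with [MASK]. This is a simplification of BERT’s actual scheme (which sometimes substitutes a random word or keeps the token unchanged); the tokenizer is the standard bert-base-uncased checkpoint used later in this article.

import random
import transformers

# Simplified MLM-style masking: each token has a 15% chance of being masked,
# and the original token is kept as the prediction target.
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize("I went to the store to buy some apples")

masked_tokens = []
targets = []
for token in tokens:
    if random.random() < 0.15:
        masked_tokens.append('[MASK]')
        targets.append(token)      # the model is trained to predict this token
    else:
        masked_tokens.append(token)
        targets.append(None)       # positions that are not predicted

print(masked_tokens)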

(2) Next Sentence Prediction (NSP)

Next Sentence Prediction (NSP) is the other pre-training task in BERT and is used to teach the model the relationship between sentences. Its goal is to judge whether two sentences are adjacent, that is, whether one sentence immediately follows the other.

During NSP training, half of the input sentence pairs are adjacent sentences and the other half pair a sentence with a randomly selected, non-adjacent sentence; the model must classify each pair correctly. This task mainly helps the model learn better semantic representations, especially for tasks that require understanding the relationship between multiple sentences, such as question answering and textual reasoning.
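
A minimal sketch of how such 50/50 training pairs could be built from a list of consecutive sentences is shown below; the toy corpus and helper function are illustrative and not part of BERT’s actual data pipeline.

import random

# Toy corpus of consecutive sentences (placeholder data for illustration)
corpus = [
    "The cat sat on the mat.",
    "It looked very comfortable.",
    "BERT is a pre-trained language model.",
    "It uses only the Transformer encoder.",
]

def make_nsp_pair(corpus):
    """Return (sentence_a, sentence_b, label): 1 = adjacent (IsNext), 0 = random (NotNext)."""
    idx = random.randrange(len(corpus) - 1)
    sentence_a = corpus[idx]
    if random.random() < 0.5:
        return sentence_a, corpus[idx + 1], 1       # the actual next sentence
    rand_idx = random.randrange(len(corpus))
    while rand_idx in (idx, idx + 1):               # avoid sentence_a and its true successor
        rand_idx = random.randrange(len(corpus))
    return sentence_a, corpus[rand_idx], 0          # a randomly chosen, non-adjacent sentence

print(make_nsp_pair(corpus))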

The input text is split into multiple tokens, and each token is mapped to a vector representation called a token embedding.

In addition to token embeddings, BERT has two other embeddings, namely segment (sentence) embeddings and positional embeddings.

The segment embedding indicates which sentence each token belongs to. For an input sentence pair (such as a question and answer in a question-answering scenario), BERT inserts the special token “[SEP]” between the two sentences and adds another special token, “[CLS]”, at the beginning of the whole input sequence; every token in the first sentence receives the sentence-A segment embedding and every token in the second sentence receives the sentence-B segment embedding. The output vector corresponding to the first token ([CLS]) is then used as the representation of the entire sequence, for example for classification.
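
To make the [CLS] and [SEP] markers and the per-sentence segment ids concrete, a sentence pair can be run through the Hugging Face BERT tokenizer (a small sketch; the example sentences are arbitrary):

import transformers

tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')

# Encode a sentence pair; the tokenizer inserts [CLS] and [SEP] automatically
encoded = tokenizer("Where can I buy apples?", "You can buy them at the store.")

# The token sequence starts with [CLS] and the two sentences are separated by [SEP]
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
# token_type_ids mark the segment: 0 for the first sentence, 1 for the second
print(encoded['token_type_ids'])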

Positional embeddings indicate the position of each token in the sentence. Since the Transformer itself does not preserve the order of the input tokens, position information must be added to the token embeddings so that the model can capture where each token appears. BERT uses learned absolute position embeddings: each position index is mapped to a vector, which is added to the token embedding.

In BERT, these three embeddings are summed element-wise and the result is fed into the Transformer encoder. The resulting representation therefore combines the token identity, the sentence it belongs to, and its position.
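
The sketch below illustrates this combination with plain nn.Embedding layers. The sizes mirror bert-base, but the token ids are arbitrary; it is a simplified stand-in for the embedding layer inside transformers.BertModel, which additionally applies LayerNorm and dropout.

import torch
import torch.nn as nn

# Toy sizes chosen to mirror bert-base (30522 wordpieces, 512 positions, 768 hidden units)
vocab_size, max_len, num_segments, hidden_size = 30522, 512, 2, 768

token_embed = nn.Embedding(vocab_size, hidden_size)
segment_embed = nn.Embedding(num_segments, hidden_size)    # sentence A vs. sentence B
position_embed = nn.Embedding(max_len, hidden_size)        # learned absolute positions

input_ids = torch.tensor([[101, 1045, 2293, 18750, 102]])  # illustrative token ids
segment_ids = torch.zeros_like(input_ids)                  # all tokens belong to sentence A
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)

# The three embeddings are summed element-wise (not concatenated) before the encoder
embeddings = token_embed(input_ids) + segment_embed(segment_ids) + position_embed(position_ids)
print(embeddings.shape)   # torch.Size([1, 5, 768])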

By pre-training the model with the MLM and NSP tasks, BERT learns richer language representations, which can then be fine-tuned for various NLP tasks.

Fine-tuning BERT

Fine-tuning BERT refers to taking a pre-trained BERT model and further adjusting it for a specific task. This can be understood as a light additional training step on top of BERT that makes it better suited to a specific natural language processing (NLP) task.

The main steps of Fine-tuning BERT are as follows:

  1. Prepare the datasets: according to the specific NLP task, prepare the corresponding datasets, including training, validation, and test sets.
  2. Define the task: according to the task type, choose an appropriate BERT model and fine-tuning strategy. For classification tasks, the [CLS] vector can be used to represent the whole sentence, with a fully connected layer added on top to predict the label. For sequence labeling tasks, a sequence labeling layer can be added on top of BERT.
  3. Fine-tune: feed the prepared dataset into the BERT model. During fine-tuning, the parameters of the BERT model are adjusted to fit the specific NLP task, usually by optimizing the model with backpropagation.
  4. Evaluate the model: use the validation set to evaluate performance after fine-tuning, and adjust the fine-tuning strategy or the hyperparameters of the BERT model accordingly. Finally, evaluate the model on the test set (a minimal validation sketch is shown after this list).
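
As a rough sketch of the evaluation step, a simple validation pass might look like the following. It assumes a hypothetical `val_dataloader` that yields `(input_ids, attention_mask, labels)` batches like the training loader in the code section below, and a model that returns class logits.

import torch

def evaluate(model, val_dataloader, device):
    """Compute accuracy on a validation set (sketch; val_dataloader is assumed to
    yield (input_ids, attention_mask, labels) batches)."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for input_ids, attention_mask, labels in val_dataloader:
            logits = model(input_ids.to(device), attention_mask.to(device))
            preds = torch.argmax(logits, dim=1)
            correct += (preds == labels.to(device)).sum().item()
            total += labels.size(0)
    return correct / total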

Note that fine-tuning BERT requires considerable computing resources and time, because the BERT model has a large number of parameters and a complex structure. The final performance also depends on factors such as the complexity of the task, the quality of the dataset, and the choice of model.

Code Implementation

The following sample code implements a BERT-based classifier with PyTorch and the Hugging Face transformers library, including model definition, data preprocessing, model training, and inference:

import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import transformers

# 1. Define the Bert model
class BertModel(nn.Module):
    def __init__(self, bert_config, num_classes):
        super(BertModel, self).__init__()
        # Load the pre-trained encoder weights instead of a randomly initialized model
        self.bert = transformers.BertModel.from_pretrained('bert-base-uncased', config=bert_config)
        self.dropout = nn.Dropout(bert_config.hidden_dropout_prob)
        self.fc = nn.Linear(bert_config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        output = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        output = output[1]  # pooled output corresponding to the [CLS] token
        output = self.dropout(output)
        output = self.fc(output)
        return output

# 2. Data preprocessing
# Dataset contains input sequences and corresponding labels
inputs = ["I love Python programming.", "Python is a high-level programming language."]
labels = [0, 1] # binary labels for the two example sentences: 0 = not about Python programming, 1 = about Python programming

tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')
max_seq_length = 64 # input sequence maximum length
inputs_ids = []
attention_masks = []

for input_text in inputs:
    # Convert text to ids and attention mask
    encoded_dict = tokenizer.encode_plus(
        input_text,
        add_special_tokens=True,
        max_length=max_seq_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt'
    )
    inputs_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])

inputs_ids = torch.cat(inputs_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)

# 3. Define hyperparameters and optimizer
num_classes = 2
learning_rate = 2e-5
num_epochs = 3
batch_size = 2

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = BertModel(transformers.BertConfig.from_pretrained('bert-base-uncased'), num_classes).to(device)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# 4. Define the data loader
dataset = data.TensorDataset(inputs_ids, attention_masks, labels)
dataloader = data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

# 5. Train the model
model.train()
for epoch in range(num_epochs):
    for i, (input_ids_batch, attention_masks_batch, labels_batch) in enumerate(dataloader):
        input_ids_batch = input_ids_batch.to(device)
        attention_masks_batch = attention_masks_batch.to(device)
        labels_batch = labels_batch.to(device)

        optimizer.zero_grad()

        outputs = model(input_ids_batch, attention_masks_batch)
        loss = nn.CrossEntropyLoss()(outputs, labels_batch)
        loss.backward()

        optimizer.step()

        if i % 10 == 0:
            print(f"Epoch {epoch}, batch {i}, loss: {loss.item()}")

# 6. Model inference
model.eval()
test_input = "Python is a popular programming language."
encoded_dict = tokenizer.encode_plus(
    test_input,
    add_special_tokens=True,
    max_length=max_seq_length,
    padding='max_length',
    truncation=True,
    return_attention_mask=True,
    return_tensors='pt'
)
with torch.no_grad():
    logits = model(encoded_dict['input_ids'].to(device),
                   encoded_dict['attention_mask'].to(device))
    predicted_label = torch.argmax(logits, dim=1).item()
print(f"Predicted label: {predicted_label}")