This article introduces two text classification models: a BERT text classification baseline, and a "magic modified" model that combines BERT with TextCNN. In the author's actual text classification work, the modified model's F1 score exceeded the BERT baseline by nearly 4 percentage points.
1. Baseline: BERT text classifier
BERT is a language model released by Google in October 2018. As soon as it came out, it swept the state-of-the-art results on 11 NLP tasks and became an instant hit.
We will not go into the details of the Transformer architecture inside BERT here; interested readers can refer to the article "The Illustrated Transformer"[1].
BERT single text classification model structure
1.1 BERT text classification model
The common practice for a BERT text classification model is to take the output at the first token position (the [CLS] position) of BERT's last layer as the sentence representation, followed by a fully connected layer for classification. The model is very simple; let's look directly at the code!
1.2 PyTorch code implementation
# -*- coding:utf-8 -*-
# BERT text classification baseline model
# model: bert
import os
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.utils.data as Data
import torch.optim as optim
import transformers
from transformers import AutoModel, AutoTokenizer
import matplotlib.pyplot as plt

train_curve = []
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Define some parameters. The model chooses the most basic BERT Chinese model.
batch_size = 2
epoches = 100
model = "bert-base-chinese"
hidden_size = 768
n_class = 2
maxlen = 8

# Data: construct some training data.
sentences = ["I like playing basketball", "This camera is very nice", "I had a lot of fun today",
             "I don't like you", "Too bad", "It's such a sad thing"]
labels = [1, 1, 1, 0, 0, 0]  # 1 positive, 0 negative

# word_list = ' '.join(sentences).split()
# word_list = list(set(word_list))
# word_dict = {w: i for i, w in enumerate(word_list)}
# num_dict = {i: w for w, i in word_dict.items()}
# vocab_size = len(word_list)

# Construct the data in BERT's input format:
# input_ids: vocabulary ids of the tokens
# attention_mask: same length as input_ids, 1 at real-token positions, 0 at padding positions
# token_type_ids: 0 for the first sentence, 1 for the second sentence
class MyDataset(Data.Dataset):
    def __init__(self, sentences, labels=None, with_labels=True):
        self.tokenizer = AutoTokenizer.from_pretrained(model)
        self.with_labels = with_labels
        self.sentences = sentences
        self.labels = labels

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, index):
        # Select the sentence at the specified index
        sent = self.sentences[index]

        # Tokenize the sentence to get token ids, attention mask and token type ids
        encoded_pair = self.tokenizer(sent,
                                      padding='max_length',  # Pad to max_length
                                      truncation=True,       # Truncate to max_length
                                      max_length=maxlen,
                                      return_tensors='pt')   # Return torch.Tensor objects

        token_ids = encoded_pair['input_ids'].squeeze(0)             # tensor of token ids
        attn_masks = encoded_pair['attention_mask'].squeeze(0)       # binary tensor, 0 for padded values and 1 for the rest
        token_type_ids = encoded_pair['token_type_ids'].squeeze(0)   # binary tensor, 0 for 1st-sentence tokens and 1 for 2nd-sentence tokens

        if self.with_labels:  # True if the dataset has labels
            label = self.labels[index]
            return token_ids, attn_masks, token_type_ids, label
        else:
            return token_ids, attn_masks, token_type_ids

train = Data.DataLoader(dataset=MyDataset(sentences, labels), batch_size=batch_size, shuffle=True, num_workers=1)

# Model
class BertClassify(nn.Module):
    def __init__(self):
        super(BertClassify, self).__init__()
        self.bert = AutoModel.from_pretrained(model, output_hidden_states=True, return_dict=True)
        self.linear = nn.Linear(hidden_size, n_class)  # Feed the cls vector directly into a fully connected classification layer
        self.dropout = nn.Dropout(0.5)

    def forward(self, X):
        input_ids, attention_mask, token_type_ids = X[0], X[1], X[2]
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask,
                            token_type_ids=token_type_ids)  # Returns an output dictionary
        # Use the last-layer cls vector for classification
        # outputs.pooler_output: [bs, hidden_size]
        logits = self.linear(self.dropout(outputs.pooler_output))
        return logits

bc = BertClassify().to(device)

optimizer = optim.Adam(bc.parameters(), lr=1e-3, weight_decay=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Train
sum_loss = 0
total_step = len(train)
for epoch in range(epoches):
    for i, batch in enumerate(train):
        optimizer.zero_grad()
        batch = tuple(p.to(device) for p in batch)
        pred = bc([batch[0], batch[1], batch[2]])
        loss = loss_fn(pred, batch[3])
        sum_loss += loss.item()

        loss.backward()
        optimizer.step()
        if epoch % 10 == 0:
            print('[{}|{}] step:{}/{} loss:{:.4f}'.format(epoch + 1, epoches, i + 1, total_step, loss.item()))
    train_curve.append(sum_loss)
    sum_loss = 0

# Test
bc.eval()
with torch.no_grad():
    test_text = ["I don't like playing basketball"]
    test = MyDataset(test_text, labels=None, with_labels=False)
    x = test.__getitem__(0)
    x = tuple(p.unsqueeze(0).to(device) for p in x)
    pred = bc([x[0], x[1], x[2]])
    pred = pred.data.max(dim=1, keepdim=True)[1]
    if pred[0][0] == 0:
        print('negative')
    else:
        print('positive')

pd.DataFrame(train_curve).plot()  # loss curve
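A note on the code above: the prose describes taking the first-token ([CLS]) output of BERT's last layer, while the forward pass actually classifies outputs.pooler_output, which is that [CLS] vector passed through BERT's extra dense + tanh pooling head. The minimal sketch below (using the same Hugging Face transformers API as the code above; the example sentence is arbitrary) shows how to obtain either vector; both have shape [bs, hidden_size] and either can feed the linear classifier.

import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(name)
bert = AutoModel.from_pretrained(name, return_dict=True)

with torch.no_grad():
    out = bert(**tokenizer("I had a lot of fun today", return_tensors='pt'))

cls_raw = out.last_hidden_state[:, 0, :]  # raw last-layer [CLS] hidden state, [1, 768]
cls_pooled = out.pooler_output            # the same [CLS] vector after BERT's dense + tanh pooler, [1, 768]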
1.3 Results and code link
Single-sample test result:
Loss curve:
Relevant code links are as follows:
BERT text classification Jupyter version[2]
BERT text classification PyTorch version[3]
2. Optimization: a "magic modification" based on BERT and TextCNN
2.1 TextCNN
Before the advent of BERT, TextCNN occupied a pivotal position among text classification models. This is because a CNN can effectively capture the n-gram information in a text sequence, and classification is essentially about capturing combinations of n-gram features. Whether it is keywords, content, or the higher-level semantics of a sentence, they all appear in the sentence in the form of n-gram features.
TextCNN model structure
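To make the n-gram intuition concrete, here is a minimal standalone TextCNN sketch. It is not the exact model used in section 2.3; the class name, kernel sizes, and filter counts here are illustrative. Each convolution kernel of height k slides over k consecutive token embeddings, i.e. it scores k-grams, and max-pooling keeps the strongest n-gram response per filter:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniTextCNN(nn.Module):
    """Minimal TextCNN sketch: each kernel height corresponds to an n-gram width."""
    def __init__(self, embed_dim=768, n_class=2, kernel_sizes=(2, 3, 4), num_filters=16):
        super().__init__()
        # One Conv2d per n-gram width; each kernel spans the full embedding dimension.
        self.convs = nn.ModuleList([
            nn.Conv2d(1, num_filters, kernel_size=(k, embed_dim)) for k in kernel_sizes
        ])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), n_class)

    def forward(self, x):                 # x: [bs, seq_len, embed_dim]
        x = x.unsqueeze(1)                # [bs, 1, seq_len, embed_dim]
        pooled = []
        for conv in self.convs:
            h = F.relu(conv(x)).squeeze(3)                        # [bs, num_filters, seq_len - k + 1]
            pooled.append(F.max_pool1d(h, h.size(2)).squeeze(2))  # [bs, num_filters]
        return self.fc(torch.cat(pooled, dim=1))                  # [bs, n_class]

# Shape check with random "embeddings":
logits = MiniTextCNN()(torch.randn(2, 8, 768))  # -> [2, 2]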
2.2 The "magic modification" idea
After running experiments with both BERT and TextCNN, the author was surprised to find that BERT often does better on sentences whose meaning is expressed obliquely, while TextCNN is often more sensitive to keywords. So the author modified the model, merging the ideas behind BERT and TextCNN.
Apart from the input embedding layer, BERT-Base has 12 encoder layers, and the first-token ([CLS]) vector output by each encoder layer can be used as a sentence vector. We can understand it roughly as follows:
- the shallower the encoder layer, the more its sentence vector represents low-level semantic information;
- the deeper the layer, the more it represents high-level semantic information.
Our goal is to obtain both word-related features and semantic features. Concretely, the model feeds the [CLS] vectors from layers 1 through 12 into a CNN as its input sequence and classifies on top of that.
Fusion BERT-Blend-CNN
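Before the full model, a minimal sketch of the key step: with output_hidden_states=True, the Hugging Face model returns 13 hidden-state tensors (the embedding layer plus 12 encoder layers), and the per-layer [CLS] vectors can be collected like this (the example sentence is arbitrary):

import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(name)
bert = AutoModel.from_pretrained(name, output_hidden_states=True, return_dict=True)

with torch.no_grad():
    out = bert(**tokenizer("I like playing basketball", return_tensors='pt'))

# out.hidden_states: tuple of 13 tensors, each [bs, seq_len, hidden].
# Index 0 is the embedding layer, so take the [CLS] (position 0) vector of layers 1..12.
cls_per_layer = torch.stack([h[:, 0, :] for h in out.hidden_states[1:]], dim=1)
print(cls_per_layer.shape)  # torch.Size([1, 12, 768]) -> the input shape expected by the TextCNN below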
Without further ado, let’s just look at the code!
2.3 PyTorch code implementation
# -*- coding:utf-8 -*-
# BERT combined with the TextCNN idea: Bert + Blend-CNN
# model: Bert + Blend-CNN
import os
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.utils.data as Data
import torch.nn.functional as F
import torch.optim as optim
import transformers
from transformers import AutoModel, AutoTokenizer
import matplotlib.pyplot as plt

train_curve = []
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Define some parameters. The model chooses the most basic BERT Chinese model.
batch_size = 2
epoches = 100
model = "bert-base-chinese"
hidden_size = 768
n_class = 2
maxlen = 8
encode_layer = 12
filter_sizes = [2, 2, 2]
num_filters = 3

# Data: construct some training data.
sentences = ["I like playing basketball", "This camera is very nice", "I had a lot of fun today",
             "I don't like you", "Too bad", "It's such a sad thing"]
labels = [1, 1, 1, 0, 0, 0]  # 1 positive, 0 negative

class MyDataset(Data.Dataset):
    def __init__(self, sentences, labels=None, with_labels=True):
        self.tokenizer = AutoTokenizer.from_pretrained(model)
        self.with_labels = with_labels
        self.sentences = sentences
        self.labels = labels

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, index):
        # Select the sentence at the specified index
        sent = self.sentences[index]

        # Tokenize the sentence to get token ids, attention mask and token type ids
        encoded_pair = self.tokenizer(sent,
                                      padding='max_length',  # Pad to max_length
                                      truncation=True,       # Truncate to max_length
                                      max_length=maxlen,
                                      return_tensors='pt')   # Return torch.Tensor objects

        token_ids = encoded_pair['input_ids'].squeeze(0)             # tensor of token ids
        attn_masks = encoded_pair['attention_mask'].squeeze(0)       # binary tensor, 0 for padded values and 1 for the rest
        token_type_ids = encoded_pair['token_type_ids'].squeeze(0)   # binary tensor, 0 for 1st-sentence tokens and 1 for 2nd-sentence tokens

        if self.with_labels:  # True if the dataset has labels
            label = self.labels[index]
            return token_ids, attn_masks, token_type_ids, label
        else:
            return token_ids, attn_masks, token_type_ids

train = Data.DataLoader(dataset=MyDataset(sentences, labels), batch_size=batch_size, shuffle=True, num_workers=1)

class TextCNN(nn.Module):
    def __init__(self):
        super(TextCNN, self).__init__()
        self.num_filter_total = num_filters * len(filter_sizes)
        self.Weight = nn.Linear(self.num_filter_total, n_class, bias=False)
        self.bias = nn.Parameter(torch.ones([n_class]))
        self.filter_list = nn.ModuleList([
            nn.Conv2d(1, num_filters, kernel_size=(size, hidden_size)) for size in filter_sizes
        ])

    def forward(self, x):
        # x: [bs, seq=encode_layer, hidden]
        x = x.unsqueeze(1)  # [bs, channel=1, seq, hidden]

        pooled_outputs = []
        for i, conv in enumerate(self.filter_list):
            h = F.relu(conv(x))  # [bs, channel=num_filters, seq - filter_sizes[i] + 1, 1]
            mp = nn.MaxPool2d(kernel_size=(encode_layer - filter_sizes[i] + 1, 1))
            pooled = mp(h).permute(0, 3, 2, 1)  # [bs, h=1, w=1, channel=num_filters]
            pooled_outputs.append(pooled)

        h_pool = torch.cat(pooled_outputs, len(filter_sizes))  # [bs, h=1, w=1, channel=num_filters * len(filter_sizes)]
        h_pool_flat = torch.reshape(h_pool, [-1, self.num_filter_total])

        output = self.Weight(h_pool_flat) + self.bias  # [bs, n_class]
        return output

# Model
class Bert_Blend_CNN(nn.Module):
    def __init__(self):
        super(Bert_Blend_CNN, self).__init__()
        self.bert = AutoModel.from_pretrained(model, output_hidden_states=True, return_dict=True)
        self.linear = nn.Linear(hidden_size, n_class)
        self.textcnn = TextCNN()

    def forward(self, X):
        input_ids, attention_mask, token_type_ids = X[0], X[1], X[2]
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask,
                            token_type_ids=token_type_ids)  # Returns an output dictionary
        # Get the vectors encoded by each layer
        hidden_states = outputs.hidden_states  # 13 * [bs, seq_len, hidden]; the first entry is the embedding layer and is not used
        cls_embeddings = hidden_states[1][:, 0, :].unsqueeze(1)  # [bs, 1, hidden]
        # Take the first token (cls vector) of each encoder layer and stack them as the input of TextCNN
        for i in range(2, 13):
            cls_embeddings = torch.cat((cls_embeddings, hidden_states[i][:, 0, :].unsqueeze(1)), dim=1)
        # cls_embeddings: [bs, encode_layer=12, hidden]
        logits = self.textcnn(cls_embeddings)
        return logits

bert_blend_cnn = Bert_Blend_CNN().to(device)

optimizer = optim.Adam(bert_blend_cnn.parameters(), lr=1e-3, weight_decay=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Train
sum_loss = 0
total_step = len(train)
for epoch in range(epoches):
    for i, batch in enumerate(train):
        optimizer.zero_grad()
        batch = tuple(p.to(device) for p in batch)
        pred = bert_blend_cnn([batch[0], batch[1], batch[2]])
        loss = loss_fn(pred, batch[3])
        sum_loss += loss.item()

        loss.backward()
        optimizer.step()
        if epoch % 10 == 0:
            print('[{}|{}] step:{}/{} loss:{:.4f}'.format(epoch + 1, epoches, i + 1, total_step, loss.item()))
    train_curve.append(sum_loss)
    sum_loss = 0

# Test
bert_blend_cnn.eval()
with torch.no_grad():
    test_text = ["I don't like playing basketball"]
    test = MyDataset(test_text, labels=None, with_labels=False)
    x = test.__getitem__(0)
    x = tuple(p.unsqueeze(0).to(device) for p in x)
    pred = bert_blend_cnn([x[0], x[1], x[2]])
    pred = pred.data.max(dim=1, keepdim=True)[1]
    if pred[0][0] == 0:
        print('negative')
    else:
        print('positive')

pd.DataFrame(train_curve).plot()  # loss curve
2.4 Test results and code links
Single-sample test result:
Loss curve:
Code link:
BERT-Blend-CNN Jupyter version[4]
BERT-Blend-CNN PyTorch version[5]
Reference materials
[1] "The Illustrated Transformer": https://jalammar.github.io/illustrated-transformer/
[2] BERT text classification Jupyter version: https://github.com/PouringRain/blog_code/blob/main/nlp/bert_classify.ipynb
[3] BERT text classification PyTorch version: https://github.com/PouringRain/blog_code/blob/main/nlp/bert_classify.py
[4] BERT-Blend-CNN Jupyter version: https://github.com/PouringRain/blog_code/blob/main/nlp/Bert_Blend_CNN.ipynb
[5] BERT-Blend-CNN PyTorch version: https://github.com/PouringRain/blog_code/blob/main/nlp/bert_blend_cnn.py