13-NLP Bert multi-classification implementation case (data acquisition and processing)

Article directory

  • Preface knowledge
  • 1. Code interpretation
    • 1.1 Code display
    • 1.2 Process introduction
    • 1.3 Debug-style line-by-line walkthrough
  • 3. Knowledge points
    • 3.2 Code questions
      • 1. Use tokenizer.encode() to convert each line of text into token IDs in the BERT vocabulary. Does it include word embeddings?

Preface knowledge

An article can belong to multiple categories at the same time. In ordinary multi-class classification, however, even though there are several possible categories, each sample can belong to only one of them.

In the figure below, the output layer previously had 5 neurons, one per category. Softmax produces a probability for each of the 5 categories, and the one with the highest probability is taken as the prediction result.

Now this network structure needs a slight adjustment. Each neuron in the output layer is followed by its own sigmoid function, which maps the output to a value between 0 and 1: below 0.5 means the category is not selected, above 0.5 means it is selected. Each category therefore has two independent states, selected and unselected, and taken together this forms a multi-label classification network.
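As a quick numerical sketch of this idea (the numbers are made up for illustration and are not from the article's model), each output neuron gets its own sigmoid and a 0.5 threshold:

import numpy as np

# Minimal sketch: five output neurons, each followed by an independent sigmoid
# and thresholded at 0.5 (values are invented for illustration).
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

logits = np.array([2.3, -1.1, 0.7, -3.0, 0.1])   # hypothetical raw outputs of 5 neurons
probs = sigmoid(logits)                          # one independent probability per category
selected = (probs > 0.5).astype(int)             # 1 = selected, 0 = unselected

print(probs.round(3))   # approximately [0.909 0.25  0.668 0.047 0.525]
print(selected)         # [1 0 1 0 1] -> several categories can be selected at once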

The following analysis uses knowledge-graph data. A triple expresses the relationship between two entities, or between an entity and an attribute, and that relationship is the label P.
A triple states that O is the P of S; for example, Lu Xun is the author of "The Scream", and Wu Jing is the director of "Wolf Warrior".
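The article does not show the raw file format, but judging from the fields the preprocessing code below reads (text, new_spo_list, and spo['p']['entity']), a single entry of train_data.json presumably looks roughly like this; the concrete values and the "s"/"o" sub-fields are invented for illustration:

# Hypothetical shape of one entry in train_data.json, inferred from the fields the
# preprocessing code accesses; the values and the "s"/"o" keys are made up.
example_entry = {
    "text": 'Wu Jing is the director of "Wolf Warrior".',
    "new_spo_list": [
        {
            "s": {"entity": "Wolf Warrior"},
            "p": {"entity": "Director"},   # only this field is used to build the label
            "o": {"entity": "Wu Jing"},
        }
    ],
}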

1. Code interpretation

1.1 Code display

import json
import numpy as np
from tqdm import tqdm

bert_model = "bert-base-chinese"

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(bert_model)
# Relationship labels of all p in spo (text --> p label)
p_entitys = ['husband', 'release time', 'host', 'starring', 'protagonist', 'composer', 'author', 'lyricist', 'production company', 'birthplace', 'date of birth', 'Founder', 'Producer', 'No.', 'Guest', 'Nationality',
             'Wife', 'Word', 'Director', 'Album', 'Adapted from', 'Dynasty', 'Singer', 'Mother', 'Graduation School', 'Ethnicity', 'Father', 'ancestral home', 'screenwriter', 'chairman', 'height', 'serial website']

max_length = 300
train_list = []
label_list = []
with open(file="../data/train_data.json", mode='r', encoding='UTF-8') as f:
    data = json.load(f)
    for line in tqdm(data):
        text = tokenizer.encode(line['text'])
        token = text[:max_length] + [0] * (max_length - len(text))
        train_list.append(token)
        # Get the standard answer of the current text (p in spo)
        new_spo_list = line['new_spo_list']
        label = [0.] * len(p_entitys)  # one slot per relation label (multi-hot target vector)
        for spo in new_spo_list:
            p_entity = spo['p']['entity']
            label[p_entitys.index(p_entity)] = 1.0
        label_list.append(label)
train_list = np.array(train_list)
label_list = np.array(label_list)
val_train_list = []
val_label_list = []
# Load and preprocess the validation set
with open('../data/valid_data.json', 'r', encoding="UTF-8") as f:
    data = json.load(f)
    for line in tqdm(data):
        text = line["text"]
        new_spo_list = line["new_spo_list"]
        label = [0.] * len(p_entitys)
        for spo in new_spo_list:
            p_entity = spo["p"]["entity"]
            label[p_entitys.index(p_entity)] = 1.
        token = tokenizer.encode(text)
        token = token[:max_length] + [0] * (max_length - len(token))
        val_train_list.append(token)
        val_label_list.append(label)
val_train_list = np.array(val_train_list)
val_label_list = np.array(val_label_list)

1.2 Process introduction

The overall purpose of this code is to prepare Chinese text data for a natural language processing task, specifically multi-label relation classification (predicting which relation labels appear in each text). The specific steps are as follows:

  1. Import necessary libraries:

    • json: used to process JSON data format.
    • numpy: Scientific computing library for efficient processing of array operations.
    • tqdm: used to display the progress bar during loop operations.
  2. Specify the type of BERT model (bert-base-chinese), used for Chinese text.

  3. Use AutoTokenizer in the transformers library to load the tokenizer of the specified BERT model.

  4. Defines a set of Chinese entity relationship tags (p_entitys) that appear to be tags for relationships between specific entities in text.

  5. Set a maximum sequence length (max_length) to 300 to limit the length of text processed.

  6. Initialize two empty lists train_list and label_list to store features and labels of training data.

  7. Load and process training data (train_data.json):

    • Use json.load() to read the file content.
    • Use tokenizer.encode() to convert each line of text into token IDs in the BERT vocabulary.
    • Each encoded sequence is truncated to max_length and padded with 0s if it is shorter.
    • Parse each text's new_spo_list to build a label vector in which the position of every relationship that appears in the text is set to 1 and all other positions are 0.
  8. Convert train_list and label_list to NumPy arrays for further processing.

  9. The validation set is processed the same as the training set and stored in val_train_list and val_label_list.

The purpose of the code is to convert Chinese text and its corresponding entity relationship labels into a numerical format that can be accepted by the machine learning model. This process is often part of text preprocessing in natural language processing tasks, especially when using pre-trained language models such as BERT.

The work you need to complete may include:

  • Make sure you have the environment and libraries configured correctly to run this code.
  • If this is your first run, make sure you download the bert-base-chinese model and tokenizer.
  • Prepare train_data.json and valid_data.json data files.
  • Run the code and monitor the progress bar to make sure the data is being processed correctly.

1.3 Debug-style line-by-line walkthrough

The function of the code is explained line by line below:

import json
import numpy as np
from tqdm import tqdm

These three lines import modules: json is used to process data in JSON format, numpy is a widely used scientific computing library, and tqdm is a library for displaying loop progress bars, giving the user feedback during long loops.

bert_model = "bert-base-chinese"

This line of code defines the variable bert_model and sets it to "bert-base-chinese", which refers to the Chinese pre-trained version of the BERT model.

from transformers import AutoTokenizer

Import AutoTokenizer from the transformers library, which is a class provided by Hugging Face for automatically obtaining and loading pre-trained tokenizers.

tokenizer = AutoTokenizer.from_pretrained(bert_model)

A tokenizer object is created using the from_pretrained method. This tokenizer will be loaded according to the model specified by the bert_model variable.

p_entitys = [...]

Here is a list of all possible relationship labels, used to map relationships in the text into a fixed-length label vector.

max_length = 300

A variable max_length is defined and set to 300; it will be used as the maximum length of the text sequences processed below.

train_list = []
label_list = []

Initialize two lists train_list and label_list, which are used to store processed text data and corresponding label data respectively.

with open(file="../data/train_data.json", mode='r', encoding='UTF-8') as f:

Open the training data file train_data.json in read-only mode ('r') with UTF-8 encoding.

 data = json.load(f)

Load the entire JSON file content into the variable data.

 for line in tqdm(data):

Iterate over each item in data, and tqdm will display a progress bar for this loop.

 text = tokenizer.encode(line['text'])

Use the tokenizer to encode the text into BERT input IDs. By default, encode() also adds the special [CLS] and [SEP] tokens.
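For intuition, a quick check with the tokenizer created above (the exact IDs depend on the bert-base-chinese vocabulary, and the sample sentence is made up):

# Illustration only, using the tokenizer loaded earlier; the concrete IDs come
# from the bert-base-chinese vocabulary.
ids = tokenizer.encode("吴京是《战狼》的导演")
print(ids)  # roughly [101, ..., 102]: 101 is [CLS], 102 is [SEP], with one ID per character in between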

 token = text[:max_length] + [0] * (max_length - len(text))

This line truncates the encoded text to max_length and pads it with 0s up to max_length if it is shorter (when the text is longer than max_length, the padding term becomes an empty list, so only truncation takes effect).
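A small sanity check of this truncate-or-pad expression, with max_length shortened to 5 purely for illustration:

# Self-contained check of the truncate-or-pad pattern used above.
max_len = 5
short = [101, 102]                       # shorter than max_len -> padded with 0s
long_ = [101, 7, 8, 9, 10, 11, 102]      # longer than max_len -> truncated; [0] * negative == []
print(short[:max_len] + [0] * (max_len - len(short)))   # [101, 102, 0, 0, 0]
print(long_[:max_len] + [0] * (max_len - len(long_)))   # [101, 7, 8, 9, 10]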

 train_list.append(token)

Add the processed text to the train_list list.

 new_spo_list = line['new_spo_list']

Extracts the relationship list of the current entry.

 label = [0.] * len(p_entitys)

Create a vector of zeros with the same length as the number of relationship labels.

 for spo in new_spo_list:

Traverse each relationship of the current item.

 p_entity = spo['p']['entity']

Extract the relation label (the 'p' entity) from the triple.

 label[p_entitys.index(p_entity)] = 1.0

In the label vector, set the position corresponding to this relation to 1.0, indicating that the relation appears in the text.
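Put together with the zero vector created above, this yields a multi-hot label. A tiny self-contained illustration with a shortened, made-up label list:

# Multi-hot label construction, mirroring the loop above with invented relations.
p_entitys_demo = ['author', 'Director', 'Singer']    # shortened stand-in for p_entitys
label_demo = [0.] * len(p_entitys_demo)
for p_entity in ['author', 'Singer']:                # relations found in one hypothetical text
    label_demo[p_entitys_demo.index(p_entity)] = 1.0
print(label_demo)  # [1.0, 0.0, 1.0] -> more than one relation can be active at once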

 label_list.append(label)

Add the constructed label vector to label_list.

train_list = np.array(train_list)
label_list = np.array(label_list)

Convert the lists to NumPy arrays for subsequent processing.
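Assuming N training samples and the 32 relation labels defined in p_entitys, the resulting shapes would be:

# Expected shapes after conversion (N = number of training samples; 32 = len(p_entitys)).
print(train_list.shape)   # (N, 300) -> padded/truncated token IDs
print(label_list.shape)   # (N, 32)  -> multi-hot relation labels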

The next code block is similar to the previous one, but operates on the validation data set valid_data.json.

val_train_list = []
val_label_list = []

Initialize two lists for the validation set.

with open('../data/valid_data.json', 'r', encoding="UTF-8") as f:

Open the validation data file.

The next block of code is the same as for the training data: it loads the data, encodes the text, generates label vectors, and appends them to the corresponding validation-set lists.

val_train_list = np.array(val_train_list)
val_label_list = np.array(val_label_list)

Finally, the validation-set lists are likewise converted into NumPy arrays.

This code as a whole is used to process and prepare data so that it is suitable for input into a neural network model. It not only encodes text but also constructs labels corresponding to it, which is necessary when training and validating machine learning models.

3. Knowledge points

3.2 Code questions

1. Use tokenizer.encode() to convert each line of text into token IDs in the BERT vocabulary. Does it include word embeddings?

When using tokenizer.encode() to convert text into token IDs in the BERT vocabulary, word embeddings are not directly included. This process is primarily a text preprocessing step that involves segmenting the text into tokens and converting these tokens into numerical IDs that the model can understand. This process is explained in detail below:

  1. Tokenization:

    • First, the tokenizer.encode() function splits the given text string into tokens. In the case of BERT, this typically includes word-level segmentation as well as subword unit segmentation.
  2. Convert to Token IDs:

    • These tokens are then converted into corresponding numerical IDs in the vocabulary. The BERT model has a predefined vocabulary, and each token is mapped to a unique ID (see the short sketch after this list).
  3. The role of word embeddings:

    • Word embeddings are used in subsequent stages of the model. When these token IDs are fed into the BERT model, the first layer of the model (the embedding layer) converts these numeric IDs into word-embedding vectors.
    • Word embeddings are vectors in a high-dimensional space, and each vector represents a token in the input text. These vectors are learned during the training process and can capture the semantic information of words.
  4. Distinguish between Tokenization and Embedding:

    • It is important to distinguish between tokenization (and the token IDs it generates) and word embeddings. Tokenization is simply a pre-processing step that converts text into a form that the model can understand. Word embeddings are vectors used internally by the model to represent the semantics of these tokens.
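To make steps 1 and 2 concrete, the two stages can also be run separately; the printed tokens and IDs depend on the tokenizer and its vocabulary version:

from transformers import BertTokenizer

# Hedged sketch: tokenization and ID lookup are separate steps, and neither of them
# involves computing embeddings.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize("Hello, how are you?")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)  # e.g. ['hello', ',', 'how', 'are', 'you', '?']
print(ids)     # the matching integer IDs from the vocabulary -- still no embeddings involved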

In summary, tokenizer.encode() is only responsible for converting text into token IDs and does not involve the creation of word embeddings. Word embeddings are generated internally as part of the model’s processing of these token IDs.

Here is a simple Python example, with input and output, that demonstrates how to use the BERT tokenizer to convert text into token IDs and then use a basic BERT model to turn those token IDs into word embeddings. Note that it requires the transformers library (installable via pip install transformers) and, because it uses TFBertModel, TensorFlow as well.

Example: Using BERT Tokenizer and Model

Input
Text: “Hello, how are you?”

Python code

from transformers import BertTokenizer, TFBertModel

# Initialize tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')

# Text input
text = "Hello, how are you?"

# Use tokenizer to convert text to token IDs
input_ids = tokenizer.encode(text, return_tensors='tf')

# Output token IDs
print("Token IDs:", input_ids)

# Use the model to get word embeddings
outputs = model(input_ids)
last_hidden_states = outputs.last_hidden_state

# Output the shape of the word embedding
print("Shape of Embeddings:", last_hidden_states.shape)

Output

  1. Token IDs: A series of numbers representing the ID of each token.
  2. Shape of Embeddings: This will show the dimensions of the word embeddings, usually (batch_size, sequence_length, hidden_size); for bert-base-uncased the hidden size is 768.

Explanation

  • This code first loads the BERT tokenizer and model.
  • It then uses a tokenizer to convert the given text string “Hello, how are you?” into a series of token IDs.
  • Finally, these token IDs are passed into the BERT model, which returns word embeddings for each token.
  • The output contains token IDs and shape information of word embeddings.

Note that running this code requires the transformers library to be installed in a Python environment, and an internet connection is required to download the model and tokenizer.