AI modeling and training practice based on HF transformers

We often use scikit-learn to model data for both supervised and unsupervised learning tasks, and we are familiar with its object-oriented design: you instantiate an estimator class and call its methods. When I use PyTorch, however, I find design patterns that are similar to scikit-learn's but not quite the same.


1. PyTorch and Transformers

To train a model with PyTorch, you create one class for the dataset and one for the model, and each class inherits from a framework base class. For example, you define a dataset class such as TextDataset(Dataset), where Dataset comes from torch.utils.data, and a model class such as Classifier(nn.Module). Compared to scikit-learn, PyTorch lets you build your own classes and put the custom logic inside them.
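
As a rough illustration (the class bodies below are placeholder examples, not the classes used later in this post), the two classes typically look like this:

import torch
from torch import nn
from torch.utils.data import Dataset

class TextDataset(Dataset):            # must inherit torch.utils.data.Dataset
    def __init__(self, texts, labels):
        self.texts, self.labels = texts, labels

    def __len__(self):                 # number of samples
        return len(self.texts)

    def __getitem__(self, idx):        # returns one sample per call
        return self.texts[idx], self.labels[idx]

class Classifier(nn.Module):           # must inherit torch.nn.Module
    def __init__(self, vocab_size, embed_dim, num_labels):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_labels)

    def forward(self, token_ids):      # forward pass returns raw logits
        return self.fc(self.embedding(token_ids))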

Let's talk about the Transformers library developed by Hugging Face. I first became aware of it when I saw a post about new features from a research scientist I'm connected to on LinkedIn. I didn't try it right away, though. Then came the moment when I got curious about the BERT architecture and wanted to implement it.

If you are new to PyTorch, try building a few models with it first, for example a text classification or image classification model. PyTorch has great documentation. Once you are comfortable training and evaluating models with PyTorch, Transformers is easy to pick up.

I started exploring Transformers from its documentation. I also follow the Hugging Face Twitter account, which frequently posts about the latest updates to the library, so it is easy to keep up with new releases. Check out the usage page in the Transformers documentation: it explains how to handle a variety of natural language processing applications, from sequence classification to neural machine translation.
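
For example, the quickest way to try one of these applications is the high-level pipeline API; a minimal sketch that leaves the model choice to the library defaults:

from transformers import pipeline

# A ready-made sequence classification (sentiment analysis) pipeline
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers makes NLP models easy to use."))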

In the deep learning community, we often emphasize pre-trained models. A pre-trained model can be reused on another, similar task, which is known as transfer learning. Hugging Face hosts pre-trained models from various developers: they have created a platform for sharing pre-trained models that you can also use for your own tasks.
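
Loading one of these shared models takes only a couple of lines; a sketch using bert-base-uncased, with num_labels=2 assumed for a binary task:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Download a tokenizer and pre-trained weights shared on the Hugging Face hub
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)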

After that I wondered: what if I want to train my own model with the BERT architecture without using a pre-trained model, in other words train it from scratch? Most blog posts and research papers discuss pre-trained models, usually with bert-base-uncased as the example. Then I remembered that PyTorch is very different from Keras, which gives you fit and predict functions.

2. Using Transformers to train a BERT text classification model

I started by using PyTorch to train a text classifier on a Kaggle competition dataset; Real or Not? NLP with Disaster Tweets was a good place to start. I had tried XGBoost and CatBoost with a fair amount of data cleaning and feature extraction, but the score just wouldn't go up. Then I figured BERT would push the score higher on the leaderboard. It turned out that using bert-base-uncased raised the mean F-score to 83% and put us in the top 12% of the leaderboard!

To train the model from scratch, I created a function to generate the pipeline:

from transformers import (AutoConfig, AutoTokenizer,
                          AutoModelForSequenceClassification)

def generate_model(args, num_labels):
    # Model/tokenizer configuration: architecture, number of labels, task name
    config = AutoConfig.from_pretrained(
        args.model_name_or_path,
        num_labels=num_labels,
        finetuning_task=args.task_name,
    )
    # Pre-trained tokenizer with its published vocabulary
    tokenizer = AutoTokenizer.from_pretrained(
        args.model_name_or_path,
        do_lower_case=args.do_lower_case
    )
    # Model built from the configuration only, i.e. randomly initialized weights
    model = AutoModelForSequenceClassification.from_config(config)

    return config, tokenizer, model

As you can see, Transformers needs three components to train or run inference with your model. AutoConfig sets up the model and tokenizer configuration; AutoConfig can be swapped for BertConfig or any other config class available in Transformers. AutoTokenizer is used and configured the same way, but here the tokenizer is loaded pre-trained, meaning I use the tokenizer from bert-base-uncased. It already ships with its own vocabulary (you can inspect each token in vocab.txt), so I didn't build my own vocabulary. The last component, AutoModelForSequenceClassification, is loaded from the configuration only, because I want to train from scratch; AutoModel can likewise be swapped for BertForSequenceClassification.
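
Calling the function could look like this; the args object below is a hypothetical stand-in for whatever argument parser the script uses:

from argparse import Namespace

# Hypothetical arguments; names mirror the attributes generate_model reads
args = Namespace(
    model_name_or_path="bert-base-uncased",   # source of config and tokenizer
    task_name="disaster-tweets",              # illustrative task name
    do_lower_case=True,
)

config, tokenizer, model = generate_model(args, num_labels=2)
# from_config() gives randomly initialized weights (training from scratch);
# from_pretrained() would load the published bert-base-uncased weights instead.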

In the second step, I create the DisasterDataset class to load the dataset:

import pandas as pd
from torch.utils.data import Dataset

class DisasterDataset(Dataset):
    def __init__(self, data_path, eval_path, tokenizer):
        d_data = pd.read_table(data_path, sep=',')
        d_eval = pd.read_table(eval_path, sep=',')

        # 80/20 train/test split of the labelled data
        row, col = d_data.shape
        d_train = d_data[:int(row * 0.8)]
        d_test = d_data[int(row * 0.8):]

        d_train.reset_index(drop=True, inplace=True)
        d_test.reset_index(drop=True, inplace=True)

        self.tokenizer = tokenizer
        self.dataset = {'train': (d_train, len(d_train)),
                        'test': (d_test, len(d_test)),
                        'eval': (d_eval, len(d_eval))}
        self.num_labels = len(d_train.target.unique().tolist())
        self.set_split('train')

    def get_vocab(self):
        # Dump a whitespace-split vocabulary of the active split to vocab.txt
        text = " ".join(self.data.text.tolist())
        text = text.lower()
        vocab = text.split(" ")
        with open('vocab.txt', 'w') as file:
            for word in vocab:
                file.write(word)
                file.write('\n')
        return 'vocab.txt'

    def set_split(self, split='train'):
        # Switch the active split: 'train', 'test' or 'eval'
        self.split = split
        self.data, self.length = self.dataset[split]

    def __getitem__(self, idx):
        # Tokenize one tweet; labelled splits also return the target
        x = self.data.loc[idx, "text"].lower()
        x = self.tokenizer.encode(x, return_tensors="pt")[0]

        if self.split != 'eval':
            y = self.data.loc[idx, "target"]
            return {'id': idx, 'x': x, 'y': y}
        else:
            id_ = self.data.loc[idx, "id"]
            return {'id': id_, 'x': x}

    def __len__(self):
        return self.length

The script above builds the dataset class. I didn't do any preprocessing or cleaning; the raw text is used as-is and tokenized with WordPiece (BertWordPiece from the Tokenizers library).
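The glue code that wires these pieces together is not shown above; a minimal sketch, assuming the Kaggle CSV file names and reusing generate_model and the args object from earlier:

import time
import torch
from torch.utils.data import DataLoader

# Assumed setup: file names and num_labels=2 are assumptions for the binary task
config, tokenizer, model = generate_model(args, num_labels=2)
dd = DisasterDataset('train.csv', 'test.csv', tokenizer)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

With that in place, the training loop runs for up to 100 epochs: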

for epoch in range(1, 101):
    running_loss = 0
    running_accuracy = 0
    running_loss_val = 0
    running_accuracy_val = 0

    start_time = time.time()

    # Training pass; the dataset class is assigned to the dd variable
    dd.set_split('train')
    dataset = DataLoader(dd, batch_size=64, shuffle=True, collate_fn=padded)
    model.train()
    for batch_index, batch_dict in enumerate(dataset, 1):
        optimizer.zero_grad()

        x = batch_dict['x'].permute(1, 0)   # (seq_len, batch) -> (batch, seq_len)
        x = x.to(device)
        y = batch_dict['y'].to(device)

        output = model(x)[0]                               # logits
        output = torch.softmax(output.squeeze(), dim=1)    # class probabilities
        loss = criterion(output, y.type(torch.LongTensor).to(device))

        # Running averages over the epoch
        running_loss += (loss.item() - running_loss) / batch_index

        accuracy = compute_accuracy(y, output)
        running_accuracy += (accuracy - running_accuracy) / batch_index

        loss.backward()
        optimizer.step()

    # Validation pass on the held-out test split
    dd.set_split('test')
    dataset = DataLoader(dd, batch_size=64, shuffle=True, collate_fn=padded)
    model.eval()
    with torch.no_grad():
        for batch_index, batch_dict in enumerate(dataset, 1):

            x = batch_dict['x'].permute(1, 0)
            x = x.to(device)
            y = batch_dict['y'].to(device)

            output = model(x)[0]
            output = torch.softmax(output.squeeze(), dim=1)
            loss = criterion(output, y.type(torch.LongTensor).to(device))

            running_loss_val += (loss.item() - running_loss_val) / batch_index

            accuracy = compute_accuracy(y, output)
            running_accuracy_val += (accuracy - running_accuracy_val) / batch_index
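
The loop above relies on two helpers, padded and compute_accuracy, that are not shown here; a plausible reconstruction based on how the loop uses them (a sketch, not necessarily the exact code):

from torch.nn.utils.rnn import pad_sequence

def padded(batch):
    # Pad variable-length token sequences to the longest one in the batch,
    # producing a (seq_len, batch) tensor that the loop permutes to (batch, seq_len)
    x = pad_sequence([item['x'] for item in batch], padding_value=0)
    out = {'x': x, 'id': torch.tensor([item['id'] for item in batch])}
    if 'y' in batch[0]:
        out['y'] = torch.tensor([item['y'] for item in batch])
    return out

def compute_accuracy(y_true, y_prob):
    # Percentage of correct predictions in the batch
    y_pred = y_prob.argmax(dim=1)
    return (y_pred == y_true).float().mean().item() * 100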

The script above trains the model. I used the Adam optimizer with a learning rate of 0.0001 and PyTorch's StepLR() scheduler with a step_size of 20 and a gamma of 0.01. For the loss criterion I used CrossEntropyLoss(); even though the task is binary and binary cross-entropy would be the usual choice, the model returns a probability for each class, using softmax as the activation function.
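
Based on that description, the corresponding setup (not shown as code above) would look roughly like this:

import torch
from torch import nn
from torch.optim.lr_scheduler import StepLR

# Optimizer, learning-rate scheduler and loss criterion as described above
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
scheduler = StepLR(optimizer, step_size=20, gamma=0.01)
criterion = nn.CrossEntropyLoss()
# scheduler.step() would typically be called once per epoch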

I ran the script on a Compute Engine instance with an NVIDIA K80 GPU and got results fairly quickly, because training stops early once running_accuracy > 90 and running_accuracy_val > 90. The script never needs to run the full 100 epochs.
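That early-stopping check sits at the end of each epoch; a sketch with the thresholds just described:

# End-of-epoch check: stop once both running accuracies exceed 90%
if running_accuracy > 90 and running_accuracy_val > 90:
    break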

3. Conclusion

In the end, the BERT architecture becomes far more practical and easier to use with the Transformers library. At the Jakarta Artificial Intelligence Research Center, an artificial intelligence research community in Jakarta, PyTorch and Transformers are the primary tools used to develop and run project experiments. We are also trying to build our own encoders to outperform existing architectures such as GPT and BERT.

Original link: HF transformers modeling – BimAnt