Customizing a HuggingFace model head

In this article we’ll cover how to adapt a HuggingFace model to your own task: build a custom model head in PyTorch, connect it to the body of an HF model, and train the system end to end.

1. HF model head and model body

This is what a typical HF model looks like:

Why treat the model head and the model body separately?

Some HF models are trained for downstream tasks (such as question answering or text classification) and encode knowledge about the data their weights were trained on.

Sometimes, especially when our task has little data or is domain specific (e.g. a medical or sports task), we can reuse models from the Hub that were trained on other tasks in the same domain (not necessarily the same task as ours) and borrow their validated knowledge to improve performance on our own task.

  • A very simple example: suppose we have a small dataset and want to classify financial statements as positive or negative. On the HF Hub we find many models trained on finance-related question answering datasets, so we can reuse some of their layers to improve our own task.
  • Another simple example: a domain-specific model is trained on a huge dataset and learns to classify text into 5 categories. Suppose we have a similar classification task on a completely different dataset in the same domain, but only need 2 categories instead of 5. We can still reuse the model body and add our own model head, bringing that domain knowledge into our own task.

This is a diagram of what we’re going to do:

2. Customized HF model head

Our task is simple: sarcasm detection on this dataset from Kaggle.

You can view the complete code here. In the interest of time, I have not included the preprocessing and some training details below, so be sure to check out the entire code notebook.

I will use a model trained on a large number of tweets, with 5 classification outputs for different emotion types. We will extract the model body, add a custom layer on top (2 labels: sarcastic/not sarcastic) in PyTorch, and train a new model.

Note: You can use any model for this example (not necessarily one trained for classification), since we will only keep the model body and discard the model head.
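
To see the head/body split concretely, here is a quick optional sketch (not part of the original notebook) that loads the same checkpoint both ways: AutoModelForSequenceClassification returns the body plus the original task head, while AutoModel returns the body alone. The prints are only for inspection.

from transformers import AutoModel, AutoModelForSequenceClassification

checkpoint = "cardiffnlp/twitter-roberta-base-emotion"

# body + the task-specific head the checkpoint was trained with
full_model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
print(full_model.config.num_labels)   # number of labels the original head predicts
print(type(full_model.classifier))    # the head module we are going to replace

# body only: no task head on top
body_only = AutoModel.from_pretrained(checkpoint)
print(type(body_only))                # e.g. RobertaModel for this checkpoint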

This is our workflow:

I’ll skip the data preprocessing step and jump directly to the main class, but you can view the entire code at the link at the beginning of this section.

3. Tokenization and dynamic padding

Use the following code to convert the text into tokens and pad the batches dynamically:

checkpoint = "cardiffnlp/twitter-roberta-base-emotion"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.model_max_len=512

def tokenize(batch):
  return tokenizer(batch["headline"], truncation=True,max_length=512)

tokenized_dataset = data.map(tokenize, batched=True)
print(tokenized_dataset)

tokenized_dataset.set_format("torch",columns=["input_ids", "attention_mask", "label"])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

The result is as follows:

DatasetDict({
    train: Dataset({
        features: ['headline', 'label', 'input_ids', 'attention_mask'],
        num_rows: 22802
    })
    test: Dataset({
        features: ['headline', 'label', 'input_ids', 'attention_mask'],
        num_rows: 2851
    })
    valid: Dataset({
        features: ['headline', 'label', 'input_ids', 'attention_mask'],
        num_rows: 2850
    })
})
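
The collator is what actually performs the dynamic padding: handed to a PyTorch DataLoader as collate_fn, it pads each batch only to the length of its longest item. Here is a minimal sketch of the train_dataloader and eval_dataloader that the training loop below expects (building them this way, with a batch size of 32 and the valid split for evaluation, is an assumption consistent with the step counts shown later):

from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_dataset["train"], shuffle=True, batch_size=32, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_dataset["valid"], batch_size=32, collate_fn=data_collator
)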

4. Extract the model body and add our own layers

The code is shown below:

import torch.nn as nn
from transformers import AutoModel, AutoConfig
from transformers.modeling_outputs import TokenClassifierOutput

class CustomModel(nn.Module):
  def __init__(self, checkpoint, num_labels):
    super(CustomModel, self).__init__()
    self.num_labels = num_labels

    # Load the model with the given checkpoint and keep only its body
    self.model = AutoModel.from_pretrained(
        checkpoint,
        config=AutoConfig.from_pretrained(
            checkpoint, output_attentions=True, output_hidden_states=True
        ),
    )
    # Custom head: dropout + a freshly initialized linear classifier
    self.dropout = nn.Dropout(0.1)
    self.classifier = nn.Linear(768, num_labels)  # 768 = hidden size of the base model

  def forward(self, input_ids=None, attention_mask=None, labels=None):
    # Extract outputs from the body
    outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)

    # Apply the custom head to the first token of the last hidden state
    sequence_output = self.dropout(outputs[0])  # outputs[0] = last hidden state
    logits = self.classifier(sequence_output[:, 0, :].view(-1, 768))

    loss = None
    if labels is not None:
      loss_fct = nn.CrossEntropyLoss()
      loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

    return TokenClassifierOutput(
        loss=loss,
        logits=logits,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )

As you can see, we first inherit from PyTorch’s nn.Module and use AutoModel (from the transformers library) to load the given checkpoint and extract its body.

Note that the forward() method returns a TokenClassifierOutput, which keeps our output format consistent with that of HF pre-trained models.
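
As a quick sanity check (not in the original notebook), we can instantiate the custom model with 2 labels and push one collated batch through it; the logits should have shape (batch_size, 2), and a loss is returned because the collator renames the "label" column to "labels":

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = CustomModel(checkpoint=checkpoint, num_labels=2).to(device)

# build one small batch by hand with the same collator used by the dataloaders
batch = data_collator([tokenized_dataset["train"][i] for i in range(4)])
batch = {k: v.to(device) for k, v in batch.items()}

outputs = model(**batch)
print(outputs.logits.shape)  # expected: torch.Size([4, 2])
print(outputs.loss)          # cross-entropy loss computed inside forward()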

5. End-to-end training of new models
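
The optimizer, scheduler, and metric setup is part of the details omitted from the article, so here is a minimal sketch of what the loop below assumes, using AdamW, a linear schedule, and the F1 metric from the evaluate library (num_epochs = 3 and a batch size of 32 are consistent with the step counts in the output further down; the learning rate is illustrative). The model, dataloaders, and device come from the sketches above or the full notebook.

from torch.optim import AdamW
from transformers import get_scheduler
import evaluate

num_epochs = 3
optimizer = AdamW(model.parameters(), lr=5e-5)

num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

metric = evaluate.load("f1")  # the results below report F1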

With that in place, the end-to-end training and evaluation loop looks like this:

from tqdm.auto import tqdm

progress_bar_train = tqdm(range(num_training_steps))
progress_bar_eval = tqdm(range(num_epochs * len(eval_dataloader)))


for epoch in range(num_epochs):
  model.train()
  for batch in train_dataloader:
      batch = {k: v.to(device) for k, v in batch.items()}
      outputs = model(**batch)
      loss = outputs.loss
      loss.backward()

      optimizer.step()
      lr_scheduler.step()
      optimizer.zero_grad()
      progress_bar_train.update(1)

  model.eval()
  for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
    progress_bar_eval.update(1)
    
  print(metric.compute())

test_dataloader = DataLoader(
    tokenized_dataset["test"], batch_size=32, collate_fn=data_collator
)

for batch in test_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

The result is as follows:

 0%| | 0/2139 [00:00<?, ?it/s]
  0%| | 0/270 [00:00<?, ?it/s]
{'f1': 0.9335347432024169}
{'f1': 0.9360090874668686}
{'f1': 0.9274912756882513}

As you can see, we achieved decent performance using this approach. Keep in mind that the purpose of this blog is not to analyze performance on this specific dataset, but to learn how to use a pretrained body and add a custom head.

6. Conclusion

In this article we saw how to add custom layers on top of the HF pre-trained model.

Some takeaways:

  • This technique is particularly useful in situations where we have a domain-specific dataset and want to leverage a model trained on the same domain (task-agnostic) to enhance performance on a small dataset.
  • We can choose a model that has been trained on a downstream task different from our own and still use the knowledge of the model body.
  • If your dataset is large and general enough, this may not be needed at all; in that case you can simply use AutoModelForSequenceClassification (or the auto class for whatever other task BERT-style models already solve). In fact, if that’s the case, I strongly recommend against building your own model head; the short sketch below is all you need.
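
For completeness, a minimal sketch of that simpler route (the checkpoint variable is the one used throughout this article; ignore_mismatched_sizes is needed because the checkpoint’s original head predicts a different number of labels):

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2,
    ignore_mismatched_sizes=True,  # drop the original head and initialize a fresh 2-label head
)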

Original link: HF custom model head – BimAnt