Week 16: Transformer architecture encoder code implementation (PyTorch)

  • Abstract
  • 1. Code implementation of the core part of Transformer encoder
    • 1.1 Import the required libraries
    • 1.2 Set the vocabulary size and maximum sequence length
    • 1.3 Word indices form the source and target sentences
    • 1.4 Construct the word embedding tables
    • 1.5 Construct position embedding
    • 1.6 Construct the encoder’s self-attention mask
    • 1.7 Transformer test code
  • Summary

Abstract

In the field of natural language processing, the Transformer model is a very popular deep learning model. Its core part is the Transformer encoder, which encodes the input sequence through a self-attention mechanism and a feed-forward neural network. This article walks through the core implementation code of the Transformer encoder; working through this code helps build a deep understanding of the principles and implementation details of the Transformer model, which is valuable for further research on and application of Transformers in natural language processing tasks. I hope this blog can help readers better understand the core implementation code of the Transformer encoder.

1. Code implementation of the core part of Transformer encoder

1.1 Import the required libraries

import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F

1.2 Set the vocabulary size and maximum sequence length

# Number of sequences in a batch
batch_size = 2

# Vocabulary size
max_num_src_words = 8
max_num_tgt_words = 8

# Embedding (model) dimension
model_dim = 8

# Maximum sequence length
max_src_seq_len = 5
max_tgt_seq_len = 5
max_position_len = 5

1.3 Word indices form the source and target sentences

  • src_len = torch.Tensor([2, 4]).to(torch.int32): defines the lengths of the two source sentences in the batch, the first with 2 tokens and the second with 4 (tgt_len = torch.Tensor([4, 3]) does the same for the target side).
  • src_seq = [torch.randint(1, max_num_src_words, (L,)) for L in src_len]: generates one tensor of random word indices per sentence, with the length given by the corresponding entry of src_len. The indices lie between 1 and max_num_src_words - 1, because the upper bound of torch.randint is exclusive and index 0 is reserved for padding.
  • src_seq = [F.pad(torch.randint(1, max_num_src_words, (L,)), (0, max_src_seq_len - L)) for L in src_len]: F.pad appends zeros on the right of each sentence up to max_src_seq_len, so the unused positions are filled with the padding index 0.
  • Finally, the padded sentences are stacked into one batch tensor: src_seq = torch.cat([torch.unsqueeze(F.pad(torch.randint(1, max_num_src_words, (L,)), (0, max_src_seq_len - L)), 0) for L in src_len]). torch.unsqueeze(..., 0) adds a batch dimension to each sentence and torch.cat concatenates them into a tensor of shape (2, max_src_seq_len). A small sanity check follows the complete code below.

Complete code:

src_len = torch.Tensor([2, 4]).to(torch.int32)
tgt_len = torch.Tensor([4, 3]).to(torch.int32)
src_seq = torch.cat(
    [torch.unsqueeze(F.pad(torch.randint(1, max_num_src_words, (L,)), (0, max_src_seq_len - L)), 0) for L in src_len])
tgt_seq = torch.cat(
    [torch.unsqueeze(F.pad(torch.randint(1, max_num_tgt_words, (L,)), (0, max_tgt_seq_len - L)), 0) for L in tgt_len])
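
A quick sanity check on the tensors built above (a minimal sketch; the word indices themselves are random, but the padding pattern is fixed by src_len):

print(src_seq.shape)                 # torch.Size([2, 5]): two sentences padded to max_src_seq_len
print((src_seq[0, 2:] == 0).all())   # tensor(True): the first sentence has length 2, the rest is padding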

1.4 Construct the word embedding tables

  • nn.Embedding(a, b) creates an embedding table of shape (a, b), where a is the number of distinct indices the layer can embed (here the vocabulary size plus one, to account for the padding index 0) and b is the dimension of each embedding vector. Passing integer indices into the layer looks up the corresponding rows of the table.
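
As a small standalone illustration (toy sizes of my own, not part of the original listing), an embedding layer with 4 possible indices and 3-dimensional vectors behaves like this:

toy_table = nn.Embedding(4, 3)                  # 4 possible indices, each mapped to a 3-dim vector
toy_vectors = toy_table(torch.tensor([0, 2]))   # look up rows 0 and 2 of the table
print(toy_vectors.shape)                        # torch.Size([2, 3])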

Complete code:

src_embedding_table = nn.Embedding(max_num_src_words + 1, model_dim)
print(src_embedding_table)
tgt_embedding_table = nn.Embedding(max_num_tgt_words + 1, model_dim)
src_embedding = src_embedding_table(src_seq)
print(src_embedding)
tgt_embedding = tgt_embedding_table(tgt_seq)
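
Since the embedding weights are randomly initialized on every run, the most useful thing to check is the shapes (a minimal sketch using the tensors defined above):

print(src_embedding.shape)   # torch.Size([2, 5, 8]) -> (batch, max_src_seq_len, model_dim)
print(tgt_embedding.shape)   # torch.Size([2, 5, 8]) -> (batch, max_tgt_seq_len, model_dim)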

1.5 Construct position embedding

  • torch.arange(max_position_len).view(-1, 1): torch.arange(max_position_len) produces a one-dimensional tensor containing the integers 0 to 4. view(-1, 1) then reshapes it into a two-dimensional column vector of shape (5, 1); -1 means that this dimension is inferred automatically from the total number of elements, and 1 means the second dimension has size 1. The result is stored in pos_mat.

  • torch.pow(10000, torch.arange(0, 8, 2).reshape((1, -1)) / model_dim): torch.arange(0, 8, 2) generates a one-dimensional tensor starting at 0 with a step of 2 and excluding 8, i.e. [0, 2, 4, 6]. reshape((1, -1)) turns it into a two-dimensional row vector of shape (1, 4), where -1 again means the size is inferred automatically. This tensor is divided by model_dim and used as the exponent of 10000 via torch.pow. The result, assigned to i_mat, is a (1, 4) tensor holding the denominators 10000^(2i/d_model) used by the sinusoidal position encoding.

  • pe_embedding_table = torch.zeros(max_position_len, model_dim): torch.zeros creates a tensor whose elements are all initialized to zero; here it allocates a 5-row, 8-column table that will hold the position encodings.

  • pe_embedding_table[:, 0::2] = torch.sin(pos_mat / i_mat): the slice [:, 0::2] selects all rows of pe_embedding_table and every second column starting from column 0. pos_mat / i_mat is computed with broadcasting, torch.sin is applied element-wise, and the results are written into those even-numbered columns. The following line fills the odd-numbered columns with the corresponding cosine values.

  • pe_embedding = nn.Embedding(max_position_len, model_dim): max_position_len is the number of positions the layer can embed, and model_dim is the dimension of each embedding vector.

  • nn.Parameter(pe_embedding_table, requires_grad=False): wraps the precomputed pe_embedding_table as an nn.Parameter and assigns it to the weight of pe_embedding. requires_grad=False means no gradients are computed for these weights, i.e. the position encodings remain fixed during training.

  • pe_embedding = nn.Embedding(max_position_len, model_dim)
    pe_embedding.weight = nn.Parameter(pe_embedding_table, requires_grad=False)
    These two lines install the precomputed position-encoding table pe_embedding_table as the weight of the nn.Embedding layer, so that looking up a position index returns its position encoding. In this way the model can use these encodings to represent the position of every element in the sequence. After the assignment, pe_embedding.weight is identical to the pe_embedding_table built above.

  • src_pos = torch.cat([torch.unsqueeze(torch.arange(max(src_len)), 0) for _ in src_len]).to(torch.int32): max() returns the largest value in src_len; torch.arange() generates the position indices 0 to max(src_len) - 1 for every sentence; torch.unsqueeze(..., 0) adds a batch dimension to each of these rows; and torch.cat() concatenates them along dimension 0 into the final src_pos tensor (a short demo follows this list).

  • src_pe_embedding = pe_embedding(src_pos): feeds the position indices in src_pos through the pe_embedding layer defined above, mapping every index to its position-encoding vector; the result is saved as src_pe_embedding.
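
Because src_len is fixed to [2, 4], src_pos is deterministic; printing it (once the complete code below has run) gives:

print(src_pos)
# tensor([[0, 1, 2, 3],
#         [0, 1, 2, 3]], dtype=torch.int32)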

Complete code:

pos_mat = torch.arange(max_position_len).view(-1, 1)
i_mat = torch.pow(10000, torch.arange(0, 8, 2).reshape((1, -1)) / model_dim)
pe_embedding_table = torch.zeros(max_position_len, model_dim)
pe_embedding_table[:, 0::2] = torch.sin(pos_mat / i_mat)
pe_embedding_table[:, 1::2] = torch.cos(pos_mat / i_mat)
pe_embedding = nn.Embedding(max_position_len, model_dim)
pe_embedding.weight = nn.Parameter(pe_embedding_table, requires_grad=False)
src_pos = torch.cat([torch.unsqueeze(torch.arange(max(src_len)), 0) for _ in src_len]).to(torch.int32)
tgt_pos = torch.cat([torch.unsqueeze(torch.arange(max(tgt_len)), 0) for _ in tgt_len]).to(torch.int32)
src_pe_embedding = pe_embedding(src_pos)
tgt_pe_embedding = pe_embedding(tgt_pos)
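
As a quick check (a minimal sketch), one entry of pe_embedding_table can be compared with the standard sinusoidal formula PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)):

pos, i = 3, 1                                  # position 3, frequency index i = 1 -> columns 2 and 3
denom = 10000 ** (2 * i / model_dim)
assert abs(pe_embedding_table[pos, 2 * i].item() - np.sin(pos / denom)) < 1e-5
assert abs(pe_embedding_table[pos, 2 * i + 1].item() - np.cos(pos / denom)) < 1e-5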

1.6 Construct the encoder’s self-attention mask

  • valid_encoder_pos = [F.pad(torch.ones(L), (0, max(src_len)-L)) for L in src_len]: for each length L in src_len, torch.ones(L) creates a tensor of L ones and F.pad(..., (0, max(src_len)-L)) appends zeros on the right until the length equals max(src_len); the pair (0, max(src_len)-L) means "pad 0 elements on the left and max(src_len)-L elements on the right". The result is a list with one mask per sentence, containing 1 at valid encoder positions and 0 at padded positions. These masks are later used to exclude the invalid positions from the attention computation.
  • valid_encoder_pos = torch.unsqueeze(torch.cat([torch.unsqueeze(F.pad(torch.ones(L), (0, max(src_len)-L)), 0) for L in src_len]), 2): the per-sentence masks from above are stacked with torch.cat, and an additional torch.unsqueeze on dimension 2 turns the resulting two-dimensional tensor into a three-dimensional one. Each sequence now has its valid-position mask as a column vector, with a shape that is compatible with the other three-dimensional tensors used below.
  • valid_encoder_pos_matrix = torch.bmm(valid_encoder_pos, valid_encoder_pos.transpose(1,2)): torch.bmm() performs batched matrix multiplication on valid_encoder_pos and produces the result tensor valid_encoder_pos_matrix. Specifically, valid_encoder_pos is a three-dimensional tensor of shape (batch_size, max(src_len), 1) holding the encoder-side valid-position mask, and valid_encoder_pos.transpose(1,2) swaps its 1st and 2nd dimensions, giving shape (batch_size, 1, max(src_len)).

    For each sample, torch.bmm() therefore multiplies a 4 × 1 column by a 1 × 4 row, and the result is a 4 × 4 matrix whose entry (i, j) equals 1 exactly when both position i and position j are valid.
  • invalid_encoder_pos_matrix = 1 - valid_encoder_pos_matrix: the invalid-position matrix is obtained by subtracting every element of the valid-position matrix from 1, so that invalid query/key pairs are marked with 1.
  • invalid_encoder_pos_matrix.to(torch.bool): converts the elements of the invalid-position matrix to Boolean values, as required by masked_fill below.
  • score = torch.randn(batch_size, max(src_len), max(src_len)): torch.randn() generates a tensor named score filled with random values. (Note: no model is trained here, so the attention scores are simply random.) Its shape is (batch_size, max(src_len), max(src_len)), i.e. for each of the batch_size samples a max(src_len) × max(src_len) score matrix.
  • masked_score = score.masked_fill(mask_encoder_self_attention, -np.inf): masked_fill replaces every position in score where the Boolean mask mask_encoder_self_attention is True with negative infinity (-np.inf). This is the standard masking step of the self-attention mechanism: positions that should not be attended to receive a score of -inf, so their weights become exactly zero after the softmax and they are effectively ignored (a small standalone example follows this list).
  • prob = F.softmax(masked_score, -1): applies a softmax normalization over the last dimension, turning the masked scores into attention weights.
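
To see the effect of the -np.inf fill in isolation, here is a tiny standalone example (toy tensors of my own, not part of the original code):

toy_score = torch.tensor([0.5, 1.2, -0.3, 0.8])
toy_mask = torch.tensor([False, False, True, True])            # pretend the last two positions are padding
print(F.softmax(toy_score.masked_fill(toy_mask, -np.inf), dim=-1))
# tensor([0.3318, 0.6682, 0.0000, 0.0000]) -> masked positions receive exactly zero weight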

Complete code:

valid_encoder_pos = torch.unsqueeze(torch.cat([torch.unsqueeze(F.pad(torch.ones(L),(0, max(src_len)-L)),0) for L in src_len]), 2)
valid_encoder_pos_matrix = torch.bmm(valid_encoder_pos, valid_encoder_pos.transpose(1,2))
invalid_encoder_pos_matrix = 1-valid_encoder_pos_matrix
mask_encoder_self_attention = invalid_encoder_pos_matrix.to(torch.bool)
score = torch.randn(batch_size, max(src_len),max(src_len) )
masked_score = score.masked_fill(mask_encoder_self_attention, -np.inf)
prob = F.softmax(masked_score, -1)
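
A brief check of the result (a minimal sketch; the scores are random, so only the shape and the zeroed columns are reproducible):

print(prob.shape)   # torch.Size([2, 4, 4]) -> (batch, max(src_len), max(src_len))
print(prob[0, 0])   # weights of the first query in sample 0; the last two entries are 0 (padded positions)
# Note: rows whose query position is itself padding (e.g. prob[0, 2]) come out as NaN,
# because every key is masked with -inf; those rows are simply not used afterwards.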

1.7 Transformer test code

import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F

# Word embedding, using sequence modelling as the example
# We consider a source sentence and a target sentence
# Each sequence is represented by the indices of its tokens in the vocabulary
# Word embedding pipeline:
# 1. Convert the raw text into numbers: each word becomes its index in the dictionary (index 0 is reserved for padding)
# 2. Build the batch (pad all sequences to the same length)
# 3. Look up the embeddings

# Number of sequences in a batch
batch_size = 2

# Vocabulary size
max_num_src_words = 8
max_num_tgt_words = 8

# Embedding (model) dimension
model_dim = 8

# Maximum sequence length
max_src_seq_len = 5
max_tgt_seq_len = 5
max_position_len = 5

# Source sequence lengths: two sentences, the first of length 2 and the second of length 4
src_len = torch.Tensor([2, 4]).to(torch.int32)

# Target sequence lengths: two sentences, the first of length 4 and the second of length 3
tgt_len = torch.Tensor([4, 3]).to(torch.int32)

# Build the source and target sentences from word indices, batch them, and pad with 0
src_seq = torch.cat(
    [torch.unsqueeze(F.pad(torch.randint(1, max_num_src_words, (L,)), (0, max_src_seq_len - L)), 0) for L in src_len])
tgt_seq = torch.cat(
    [torch.unsqueeze(F.pad(torch.randint(1, max_num_tgt_words, (L,)), (0, max_tgt_seq_len - L)), 0) for L in tgt_len])
# print(src_seq)

# Construct the word embedding tables
src_embedding_table = nn.Embedding(max_num_src_words + 1, model_dim)
tgt_embedding_table = nn.Embedding(max_num_tgt_words + 1, model_dim)
src_embedding = src_embedding_table(src_seq)
tgt_embedding = tgt_embedding_table(tgt_seq)

# Construct position embedding
pos_mat = torch.arange(max_position_len).view(-1, 1)
i_mat = torch.pow(10000, torch.arange(0, 8, 2).reshape((1, -1)) / model_dim)
pe_embedding_table = torch.zeros(max_position_len, model_dim)
pe_embedding_table[:, 0::2] = torch.sin(pos_mat / i_mat)
pe_embedding_table[:, 1::2] = torch.cos(pos_mat / i_mat)
pe_embedding = nn.Embedding(max_position_len, model_dim)
pe_embedding.weight = nn.Parameter(pe_embedding_table, requires_grad=False)
src_pos = torch.cat([torch.unsqueeze(torch.arange(max(src_len)), 0) for _ in src_len]).to(torch.int32)
tgt_pos = torch.cat([torch.unsqueeze(torch.arange(max(tgt_len)), 0) for _ in tgt_len]).to(torch.int32)
src_pe_embedding = pe_embedding(src_pos)
tgt_pe_embedding = pe_embedding(tgt_pos)

# Construct the encoder's self-attention mask
valid_encoder_pos = torch.unsqueeze(torch.cat([torch.unsqueeze(F.pad(torch.ones(L),(0, max(src_len)-L)),0) for L in src_len]),2)

valid_encoder_pos_matrix = torch.bmm(valid_encoder_pos, valid_encoder_pos.transpose(1,2))
invalid_encoder_pos_matrix = 1-valid_encoder_pos_matrix
mask_encoder_self_attention = invalid_encoder_pos_matrix.to(torch.bool)

score = torch.randn(batch_size, max(src_len),max(src_len) )
masked_score = score.masked_fill(mask_encoder_self_attention, -np.inf)
prob = F.softmax(masked_score, -1)

# Construct the intra-attention (encoder-decoder cross-attention) mask
# [Note] intra-attention here refers to the attention that connects the encoder and the decoder.
valid_encoder_pos = torch.unsqueeze(torch.cat([torch.unsqueeze(F.pad(torch.ones(L),(0, max(src_len)-L)),0) for L in src_len]),2)
valid_decoder_pos = torch.unsqueeze(torch.cat([torch.unsqueeze(F.pad(torch.ones(L),(0, max(tgt_len)-L)),0) for L in tgt_len]),2)

# Valid (decoder position, encoder position) pairs between the target and source sequences
valid_cross_pos_matrix = torch.bmm(valid_decoder_pos, valid_encoder_pos.transpose(1,2))
invalid_cross_pos_matrix = 1 - torch.bmm(valid_decoder_pos, valid_encoder_pos.transpose(1,2))
mask_cross_attention = invalid_cross_pos_matrix.to(torch.bool)
# Construct the decoder self-attention (causal) mask
valid_decoder_tri_matrix = torch.cat(
    [
        torch.unsqueeze(
            F.pad(
                torch.tril(torch.ones(L, L)),               # lower-triangular: position i may attend to positions <= i
                (0, max(tgt_len) - L, 0, max(tgt_len) - L)  # pad both dimensions up to max(tgt_len)
            ), 0) for L in tgt_len
    ]
)
invalid_decoder_tri_matrix = (1-valid_decoder_tri_matrix).to(torch.bool)
# score = torch.randn(batch_size, max(tgt_len), max(tgt_len))
# masked_score = score.masked_fill(invalid_decoder_tri_matrix, -1e9)
# prob = F.softmax(masked_score, -1)

# Scaled dot-product attention
def scaled_dot_product_attention(Q, K, V, attn_mask):
    # (batch, len_q, d) x (batch, d, len_k) -> (batch, len_q, len_k), scaled by sqrt(model_dim)
    score = torch.bmm(Q, K.transpose(-2, -1)) / np.sqrt(model_dim)
    masked_score = score.masked_fill(attn_mask, -1e9)  # mask out invalid positions with a large negative value
    prob = F.softmax(masked_score, -1)                 # attention weights
    context = torch.bmm(prob, V)                       # weighted sum of the values
    return context
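
The scaled_dot_product_attention helper above is never called in the listing. A minimal usage sketch (my addition; it reuses the encoder self-attention mask built earlier together with random Q/K/V of matching shape):

Q = K = V = torch.randn(batch_size, max(src_len), model_dim)
context = scaled_dot_product_attention(Q, K, V, mask_encoder_self_attention)
print(context.shape)   # torch.Size([2, 4, 8]) -> (batch, max(src_len), model_dim)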

Summary

This week, with my various degree courses coming to an end, I spent most of my time on coursework. As a result I did not get as far with the Transformer code implementation as planned and did not reach the goals I had set. I will arrange my time more reasonably next week and do my best to catch up.