BPE, WordPiece and SentencePiece

1. Background and foundation

When feeding text into models such as GPT or BERT, tokenization is performed first. What exactly are the goals of tokenization, and at what granularity does it operate? There are also several families of tokenization algorithms, each with its own advantages and disadvantages. This article summarizes each method along with practical examples.

The goal of tokenization is to cut the input text stream into substrings, each of which carries relatively complete semantics, making it easier to learn embedding representations and to feed the result to downstream models.

Tokenization has three granularities: word / subword / char.

  • word: the word is the most natural unit of language. Natural languages such as English have natural separators (spaces and some punctuation marks), so word segmentation is relatively easy. For some East Asian scripts, including Chinese, a word segmentation algorithm is needed. As an aside, in the Tokenizers library the rule-based segmentation part uses two libraries: spaCy and Moses. If the vocabulary is built from whole words, it can become very large because of the long-tail phenomenon; for example, Transformer-XL uses a vocabulary of about 267,000 words. This requires a huge embedding matrix (the matrix used to look up each token's embedding vector), which is a serious challenge for memory or GPU memory. A conventional vocabulary generally does not exceed 50,000 entries.

  • char: the most basic characters, such as ‘a’, ‘b’, ‘c’ in English, or single Chinese characters such as 你 (‘you’), 我 (‘I’), 他 (‘he’). The number of distinct characters is generally small. The problem is that, precisely because there are so few characters, each character's embedding vector has to carry too much semantics, which makes it very hard to learn.

  • subword: a level between characters and words. For example, ‘Transformers’ might be split into the two parts ‘Transform’ and ‘ers’. This scheme balances vocabulary size and semantic independence and is the relatively better choice. Its guiding principle is that common words should remain intact, while rare words should be split into subwords so that they share tokens and compress the vocabulary space.

2. Commonly used tokenize algorithms

The three most commonly used tokenization algorithms are BPE (Byte-Pair Encoding), WordPiece and SentencePiece.

2.1 Byte-Pair Encoding (BPE) / Byte-level BPE

2.1.1 BPE

BPE, byte-pair encoding. The core idea is to repeatedly merge the most frequently occurring pair of subwords until the vocabulary reaches a predetermined size.

  • First, it relies on a pretokenizer to complete preliminary segmentation. The pretokenizer can be simply space-based or rule-based;

  • After pre-tokenization, the frequency of each word is counted for later use. For example, suppose we counted the frequencies of 5 words:

(“hug”, 10), (“pug”, 5), (“pun”, 12), (“bun”, 4), (“hugs”, 5)

  • Build a basic vocabulary, including all characters, namely:

[“b”, “g”, “h”, “n”, “p”, “s”, “u”]

  • We then repeatedly look at adjacent pairs of existing symbols and add the highest-frequency combination to the vocabulary, one merge at a time, until the vocabulary reaches the predetermined size. For example, the frequencies of the merges ug / un / hug come out to 20, 16 and 15 respectively, and they are added to the vocabulary in that order.
  • The size of the final vocabulary = size of the basic character vocabulary + number of merges. For example, GPT's vocabulary size is 40,478 = 478 basic characters + 40,000 merges. After the merges above, our vocabulary becomes:

[“b”, “g”, “h”, “n”, “p”, “s”, “u”, “ug”, “un”, “hug”]

In actual use, if an unknown character is encountered, a special unknown token such as <unk> is used to represent it.
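
To make the merge loop concrete, here is a minimal Python sketch of BPE training on the toy word-frequency table above (an illustrative re-implementation, not the code of any particular library):

from collections import Counter

# Toy corpus from the example above, as word -> frequency.
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

# Start from the character-level vocabulary; each word is kept as a list of symbols.
splits = {word: list(word) for word in word_freqs}
vocab = sorted({ch for word in word_freqs for ch in word})

def count_pairs(splits, word_freqs):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in word_freqs.items():
        symbols = splits[word]
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

for _ in range(3):  # 3 merges, i.e. the predetermined vocabulary growth
    pairs = count_pairs(splits, word_freqs)
    best = max(pairs, key=pairs.get)       # most frequent pair, e.g. ('u', 'g') with count 20
    vocab.append("".join(best))
    for symbols in splits.values():        # apply the merge inside every word
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == best:
                symbols[i:i + 2] = ["".join(best)]
            else:
                i += 1

print(vocab)  # ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']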

2.1.2 Byte-level BPE

One problem with BPE is that, when the corpus contains many Unicode characters, the basic character set can become very large. One way to deal with this is to use a single byte as a “character”, regardless of how many bytes the actual encoding uses per character. The size of the basic character set is then fixed at 256.

For example, GPT-2's vocabulary size is 50,257 = 256 byte-level base tokens + 1 special end-of-text token (<|endoftext|>) + 50,000 merges.
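
A quick way to see why the byte-level base vocabulary never needs more than 256 symbols is to look at the UTF-8 bytes of a rare character (plain Python, for illustration only):

# Every Unicode character is encoded as a sequence of bytes, each in the range 0-255.
rare = "龖"
print(list(rare.encode("utf-8")))              # [233, 190, 150]
print([hex(b) for b in rare.encode("utf-8")])  # ['0xe9', '0xbe', '0x96']
# These three bytes are exactly the <0xE9>, <0xBE>, <0x96> tokens seen in section 3.2.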

2.2 WordPiece

WordPiece, as the name suggests, is a subword-level tokenization algorithm. Many famous Transformer models, such as BERT, DistilBERT and ELECTRA, use it.

Its principle is very close to BPE. The difference is that, when merging, it does not simply pick the most frequent pair; instead it picks the merge that most increases the likelihood of the training data. That is, each time two symbols A and B are merged, the pair chosen is the one with the largest value of

\frac{P(AB)}{P(A)P(B)}

After the merge, every occurrence that was previously tokenized as the two tokens A + B keeps only the single token AB. The resulting change in the log-likelihood of the whole training set is proportional to

\log\frac{P(AB)}{P(A)P(B)}
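
The difference in the merge criterion is easy to show on the same toy corpus. The sketch below (a simplified illustration that ignores the ## continuation prefix used by real WordPiece vocabularies) scores each candidate pair by count(AB) / (count(A) * count(B)) as a stand-in for P(AB) / (P(A)P(B)):

from collections import Counter

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
splits = {word: list(word) for word in word_freqs}

pair_counts, symbol_counts = Counter(), Counter()
for word, freq in word_freqs.items():
    symbols = splits[word]
    for s in symbols:
        symbol_counts[s] += freq
    for a, b in zip(symbols, symbols[1:]):
        pair_counts[(a, b)] += freq

# WordPiece-style score: frequency of the pair divided by the frequencies of its parts.
scores = {pair: c / (symbol_counts[pair[0]] * symbol_counts[pair[1]])
          for pair, c in pair_counts.items()}

print(max(pair_counts, key=pair_counts.get))  # BPE would merge ('u', 'g'), the most frequent pair
print(max(scores, key=scores.get))            # WordPiece prefers ('g', 's'): its parts are rare elsewhere

On this toy data the two criteria pick different merges: 'g' and 's' almost only ever occur together, so their pair gets the highest WordPiece score even though it is not the most frequent pair.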

2.3 Unigram

Different from BPE and WordPiece, the idea of the Unigram algorithm is to start from a huge vocabulary and then gradually trim it down by deleting entries until the vocabulary reaches the predefined size.

The initial vocabulary can consist of all the words produced by the pre-tokenizer, plus all high-frequency substrings.
The principle for deleting entries from the vocabulary at each step is to keep the increase of a predefined loss as small as possible. During training, the loss is computed as:

Loss=-\sum^{N}_{i=1}\log\left(\sum_{x\in S(x_i)}p(x)\right)

Assume that the words in the training corpus are x_1, x_2, \dots, x_N, and that the set of ways to tokenize word x_i is S(x_i). Once the vocabulary is fixed, the set S(x_i) of possible tokenizations of each word is determined, and each tokenization x \in S(x_i) has a probability p(x). If some entries are deleted from the vocabulary, the tokenization sets of some words shrink, the number of terms summed inside the log(*) decreases, and the overall loss therefore increases.

At each iteration, the Unigram algorithm selects the 10% to 20% of vocabulary entries whose removal would cause the smallest increase in loss, and deletes them.
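
As a concrete illustration of the loss above, the following sketch enumerates the tokenization set S(x) of each word and computes the loss for a tiny, hypothetical vocabulary with made-up probabilities (all names and numbers here are assumptions for illustration only):

import math

# Hypothetical unigram probabilities for a tiny vocabulary (made-up values).
vocab_probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05,
               "hu": 0.10, "ug": 0.15, "hug": 0.30, "gs": 0.05}

def segmentations(word, vocab):
    # Enumerate every way S(x) of splitting `word` into pieces that exist in `vocab`.
    if not word:
        return [[]]
    result = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            for rest in segmentations(word[end:], vocab):
                result.append([piece] + rest)
    return result

def unigram_loss(word_freqs, vocab_probs):
    # Loss = -sum over word occurrences of log( sum over segmentations of the product of piece probabilities ).
    loss = 0.0
    for word, freq in word_freqs.items():
        total = sum(math.prod(vocab_probs[p] for p in seg)
                    for seg in segmentations(word, vocab_probs))
        loss -= freq * math.log(total)
    return loss

word_freqs = {"hug": 10, "hugs": 5}
base = unigram_loss(word_freqs, vocab_probs)
# Dropping a candidate token and recomputing shows how much the loss would increase.
without_hug = {k: v for k, v in vocab_probs.items() if k != "hug"}
print(base, unigram_loss(word_freqs, without_hug) - base)
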
Generally, the Unigram algorithm is used in conjunction with the SentencePiece algorithm.

2.4 SentencePiece

SentencePiece, as the name suggests, treats the sentence as a whole and splits it into fragments, without relying on the notion of natural words. It generally treats the space as an ordinary symbol and then uses the BPE or Unigram algorithm to build the vocabulary.

For example, XLNetTokenizer uses the special underline character ▁ in place of spaces, and it is turned back into a space when decoding.
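
This is easy to check with the Transformers library (the printed tokens are illustrative; running this downloads the pretrained SentencePiece model):

from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
print(tokenizer.tokenize("Hello World"))
# e.g. ['▁Hello', '▁World'] -- the space is folded into the ▁ at the start of each word
print(tokenizer.convert_tokens_to_string(["▁Hello", "▁World"]))
# 'Hello World' -- ▁ is turned back into a space when decoding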

Currently, in the Tokenizers library, all of the tokenizers that use SentencePiece use it together with the Unigram algorithm, for example ALBERT, XLNet, Marian and T5.

3. Segmentation examples and code analysis

3.1 BertTokenizer/ WordPiece

First, let's try BertTokenizer, which is based on the WordPiece algorithm. The vocabulary size of the bert-base-chinese version used here is 21,128.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
tokens = tokenizer.tokenize(text)  # text is the input string to be segmented

The segmentation effect is:

Tokenizer:
Text: The problems of your past are your business. The problems of your future are my privilege.
Tokens: [UNK],pro,##ble,##ms,of,your,pa,##st,are,your,business,.,[UNK],pro,##ble,##ms,of, your,future,are,my,pr,##i,##vi,##le,##ge,.

Text: I don’t want to get involved in your past, that’s your business. I hope to be a part of your future, it is an honor.
Tokens: (the Chinese sentence is split into single-character tokens; the original Chinese output is shown here only in translation)

Text: Don’t make the user feel stupid.
Tokens: [UNK],[UNK],t,make,the,user,feel,st,##up,##id,.

Text: The Chinese Language Institute officially announced that the crown of “Chinese character with the most strokes” belongs to the character “龖(dá)”!
Tokens: mostly single Chinese characters, with rare characters such as 龖 and some symbols replaced by [UNK] (the original Chinese output is shown here only in translation)

In the output above:

  • In BertTokenizer, the ## prefix marks a non-initial subword. For example, ‘problems’ in the first sentence is split into three pieces: pro / ##ble / ##ms;
  • Tokens not in the vocabulary, such as some punctuation marks and rare characters, are replaced by [UNK];
  • Chinese is split into single characters throughout; no multi-character tokens appear.

The word segmentation process and code analysis are as follows:
The BertTokenizer class hierarchy is shown as a diagram in the original post (omitted here).

BertTokenizer mainly does two things:

  1. Perform basic tokenization of the input text, controlled by its parameters (basic_tokenizer)
  2. Split each of the resulting words again into subwords (wordpiece_tokenizer)

basic_tokenizer splits a sentence into words; see the source code for the details.

Special attention should be paid to line 401: if the tokenize_chinese_chars parameter is True, all Chinese text is split down to the character level! The never_split parameter that is passed in will not prevent these Chinese characters from being split.

wordpiece_tokenizer further cuts each word into subwords, e.g. doing -> [‘do’, ‘##ing’].

The method is to greedily match each word against the BERT vocabulary, longest match first (similar to Jieba's forward maximum matching); every piece after the first match is prefixed with ‘##’.
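
The matching logic can be sketched as follows (a simplified re-implementation in the spirit of BERT's WordpieceTokenizer, with a hypothetical toy vocabulary; not the library's actual code):

def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first: repeatedly take the longest prefix found in the vocabulary.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # non-initial pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]                       # nothing matched: the whole word becomes [UNK]
        pieces.append(piece)
        start = end
    return pieces

vocab = {"do", "##ing", "un", "##want", "##ed"}
print(wordpiece_tokenize("doing", vocab))      # ['do', '##ing']
print(wordpiece_tokenize("unwanted", vocab))   # ['un', '##want', '##ed']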

As for Chinese, since it has already been split into single characters in the previous step, this step makes no further changes.

3.2 T5Tokenizer / SentencePiece

The T5 model is based on SentencePiece; let's look at its segmentation results. The version used here has a vocabulary size of 250,112.
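
The tokens below can be reproduced roughly as follows (assuming an mT5 checkpoint such as google/mt5-base, since the 250,112 vocabulary matches mT5; the exact checkpoint used in the original post is not stated):

from transformers import T5Tokenizer

# google/mt5-base is an assumption; it ships a SentencePiece/Unigram vocabulary of ~250k pieces.
tokenizer = T5Tokenizer.from_pretrained("google/mt5-base")
print(tokenizer.tokenize("Don't make the user feel stupid."))
# should produce tokens like those listed below (▁Don, ', t, ▁make, ...)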

Tokenizer:
Text: The problems of your past are your business. The problems of your future are my privilege.
Tokens: ▁The, ▁problems, ▁of, ▁your, ▁past, ▁are, ▁your, ▁business, ., ▁The, ▁problems, ▁of, ▁your, ▁future, ▁are, ▁my, ▁privilege, .

Text: I don’t want to get involved in your past, that’s your business. I hope to be a part of your future, it is an honor.
Tokens: the Chinese sentence is split into a mix of single characters and multi-character words (the original Chinese output is shown here only in translation)

Text: Don’t make the user feel stupid.
Tokens: ▁Don, ’, t, ▁make, ▁the, ▁user, ▁feel, ▁stupid, .

Text: The Chinese Language Institute officially announced that the crown of “Chinese character with the most strokes” belongs to the character “龖(dá)”!
Tokens: a mix of single characters and multi-character Chinese words (shown here only in translation); the rare character 龖 is split into the three byte tokens <0xE9>, <0xBE>, <0x96>

In the output above:

  • Most obviously, the special underline character ▁ is introduced, replacing spaces and appearing at the beginning of words and sentences.
  • Some multi-character Chinese words appear, such as “future” and “research institute”, but some of the splits do not match usual Chinese word-segmentation habits.
  • The rare character 龖 is split into three byte-level tokens.

