Large Language Model 10: The SentencePiece Tokenizer

Large language models such as GPT-3/4 and LLaMA/LLaMA 2 all use tokens as the model's input and output. The input text is first converted into tokens (positive integers), and the model then predicts the next token from the string of tokens that corresponds to the text.
You can use the tokenizer tool on the OpenAI website to see how the GPT-3 tokenizer behaves. Here we take "Hello World" as an example.

There are 30 tokens in total. English words are generally represented by individual tokens, and tokens are case-sensitive, so Hello and hello map to different tokens. Words preceded by a space are also encoded as single tokens that include the leading space, which makes encoding the whole sentence more efficient (each space does not need its own token). Chinese characters are typically tokenized into two or three IDs (represented as positive integers) each, as with the Chinese text and the exclamation mark in the example above.
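As a local illustration, the same behaviour can be reproduced with the open-source tiktoken package. This is a sketch, not the web tool mentioned above, and the encoding name cl100k_base is an assumption (GPT-3 itself used the older r50k_base encoding):

import tiktoken

# Assumed encoding for illustration; pick the one matching the model of interest.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["Hello World", "hello world", " world"]:
    ids = enc.encode(text)
    print(text, "->", ids, "->", [enc.decode([i]) for i in ids])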

In whitespace-delimited languages such as English, text is pre-tokenized, typically using subword units that do not cross word boundaries. On the other hand, in languages such as Japanese and Chinese, subwords are extracted from unsegmented text without pre-tokenization.

What is SentencePiece

SentencePiece is an open-source project from Google for building subword vocabularies and tokenizers for NLP. Its goal is to find a subword encoding that maximizes the information carried by the vocabulary (word frequency plus diversity) under a given vocabulary-size budget. For example, the English words simple and simplify have the same meaning and differ only to satisfy grammar, so encoding them with two completely independent tokens is redundant. Another consideration is word frequency: both Chinese characters and English words have commonly and rarely used members, and words that appear rarely yet are given independent tokens provide so few training examples that their information is easily lost during deep learning (information compression).

BPE encoding algorithm

Byte Pair Encoding (BPE) is the most commonly used tokenization method in large language models. An intuitive way to tokenize is:
Treat each word as a token and number it. This matches human language habits, but it is not an efficient encoding. A language typically has tens to hundreds of thousands of words, and today's large language models support many languages. If every word were encoded independently, the model would have to predict over a vocabulary of tens of thousands to millions of entries at every step, which is computationally very expensive.

BPE is a simple data compression algorithm, first proposed in the 1994 article "A New Algorithm for Data Compression". Its core idea is:
At each step, BPE replaces the most frequent pair of adjacent units in the data with a new unit that does not yet appear in the data, and iterates until a stopping condition is met. The goal is to cover the segmentation of all words with a limited vocabulary while minimizing the number of tokens. This works because English roots, etymology, tenses and other grammatical patterns mean that many words share common parts.
For example, in the sequence aaabdaaabac, the most frequent adjacent pair is aa, so aa is replaced with Z, giving ZabdZabac. The most frequent pair is then ab, which is replaced with Y, giving ZYdZYac, and so on (ZY can further be replaced with X, giving XdXac), compressing the original sequence into the final one.
Excerpted from Byte Pair Encoding – The Dark Horse of Modern NLP
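The replacement steps above can be reproduced with a few lines of Python. This is a toy sketch: the pair counts are printed only for illustration, and ties between equally frequent pairs are broken by hand to follow the example:

from collections import Counter

def pair_counts(s: str) -> Counter:
    """Count all adjacent two-character pairs in the string."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

seq = "aaabdaaabac"
for pair, symbol in [("aa", "Z"), ("ab", "Y"), ("ZY", "X")]:
    print(pair_counts(seq).most_common(3))
    seq = seq.replace(pair, symbol)
    print(f"replace {pair!r} with {symbol!r}: {seq}")
# replace 'aa' with 'Z': ZabdZabac
# replace 'ab' with 'Y': ZYdZYac
# replace 'ZY' with 'X': XdXac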

NLP BPE

Subword tokenization in NLP is based on the BPE algorithm, and its procedure is roughly as follows:
1. Prepare the corpus and decide the total number of subwords in the vocabulary;
2. Split each word into characters, append an end-of-word marker ("</w>"), and count the frequency of each word. For example, if the word nice appears 5 times, record it as "n i c e </w>": 5;
3. Count the frequency of every pair of adjacent symbols in the corpus, replace the most frequent pair with a new merged symbol, and add that new symbol (an n-gram) to the vocabulary;
4. Repeat step 3, merging the most frequent pair and recounting, until the vocabulary reaches the set number of subwords.
The Python code for this process is as follows:

import re
from collections import Counter, defaultdict


def build_vocab(corpus: str) -> dict:
    """Step 1. Build vocab from text corpus"""

    # Separate each char in word by space and add mark end of token
    tokens = [" ".join(word) + " </w>" for word in corpus.split()]
    
    # Count frequency of tokens in corpus
    vocab = Counter(tokens)

    return vocab


def get_stats(vocab: dict) -> dict:
    """Step 2. Get counts of pairs of consecutive symbols"""

    pairs = defaultdict(int)
    for word, frequency in vocab.items():
        symbols = word.split()

        # Counting up occurrences of pairs
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += frequency

    return pairs


def merge_vocab(pair: tuple, v_in: dict) -> dict:
    """Step 3. Merge all occurrences of the most frequent pair"""
    
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    
    for word in v_in:
        # replace most frequent pair in all vocabulary
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]

    return v_out


# Example corpus (an assumption for illustration; any raw text string will do)
corpus = "low lower lowest new newer newest wide wider widest"

vocab = build_vocab(corpus)  # Step 1

num_merges = 50  # Hyperparameter: number of merge operations
for i in range(num_merges):

    pairs = get_stats(vocab) # Step 2

    if not pairs:
        break

    # step 3
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
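After the merge loop, each vocabulary key is a word segmented into the learned subwords. A quick way to inspect the result, continuing the code above as a usage sketch, is:

# Print the most frequent words together with their learned segmentation.
for word, freq in sorted(vocab.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{freq:4d}  {word}")

# The subword inventory itself is the set of symbols appearing in the keys.
subwords = {sym for word in vocab for sym in word.split()}
print(len(subwords), "subwords learned")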

BPE strikes a good balance between character-level and word-level representations, which allows it to handle very large corpora. It also lets rare words be encoded with appropriate subword tokens without introducing any "unknown" tokens. This is especially useful for languages such as German, where many compound words make it hard to learn a rich full-word vocabulary. With this tokenization algorithm, every word can now overcome its fear of being forgotten (athazagoraphobia).

WordPiece

Google's BERT model uses the WordPiece algorithm for tokenization. Like BPE, WordPiece repeatedly selects two subwords from the vocabulary and merges them into a new subword. The biggest difference from BPE is how the pair to merge is chosen: BPE merges the adjacent pair with the highest frequency, while WordPiece merges the adjacent pair that most increases the likelihood of the language model.

WordPiece selects subwords as follows. Suppose the sentence S=(t_1,t_2,\cdots,t_n) consists of n subwords, where t_i denotes a subword, and assume that the subwords are independent of one another. The language-model likelihood of the sentence S is then the product of the probabilities of all its subwords, or in log form:

\log P(S) = \sum_{i=1}^{n} \log P(t_i)
Suppose two adjacent subwords x and y are merged, and the resulting subword is denoted z. The change in the likelihood of sentence S can then be expressed as:

\log P(t_z) - (\log P(t_x) + \log P(t_y)) = \log \frac{P(t_z)}{P(t_x)P(t_y)}
The change in likelihood is exactly the mutual information between the two subwords. In short, at each step WordPiece merges the pair of subwords with the largest mutual information, i.e. the two subwords that are most strongly correlated under the language model and that frequently appear next to each other in the corpus.
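As a toy illustration of this scoring rule (the counts below are hypothetical; real implementations estimate the probabilities from the training corpus):

import math

total = 1000                      # total number of subword occurrences (made up)
count = {"es": 120, "t": 300, "est": 90}

def merge_score(x: str, y: str, z: str) -> float:
    """log P(z) - (log P(x) + log P(y)), the mutual information of x and y."""
    p = {k: v / total for k, v in count.items()}
    return math.log(p[z] / (p[x] * p[y]))

print(merge_score("es", "t", "est"))  # higher-scoring pairs are merged first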

Unigram Language Model (ULM)

Like WordPiece, the Unigram Language Model (ULM) also uses a language model to select subwords. The difference is that BPE and WordPiece grow the vocabulary from small to large (an incremental approach), whereas ULM works in the opposite direction: it initializes a large vocabulary and keeps discarding entries according to an evaluation criterion until a size constraint is met. The ULM algorithm considers the different ways a sentence can be segmented and can therefore output multiple candidate segmentations together with their probabilities.
For a sentence S, let X=(x_1,x_2,\cdots,x_m) be one segmentation of the sentence into m subwords. The likelihood of sentence S under this segmentation is:

P(X)=\prod_{i=1}^{m} P(x_i)
For sentence S, the segmentation with the largest likelihood is chosen as the tokenization result:

x^* = \arg\max_{x \in U(x)} P(X)

Here U(x) is the set of all possible segmentations of the sentence. In practice the vocabulary contains tens of thousands of entries, so enumerating every segmentation directly is infeasible; instead, the Viterbi algorithm is used to solve for x^*.
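A minimal sketch of such a Viterbi segmentation under a unigram model is shown below. The subword probabilities are made-up toy values; a real ULM implementation estimates them with EM over a large corpus:

import math

# Toy unigram log-probabilities (hypothetical values for illustration only).
logp = {"h": math.log(0.05), "e": math.log(0.05), "he": math.log(0.1),
        "l": math.log(0.04), "ll": math.log(0.05), "lo": math.log(0.06),
        "llo": math.log(0.08), "o": math.log(0.05), "hello": math.log(0.2)}

def viterbi_segment(text: str) -> list:
    # best[i] = (best log-prob of text[:i], start index of the last piece)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logp and best[start][0] + logp[piece] > best[end][0]:
                best[end] = (best[start][0] + logp[piece], start)
    # Recover the segmentation by walking the back-pointers from the end.
    pieces, i = [], len(text)
    while i > 0:
        start = best[i][1]
        pieces.append(text[start:i])
        i = start
    return pieces[::-1]

print(viterbi_segment("hello"))   # ['hello'] with these toy probabilities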
The probability P(x_i) of each subword is estimated with the EM (expectation-maximization) algorithm. Given the current vocabulary V, the M-step maximizes the following likelihood function:

L=\sum_{s=1}^{|D|}\log P(X^{(s)})=\sum_{s=1}^{|D|}\log\left( \sum_{x \in U(X^{(s)})}P(x)\right)
Here |D| is the number of sentences in the corpus. Intuitively, the formula sums, over all sentences in the corpus, the probability mass contributed by all possible segmentations of each sentence.
Initially, the vocabulary V does not exist, so the ULM algorithm iterates between constructing the vocabulary and solving for the segmentation probabilities:
1. Initialize a sufficiently large vocabulary, e.g. all characters in the corpus plus frequent substrings; it can also be initialized with BPE;
2. Based on the current vocabulary, use the EM algorithm to estimate the probability of each subword in the corpus;
3. For each subword, compute how much the total likelihood decreases when that subword is removed from the vocabulary, and record this as the subword's loss;
4. Sort the subwords by loss, discard a fixed proportion of the subwords with the smallest loss (for example 20%), and build a new vocabulary from the rest. Note that single characters are never discarded, in order to avoid OOV (out-of-vocabulary) cases;
5. Repeat steps 2 to 4 until the vocabulary shrinks to the target size.

SentencePiece

SentencePiece implements subword training directly from raw sentences (e.g. byte-pair encoding (BPE) [Sennrich et al.] and the unigram language model [Kudo]). It is an open-source subword toolkit from Google that integrates the BPE and ULM subword algorithms, and it also supports character- and word-level segmentation. Furthermore, to handle multilingual text, SentencePiece treats a sentence as a sequence of Unicode characters, so the subword algorithms do not depend on any language-specific representation.

Llama uses a SentencePiece byte-pair encoding (BPE) tokenizer that is specific to the Llama models and should not be confused with the tokenizers used by OpenAI models.
For an example of using SentencePiece to encode Chinese, see Sentencepiece_python_module_example.ipynb.
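A minimal usage sketch of the sentencepiece Python package is shown below; the file name corpus.txt, the model prefix, and the vocabulary size are assumptions for illustration:

import sentencepiece as spm

# Train a unigram model on a local text file (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # assumed local training file
    model_prefix="toy_sp",     # writes toy_sp.model and toy_sp.vocab
    vocab_size=8000,
    model_type="unigram",      # or "bpe", "char", "word"
)

sp = spm.SentencePieceProcessor(model_file="toy_sp.model")
print(sp.encode("Hello World", out_type=str))  # subword pieces
print(sp.encode("Hello World"))                # token ids
print(sp.decode(sp.encode("Hello World")))     # back to (normalized) text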

SentencePiece consists of four parts: the Normalizer, the Encoder, the Decoder, and the Trainer (shown in blue in the figure below).

1. The Normalizer does not normalize the mean and variance of numbers after tokenization. In NLP, normalization means standardizing the words in the text into a consistent format; SentencePiece normalizes characters to their NFKC Unicode equivalents (e.g. those starting with U+0026), although other Unicode normalization schemes can also be used. For those interested, a C++ implementation of the normalizer can be found at https://github.com/google/sentencepiece/blob/master/src/normalizer.cc.
2. The Trainer builds the subword vocabulary using one of the training algorithms; SentencePiece supports two: BPE and the unigram language model.
3. The Encoder and Decoder handle pre-processing and post-processing respectively.
The pipeline in the figure can be summarized with the following equation from the SentencePiece paper:

Decode(Encode(Normalize(text))) = Normalize(text)

This property is called lossless tokenization.