SentencePiece, an essential tool for large model vocabulary expansion

Original · Eat jelly without spitting out jelly skin · 2023-08-20 09:42


Background

Since ChatGPT burst onto the scene, open-source large models have blossomed in recent months. Currently, there are three main families of open-source large language models: models derived from ChatGLM (wenda, ChatSQL, etc.), models derived from LLaMA (Alpaca, Vicuna, BELLE, Phoenix, Chimera, etc.), and models derived from Bloom (Bloomz, BELLE, Phoenix, etc.). Among them, ChatGLM-6B is trained mainly on bilingual Chinese-English data, LLaMA is trained mainly on Latin-script languages with English dominating, and Bloom is trained on 46 natural languages and 13 programming languages.

At present, LLaMA is undoubtedly the brightest star among open-source large models. However, unlike ChatGLM-6B and Bloom, which natively support Chinese, LLaMA natively supports only Latin- and Cyrillic-script languages, so its Chinese support is far from ideal. The vocabulary size of the original LLaMA model is 32K, while the vocabularies of multilingual models (such as XLM-R and Bloom) are roughly 250K. Taking Chinese as an example, the LLaMA vocabulary contains only a few hundred Chinese tokens. This leads to two problems:

  • LLaMA’s native tokenizer vocabulary contains only a small number of Chinese characters, so a single Chinese character is often split into multiple tokens (it typically takes 2-3 tokens to represent one Chinese character), which significantly reduces encoding and decoding efficiency, as illustrated in the sketch after this list.

  • Languages that did not appear or appeared only rarely in pre-training were not learned adequately.
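To make the first problem concrete, here is a minimal sketch that counts how many tokens the original 32K LLaMA vocabulary needs for a short Chinese sentence. It assumes the Hugging Face transformers library is installed and that a local copy of the original LLaMA tokenizer is available; the path is a placeholder.

from transformers import LlamaTokenizer

# Placeholder path for a locally available copy of the original LLaMA tokenizer.
tokenizer = LlamaTokenizer.from_pretrained("path/to/original-llama")

text = "白日依山尽，黄河入海流。"   # 12 characters including punctuation
tokens = tokenizer.tokenize(text)
# With the 32K LLaMA vocabulary most Chinese characters fall back to multiple tokens,
# so the token count is typically 2-3x the character count.
print(len(text), len(tokens))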

In order to solve these problems, we may need to expand the Chinese vocabulary. For example: train a Chinese tokenizer model on the Chinese corpus, then merge the Chinese tokenizer with LLaMA’s native tokenizer, and finally obtain a merged tokenizer model by combining their vocabularies.

This article introduces how to use the SentencePiece tool to train a tokenizer model on a Chinese corpus.

Preliminary knowledge

Before explaining SentencePiece, let’s first explain the Tokenizer.

So what is a tokenizer? Simply put, it converts a sequence of characters into a sequence of numbers that serves as the model’s input.

Tokenizers normally come in three granularities: word, char, and subword.

  • word: segmentation by words. For example, Today is sunday. is split on spaces and punctuation into [today, is, sunday, .]

  • character: segmentation by single characters, i.e. char is the minimum granularity. For example, Today is sunday. is split into [t, o, d, a, y, ...., s, u, n, d, a, y, .]

  • subword: segmentation into subword units within words. For example, Today is sunday. might be split into [to, day, is, s, un, day, .] (a toy comparison of the three granularities is sketched after this list)
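As a toy illustration of the three granularities in plain Python (the subword split below is written by hand purely for demonstration):

import re

sentence = "Today is sunday."

word_tokens = re.findall(r"\w+|[^\w\s]", sentence)          # ['Today', 'is', 'sunday', '.']
char_tokens = [c for c in sentence if not c.isspace()]       # ['T', 'o', 'd', 'a', 'y', ...]
subword_tokens = ["To", "day", "is", "s", "un", "day", "."]  # hand-written example of a subword split

print(word_tokens)
print(char_tokens)
print(subword_tokens)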

It can be seen that these three types of granular word segmentation are completely different, each with advantages and disadvantages.

For word-level segmentation:

  • Advantages: Word boundaries and meanings are preserved;

  • Disadvantages: 1) the vocabulary is large, and rare words are difficult to learn; 2) OOV (words may fall outside the vocabulary); 3) it cannot capture morphological and affix relationships between words, so two forms with the same meaning get two different IDs; this is especially obvious in English, e.g. cat vs. cats.

For character-level segmentation:

  • Advantages: the vocabulary is extremely small; for example, the 26 English letters can compose almost all words, and the roughly 5,000 commonly used Chinese characters can compose enough vocabulary;

  • Disadvantages: 1) single characters carry little semantic information, especially in English (it is more reasonable for Chinese, where this approach is often used); 2) the sequence length increases significantly;

Finally, to balance the two approaches above, subword-based segmentation was proposed: it offers a good trade-off between vocabulary size and semantic expressiveness. Common subword algorithms include Byte-Pair Encoding (BPE), Byte-level BPE (BBPE), Unigram LM, WordPiece, and SentencePiece.

  • BPE: byte-pair encoding. The core idea is to start from individual characters and repeatedly merge the most frequent pair of adjacent tokens until the target vocabulary size is reached (a minimal sketch of this merge loop follows this list).

  • BBPE: BBPE extends the core idea of BPE from the character level to the byte level. One problem with BPE is that the base character set can become very large when Unicode text is involved. BBPE instead treats a single byte as a “character”, regardless of how many bytes the actual character set uses per character, so the base vocabulary is fixed at 256 (2^8). The advantage is that the vocabulary can be shared across languages and its size can be reduced significantly; the disadvantage is that for languages like Chinese, the token sequence for a given text becomes noticeably longer. Thanks to the shared byte-level vocabulary, a BBPE-based model may perform better than a BPE-based one, but its sequences are slightly longer, which also results in longer training/inference time. In terms of implementation, BBPE hardly differs from BPE, except that the base vocabulary is the 256-byte set.

  • WordPiece: WordPiece can be seen as a variant of BPE. Like BPE, it repeatedly selects two subwords from the vocabulary and merges them into a new subword; the difference is the merge criterion. BPE merges the adjacent pair with the highest frequency, while WordPiece merges the adjacent pair that yields the highest language-model probability, i.e. new subwords are generated based on probability rather than the most frequent byte pair.

  • Unigram: On the surface, the big difference from BPE and WordPiece is the direction of construction: the former two initialize a small vocabulary and grow it up to a size limit, whereas the Unigram Language Model initializes a large vocabulary and then repeatedly prunes it based on language-model evaluation until the target vocabulary size is reached.

  • SentencePiece: SentencePiece is an open-source subword toolkit from Google. It treats a sentence as a whole and splits it into pieces without retaining the notion of natural words: spaces are generally treated as a special character, and a BPE or Unigram model is then used to build the vocabulary. Besides integrating the BPE and ULM subword algorithms, SentencePiece also supports character- and word-level segmentation.
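To make the BPE idea from the list above concrete, here is a minimal, self-contained sketch of the merge loop on a toy corpus. It only illustrates the algorithm; it is not the SentencePiece implementation.

from collections import Counter

corpus = ["low", "lower", "newest", "widest"]
# Start from individual characters.
words = [list(w) for w in corpus]

def most_frequent_pair(words):
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

for _ in range(5):  # number of merges ~ target vocabulary size minus base characters
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merged = "".join(pair)
    # Replace every occurrence of the most frequent adjacent pair with the merged symbol.
    new_words = []
    for symbols in words:
        i, out = 0, []
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_words.append(out)
    words = new_words
    print("merge:", pair, "->", merged)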

The figure below shows the tokenization algorithms used by some mainstream models: GPT-1 uses BPE, LLaMA/BLOOM/GPT-2/ChatGLM use BBPE, BERT/DistilBERT/Electra use WordPiece, and XLNet uses SentencePiece.

[Figure: tokenization algorithms used by mainstream models]

From the figure above, we can also see that many mainstream open-source large models implement their tokenizers with SentencePiece on top of the BBPE algorithm. The following sections explain how to use the SentencePiece tool in practice.

SentencePiece Introduction

SentencePiece is an unsupervised text tokenizer and detokenizer mainly intended for neural-network-based text generation systems in which the vocabulary size is fixed before the neural model is trained. SentencePiece implements subword units (e.g., byte-pair encoding (BPE) and the unigram language model) and can train subword models directly from raw sentences. This makes it possible to build a purely end-to-end system that does not rely on language-specific pre- or post-processing.

SentencePiece properties

The number of unique tokens is predetermined

Neural machine translation models typically operate with a fixed vocabulary. Unlike most unsupervised word segmentation algorithms that assume unlimited vocabulary, SentencePiece keeps the final vocabulary size fixed when training the word segmentation model, such as: 8k, 16k or 32k.

Train from original sentences

Previous sub-word implementations assumed that the input sentences were pre-tokenized. This constraint is necessary for efficient training, but complicates preprocessing since we have to run the language-dependent tokenizer ahead of time. SentencePiece’s implementation is fast enough to train models from raw sentences. This is useful for training tokenizers and detokenizers for Chinese and Japanese since there are no explicit spaces between these words.

Spaces are considered basic symbols

The first step in natural language processing is text tokenization.

For example, a standard English tokenizer would segment the text Hello world. into three tokens: [Hello] [World] [.]. This conversion is irreversible: the information that there is no space between “World” and “.” is lost, because whitespace is removed from the tokenized sequence, e.g. Tokenize("World.") == Tokenize("World .")

However, SentencePiece treats the input text as a sequence of Unicode characters, and spaces are handled as ordinary symbols. To handle whitespace explicitly as a basic token, SentencePiece first escapes spaces with the metasymbol “▁” (U+2581):

Hello▁World.

Then, split this text into small chunks, for example:

[Hello] [▁Wor] [ld] [.]

Since whitespace is preserved in the segmented text, we can detokenize the text without ambiguity.

detokenized = ''.join(pieces).replace('▁', ' ')

This feature enables detokenization without relying on language-specific resources.

Note: We cannot apply the same lossless transformation when splitting sentences using standard tokenizers because they treat spaces as special symbols. Tokenized sequences do not retain the information needed to recover the original sentence.
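A minimal sketch of this lossless round trip with the Python library; it assumes an already trained model file (here the spm.model used in the example further below):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm.model")
pieces = sp.encode("Hello World.", out_type=str)   # e.g. ['▁Hello', '▁World', '.']
# Concatenate the pieces and turn the U+2581 metasymbol back into spaces.
detokenized = "".join(pieces).replace("\u2581", " ").lstrip()
print(detokenized)                                 # "Hello World."
# sp.decode(pieces) performs the same detokenization internally.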

Subword regularization and BPE-dropout

Subword regularization and BPE-dropout are simple regularization methods that actually enhance the training data with real-time subword sampling, which helps improve the accuracy and robustness of neural machine translation (NMT) models.

To enable subword regularization, you can integrate the SentencePiece library (C++/Python) into your NMT system to sample one segment for each parameter update, unlike standard offline data preparation.

Below is an example of a Python library.

>>> import sentencepiece as spm
>>> s = spm.SentencePieceProcessor(model_file='spm.model')
>>> for n in range(5):
...     s.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)
...
['▁', 'N', 'e', 'w', '▁York']
['▁', 'New', '▁York']
['▁', 'New', '▁Y', 'o', 'r', 'k']
['▁', 'New', '▁York']
['▁', 'New', '▁York']

You’ll notice that New York segments differently with each call to SampleEncode (C++) or encoding with enable_sampling=True (Python). Details of the sampling parameters can be found in sentencepiece_processor.h.
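Below is a minimal sketch of how such on-the-fly sampling can be wired into a training data pipeline so that every epoch sees a different segmentation of the same sentences. It assumes a trained spm.model; the corpus here is just a placeholder list.

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm.model")
corpus = ["New York is big.", "I saw a girl with a telescope."]

def sampled_examples(corpus, epochs=3):
    for _ in range(epochs):
        for sentence in corpus:
            # A fresh segmentation is sampled for every parameter update.
            yield sp.encode(sentence, out_type=int,
                            enable_sampling=True, alpha=0.1, nbest_size=-1)

for ids in sampled_examples(corpus):
    print(ids)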

SentencePiece technical advantages

  • Purely data-driven: SentencePiece trains tokenization and detokenization models from sentences. Pre-tokenization (Moses tokenizer/MeCab/KyTea) is not always required.

  • Language independent: SentencePiece treats sentences as sequences of Unicode characters. There is no language-dependent logic.

  • Multi-subword algorithm: supports BPE and unigram language models.

  • Subword regularization: SentencePiece implements subword regularization and BPE-dropout subword sampling, which helps improve the robustness and accuracy of the NMT model.

  • Fast and lightweight: Segmentation speed is about 50k sentences/second and memory footprint is about 6MB.

  • Self-contained: As long as the same model file is used, the same tokenization/detokenization can be obtained.

  • Direct vocabulary ID generation: SentencePiece manages the vocabulary-to-ID mapping and can generate vocabulary ID sequences directly from raw sentences (see the sketch after this list).

  • NFKC-based normalization: SentencePiece performs NFKC-based text normalization.
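A short sketch of the self-contained vocabulary-to-ID mapping and direct ID generation (again assuming a trained spm.model):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm.model")

print(sp.get_piece_size())                       # total vocabulary size
ids = sp.encode("Hello World.", out_type=int)    # raw text straight to a vocabulary ID sequence
print(ids)
print([sp.id_to_piece(i) for i in ids])          # inspect the pieces behind the IDs
print(sp.decode(ids))                            # and back to text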

Comparison between SentencePiece and other implementations

The table below summarizes the tokenizers of several mainstream open-source models:

| Model | Training data | Parameters | Training data scope | Vocabulary size | Tokenization algorithm | Tokenizer backend |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA | 1T~1.4T tokens (1T for 7B/13B, 1.4T for 33B/65B) | 7B~65B | Latin-script languages, mainly English | 32000 | BBPE | Implemented with the SentencePiece tool |
| ChatGLM-6B | ~1T tokens | 6B | Chinese-English bilingual | 130528 | BBPE | Implemented with the SentencePiece tool |
| Bloom | 1.6TB of preprocessed text, converted to 350B unique tokens | 6B | 46 natural languages, 13 programming languages | 250680 | BBPE | HuggingFace Tokenizers (SentencePiece-like) |

The following table compares SentencePiece with other subword implementations:

| Characteristic | SentencePiece | subword-nmt | WordPiece |
| --- | --- | --- | --- |
| Supported algorithms | BPE, unigram, char, word | BPE | BPE* |
| Open source? | Yes | Yes | Google internal |
| Subword regularization | Yes | No | No |
| Python library (pip) | Yes | No | N/A |
| C++ library | Yes | No | N/A |
| Pre-segmentation required? | No | Yes | Yes |
| Customizable normalization (e.g. NFKC) | Yes | No | N/A |
| Direct ID generation | Yes | No | N/A |

Note: The BPE algorithm used in WordPiece is slightly different from the original BPE.

Environment installation

SentencePiece is split into two parts: model training and model usage. The training part is implemented in C++ and is compiled into binary command line programs; after training, a model file and a vocabulary file are produced.

The model usage part supports both the binary programs and Python. The vocabulary generated by training is plain text and editable, so it can also be read and used from any other language.

Build and install the SentencePiece command line tool from C++ sources

Since model training requires the command line tools, we first build and install the SentencePiece command line tools.

Building SentencePiece requires the following tools and libraries:

  • cmake

  • C++11 compiler

  • gperftools library (optional, can get 10-40% performance improvement)

On Ubuntu, you can use apt-get to install the build tools:

sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev

Next, build and install the command line tools as follows.

git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build
cd build
cmake ..
make -j $(nproc)
make install
ldconfig -v

View command usage documentation:

spm_train --help

Use pip to install the sentencepiece library

SentencePiece provides a Python wrapper that supports both training and segmentation. Since we will use the model from Python later, install the SentencePiece Python package with pip.

pip install sentencepiece

Training model

Since the official website only provides English and Japanese sample data, to train a Chinese model you need to prepare Chinese training data yourself. This article uses Dream of the Red Chamber (the text needs to be cleaned beforehand) for model training.

spm_train --input=/workspace/data/book/hongluomeng_clean.txt --model_prefix=/workspace/model/book/hongluomeng-tokenizer --vocab_size=4000 --character_coverage=0.9995 --model_type=bpe

Parameter description (a Python equivalent of this command is sketched after the list):

  • --input: training corpus file; a comma-separated list of files can also be passed. The format is one sentence per line. There is no need to run a tokenizer, normalizer, or preprocessor beforehand; by default, SentencePiece normalizes the input with Unicode NFKC.

  • --model_prefix: prefix of the output model name. After training, the .model and .vocab files are generated under this prefix.

  • --vocab_size: vocabulary size after training, e.g. 8000, 16000, or 32000.

  • --character_coverage: the fraction of characters covered by the model. A value of 0.9995 is recommended for languages with rich character sets (such as Japanese or Chinese), and 1.0 for languages with smaller character sets.

  • --model_type: model type. Possible values: unigram (default), bpe, char, or word. When using word, the input sentences must be pre-tokenized.
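The same training run can also be launched from the Python library; the sketch below mirrors the spm_train command above (the paths are the ones used in this article and should be adapted to your environment):

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="/workspace/data/book/hongluomeng_clean.txt",
    model_prefix="/workspace/model/book/hongluomeng-tokenizer",
    vocab_size=4000,
    character_coverage=0.9995,
    model_type="bpe",
)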

Training output:

> spm_train --input=/workspace/data/book/hongluomeng_clean.txt --model_prefix=/workspace/model/book/hongluomeng-tokenizer --vocab_size=4000 --character_coverage=0.9995 --model_type=bpe
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with:
trainer_spec {
  input: /workspace/data/book/hongluomeng_clean.txt
  input_format:
  model_prefix: /workspace/model/book/hongluomeng-tokenizer
  model_type: BPE
  vocab_size: 4000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter:
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars:
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface: ?
  enable_differential_privacy: 0
  differential_privacy_noise_level: 0
  differential_privacy_clipping_threshold: 0
}
normalizer_spec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespaces: 1
  escape_whitespaces: 1
  normalization_rule_tsv:
}
denormalizer_spec {}
trainer_interface.cc(351) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(183) LOG(INFO) Loading corpus: /workspace/data/book/hongluomeng_clean.txt
trainer_interface.cc(378) LOG(WARNING) Found too long line (4224 > 4192).
trainer_interface.cc(380) LOG(WARNING) Too long lines are skipped in the training.
trainer_interface.cc(381) LOG(WARNING) The maximum length can be changed with --max_sentence_length=<size> flag.
trainer_interface.cc(407) LOG(INFO) Loaded all 3144 sentences
trainer_interface.cc(414) LOG(INFO) Skipped 6 too long sentences.
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: <s>
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(428) LOG(INFO) Normalizing sentences...
trainer_interface.cc(537) LOG(INFO) all chars count=866703
trainer_interface.cc(548) LOG(INFO) Done: 99.95% characters are covered.
trainer_interface.cc(558) LOG(INFO) Alphabet size=3986
trainer_interface.cc(559) LOG(INFO) Final character coverage=0.9995
trainer_interface.cc(591) LOG(INFO) Done! preprocessed 3144 sentences.
trainer_interface.cc(597) LOG(INFO) Tokenizing input sentences with whitespace: 3144
trainer_interface.cc(608) LOG(INFO) Done! 3395
bpe_model_trainer.cc(159) LOG(INFO) Updating active symbols. max_freq=10909 min_freq=13
trainer_interface.cc(686) LOG(INFO) Saving model: /workspace/model/book/hongluomeng-tokenizer.model
trainer_interface.cc(698) LOG(INFO) Saving vocabs: /workspace/model/book/hongluomeng-tokenizer.vocab

Model output files (the model and the vocabulary):

> ls -al /workspace/model/book
total 328
drwxr-xr-x 2 root root 4096 May 19 01:55 .
drwxrwxrwx 21 root root 4096 May 19 01:55 ..
-rw-r--r-- 1 root root 285840 May 19 01:55 hongluomeng-tokenizer.model
-rw-r--r-- 1 root root 38885 May 19 01:55 hongluomeng-tokenizer.vocab

Check out the word list:

> head -n20 /workspace/model/book/hongluomeng-tokenizer.vocab
<unk>	0
<s>	0
</s>	0
:"	-0
。"	-1
宝玉	-2
笑道	-3
?"	-4
太太	-5
什么	-6
凤姐	-7
了一	-8
贾母	-9
也不	-10
,	-11
。	-12
	-13
不	-14
	-15
一	-16

Using the model

Using the model from the command line

Encode raw text into sentence pieces (tokens):

> echo "The white sun covers the mountains, and the Yellow River flows into the sea." | spm_encode --model=/workspace/model/book/hongluomeng-tokenizer.model
▁ 白 日 依 山 尽 , 黄 河 入 海 流 。

Encode raw text into sentence piece (token) IDs. Note: the --output_format parameter defaults to piece.

> echo "The sun sets over the mountains, and the Yellow River flows into the sea." | spm_encode --model=/workspace/model/book/hongluomeng-tokenizer.model --output_format=id
60 254 70 333 468 400 14 733 1476 317 603 510 15

Decode sentence piece (token) IDs back into raw text:

> echo "60 254 70 333 468 400 14 733 1476 317 603 510 15" | spm_decode --model=/workspace/model/book/hongluomeng-tokenizer.model --input_format=id
白日依山尽,黄河入海流。

Export vocabulary based on model files.

# spm_export_vocab --model=<model file> --output=<output file>
spm_export_vocab --model=/workspace/model/book/hongluomeng-tokenizer.model --output=/workspace/output/hongluomeng.vocab

Here, --output specifies the output file, which stores the vocabulary and the emission log probabilities. The vocabulary ID of each piece corresponds to its line number in this file.
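Since the ID of a piece is simply its line number, the exported vocabulary is easy to consume from other code; a minimal sketch using the output path from the command above:

# Each line of the exported file holds a piece and its score, separated by a tab;
# the piece ID is the zero-based line number.
id_to_piece = {}
with open("/workspace/output/hongluomeng.vocab", encoding="utf-8") as f:
    for idx, line in enumerate(f):
        piece, score = line.rstrip("\n").split("\t")
        id_to_piece[idx] = piece

print(id_to_piece[1], id_to_piece[2])   # <s> </s> with the default special-token layout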

The official website also provides end-to-end (including: training (spm_train), encoding (spm_encode) and decoding (spm_decode)) examples, as shown below:

% spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
unigram_model_trainer.cc(494) LOG(INFO) Starts training with:
input: "../data/botchan.txt"
... <snip>
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(272) LOG(INFO) Saving model: m.model
trainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab

% echo "I saw a girl with a telescope." | spm_encode --model=m.model
I saw a girl with a te le s c o pe .

% echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id
9 459 11 939 44 11 4 142 82 8 28 21 132 6

#The original input sentence is restored from the vocabulary id sequence
% echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id
I saw a girl with a telescope.

Using the model from the Python library

>>> import sentencepiece as spm
>>>
>>> sp = spm.SentencePieceProcessor()
>>>
>>> text="This Jia Yucun is originally from Huzhou, and is also a family of poets, writers, and officials. Because he was born in the last days, his parents and ancestors have lost their roots, and the population has declined. He is the only one left. He is of no use in his hometown. Because he went to Beijing to gain fame and then reorganize his foundation."
>>>
>>> sp.Load("/workspace/model/book/hongluomeng-tokenizer.model")
True
>>> print(sp.EncodeAsPieces(text))
['▁', '这', '贾', '雨', '村', '原', '系', '湖', '州', '人', '氏', ',', '也', '是', '诗', '书', '仕', '宦', '之', '族', ',', '因', '他', '生', '于', '末', '世', ',', '父', '母', '祖', '宗', '根', '基', '已', '尽', ',', '人', '口', '衰', '丧', ',', '只', '剩', '得', '他', '一', '身', '一', '口', ',', '在', '家', '乡', '无', '益', ',', '因', '进', '京', '求', '取', '功', '名', ',', '再', '整', '基', '业', '。']
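Continuing the same session, the processor can also map the text straight to IDs and back (output omitted here):

>>> ids = sp.EncodeAsIds(text)     # the same text mapped directly to vocabulary IDs
>>> sp.DecodeIds(ids)              # detokenize the IDs back into (normalized) text
>>> sp.GetPieceSize()              # vocabulary size of the loaded model (4000 in this example)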

In addition, the newly trained vocabulary can be merged with the original one. For details, refer to Chinese-LLaMA-Alpaca, which trains a 20K Chinese vocabulary with SentencePiece on a general Chinese corpus and merges it with the 32K vocabulary of the original LLaMA model (the HF implementation of LLaMA tokenization is also BBPE on top of SentencePiece); its merge code is a useful reference.
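As a rough sketch of the merge idea only (not the actual Chinese-LLaMA-Alpaca script; paths are placeholders and the protobuf package must be installed): load both tokenizer models as protos, append the pieces of the newly trained Chinese model that LLaMA does not already have, and serialize the result.

from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model

# Placeholder paths for the original LLaMA tokenizer and the newly trained Chinese model.
llama_model = sp_pb2_model.ModelProto()
llama_model.ParseFromString(open("llama/tokenizer.model", "rb").read())

chinese_model = sp_pb2_model.ModelProto()
chinese_model.ParseFromString(open("hongluomeng-tokenizer.model", "rb").read())

existing_pieces = {p.piece for p in llama_model.pieces}
for p in chinese_model.pieces:
    if p.piece not in existing_pieces:
        # Append the new piece to LLaMA's model proto.
        new_piece = sp_pb2_model.ModelProto().SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0
        llama_model.pieces.append(new_piece)

with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_model.SerializeToString())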

Conclusion

This article explained the basic principles and usage of SentencePiece. When analyzing a specific domain, we can use SentencePiece to train a tokenizer model on books and documents from that domain; SentencePiece makes no assumptions about the content itself, and in general, the more training data, the better the resulting model.

Reference documentation:

  • SentencePieces

  • BPE, WordPiece and SentencePiece

  • Tokenizer in large models: BPE, WordPiece, Unigram LM, SentencePiece

  • sentencepiece principles and practice

  • [OpenLLM 008] Tokenizer, the basic component of the large model – a long article of 10,000 words comprehensively interprets the tokenization algorithm and tokenizers in LLM (tokenization & tokenizers): BPE/WordPiece/ULM & beyond

  • Summary of the tokenizers