SentencePiece: an essential tool for large model vocabulary expansion
Background
With the rapid rise of ChatGPT, open-source large models have also flourished in recent months. Currently, open-source large language models fall mainly into three families: those derived from ChatGLM (wenda, ChatSQL, etc.), those derived from LLaMA (Alpaca, Vicuna, BELLE, Phoenix, Chimera, etc.), and those derived from Bloom (Bloomz, BELLE, Phoenix, etc.). Among them, ChatGLM-6B is trained bilingually on Chinese and English, LLaMA is trained mainly on Latin-script languages with English predominant, and Bloom is trained on 46 natural languages and 13 programming languages.
Model | Training data | Model parameters | Training data scope | Vocabulary size | Tokenization algorithm | Tokenizer backend |
---|---|---|---|---|---|---|
LLaMA | 1T~1.4T tokens (7B/13B use 1T; 33B/65B use 1.4T) | 7B~65B | Latin-script languages, mainly English | 32000 | BBPE | Implemented based on the SentencePiece tool |
ChatGLM-6B | About 1T tokens | 6B | Chinese-English bilingual | 130528 | BBPE | Based on the SentencePiece tool |
Bloom | 1.6TB of preprocessed text, converted into 350B unique tokens | 176B | 46 natural languages, 13 programming languages | 250680 | BBPE | HuggingFace Tokenizers (SentencePiece-like) |
Characteristics | SentencePiece | subword-nmt | WordPiece |
---|---|---|---|
Supported algorithms | BPE, unigram, char, word | BPE | BPE* |
Open source? | Yes | Yes | Google-internal |
Supports subword regularization? | Yes | No | No |
Provides a Python library (pip)? | Yes | No | N/A |
Provides a C++ library? | Yes | No | N/A |
Requires pre-tokenization? | No | Yes | Yes |
Supports customizable normalization (e.g., NFKC)? | Yes | No | N/A |
Generates token IDs directly? | Yes | No | N/A |
Note: The BPE algorithm used in WordPiece differs slightly from the original BPE.
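Among the features in the table, subword regularization is unique to SentencePiece: with sampling enabled, the same sentence can be segmented differently on each call, which is often used as a form of data augmentation. Below is a minimal sketch with the SentencePiece Python API, assuming an already-trained model file m.model (placeholder path):

import sentencepiece as spm

# Load an already-trained model (placeholder path).
sp = spm.SentencePieceProcessor(model_file="m.model")

# enable_sampling=True makes encode() sample a segmentation instead of
# returning the single best one; nbest_size=-1 samples from all candidates
# and alpha controls the smoothing of the sampling distribution.
for _ in range(3):
    print(sp.encode("I saw a girl with a telescope.", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))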
Environment installation
SentencePiece consists of two parts: model training and model usage. The training part is implemented in C++ and can be compiled into binary executables. After training, a model file and a vocabulary file are generated.
The model usage part can be called either through the binary programs or from Python. The vocabulary file generated by training is plain, editable text, so it can also be read and used from any other language.
Build and install the SentencePiece command line tool from C++ sources
Since model training relies on the command-line tool, we need to build and install the SentencePiece command-line tools first.
Building SentencePiece requires the following tools and libraries:
- cmake
- C++11 compiler
- gperftools library (optional, can yield a 10-40% performance improvement)
On Ubuntu, you can use apt-get to install the build tools:
sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev
Next, build and install the command line tools as follows.
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build
cd build
cmake ..
make -j $(nproc)
make install
ldconfig -v
View command usage documentation:
spm_train --help
Use pip to install the sentencepiece library
SentencePiece provides a Python wrapper that supports both training and segmentation. Since the model will later be used from Python, install the SentencePiece Python package with pip.
pip install sentencepiece
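A quick sanity check that the package imports correctly (a minimal sketch; the version string depends on what pip installed):

import sentencepiece as spm

# Confirm the package is importable and print its version.
print(spm.__version__)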
Training a model
The official website only provides English and Japanese data, so to train a model on Chinese you need to obtain Chinese training data first. This article uses Dream of the Red Chamber for model training (the data needs to be cleaned in advance by yourself).
spm_train --input=/workspace/data/book/hongluomeng_clean.txt --model_prefix=/workspace/model/book/hongluomeng-tokenizer --vocab_size=4000 --character_coverage=0.9995 --model_type=bpe
Parameter description:

- --input: training corpus file; a comma-separated list of files can also be passed. The file format is one sentence per line. There is no need to run a tokenizer, normalizer, or preprocessor beforehand; by default, SentencePiece normalizes the input with Unicode NFKC.
- --model_prefix: output model name prefix. After training completes, <model_prefix>.model and <model_prefix>.vocab files are generated.
- --vocab_size: vocabulary size after training, for example 8000, 16000, or 32000.
- --character_coverage: the fraction of characters covered by the model. The recommended value is 0.9995 for languages with rich character sets (such as Japanese or Chinese) and 1.0 for other languages with small character sets.
- --model_type: model type. Possible values: unigram (default), bpe, char, or word. When using word, the input sentences must be pretokenized.
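Besides the spm_train binary, the same training can also be launched from the Python API; below is a minimal sketch assuming the same corpus and output paths as the command above:

import sentencepiece as spm

# Python equivalent of the spm_train command used above.
spm.SentencePieceTrainer.train(
    input="/workspace/data/book/hongluomeng_clean.txt",
    model_prefix="/workspace/model/book/hongluomeng-tokenizer",
    vocab_size=4000,
    character_coverage=0.9995,
    model_type="bpe",
)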
Training process log:
> spm_train --input=/workspace/data/book/hongluomeng_clean.txt --model_prefix=/workspace/model/book/hongluomeng-tokenizer --vocab_size=4000 --character_coverage=0.9995 --model_type=bpe
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with:
trainer_spec {
  input: /workspace/data/book/hongluomeng_clean.txt
  input_format:
  model_prefix: /workspace/model/book/hongluomeng-tokenizer
  model_type: BPE
  vocab_size: 4000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter:
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars:
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface: ?
  enable_differential_privacy: 0
  differential_privacy_noise_level: 0
  differential_privacy_clipping_threshold: 0
}
normalizer_spec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespaces: 1
  escape_whitespaces: 1
  normalization_rule_tsv:
}
denormalizer_spec {}
trainer_interface.cc(351) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(183) LOG(INFO) Loading corpus: /workspace/data/book/hongluomeng_clean.txt
trainer_interface.cc(378) LOG(WARNING) Found too long line (4224 > 4192).
trainer_interface.cc(380) LOG(WARNING) Too long lines are skipped in the training.
trainer_interface.cc(381) LOG(WARNING) The maximum length can be changed with --max_sentence_length=<size> flag.
trainer_interface.cc(407) LOG(INFO) Loaded all 3144 sentences
trainer_interface.cc(414) LOG(INFO) Skipped 6 too long sentences.
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: <s>
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(428) LOG(INFO) Normalizing sentences...
trainer_interface.cc(537) LOG(INFO) all chars count=866703
trainer_interface.cc(548) LOG(INFO) Done: 99.95% characters are covered.
trainer_interface.cc(558) LOG(INFO) Alphabet size=3986
trainer_interface.cc(559) LOG(INFO) Final character coverage=0.9995
trainer_interface.cc(591) LOG(INFO) Done! preprocessed 3144 sentences.
trainer_interface.cc(597) LOG(INFO) Tokenizing input sentences with whitespace: 3144
trainer_interface.cc(608) LOG(INFO) Done! 3395
bpe_model_trainer.cc(159) LOG(INFO) Updating active symbols. max_freq=10909 min_freq=13
trainer_interface.cc(686) LOG(INFO) Saving model: /workspace/model/book/hongluomeng-tokenizer.model
trainer_interface.cc(698) LOG(INFO) Saving vocabs: /workspace/model/book/hongluomeng-tokenizer.vocab
Model output files (model and vocabulary):
> ls -al /workspace/model/book
total 328
drwxr-xr-x  2 root root   4096 May 19 01:55 .
drwxrwxrwx 21 root root   4096 May 19 01:55 ..
-rw-r--r--  1 root root 285840 May 19 01:55 hongluomeng-tokenizer.model
-rw-r--r--  1 root root  38885 May 19 01:55 hongluomeng-tokenizer.vocab
Check the vocabulary (each line is a piece followed by its score; the piece id corresponds to the line number):
> head -n20 /workspace/model/book/hongluomeng-tokenizer.vocab
<unk>	0
<s>	0
</s>	0
:"	-0
。"	-1
宝玉	-2
笑	-3
?"	-4
太太	-5
什么	-6
凤姐	-7
一	-8
贾母	-9
也不	-10
,	-11
。	-12
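Since the .vocab file is plain text with one piece and its score per line, it can be inspected with ordinary tools; below is a minimal sketch that loads it into a Python list (a piece's id is simply its 0-based line number):

# Read the generated vocabulary file: each line is "piece<TAB>score",
# and a piece's id equals its 0-based line number.
vocab_path = "/workspace/model/book/hongluomeng-tokenizer.vocab"
pieces = []
with open(vocab_path, encoding="utf-8") as f:
    for line in f:
        piece, score = line.rstrip("\n").split("\t")
        pieces.append((piece, float(score)))

print(len(pieces))   # equals --vocab_size (4000)
print(pieces[:3])    # starts with the meta pieces <unk>, <s>, </s>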
Using the model
Using the model from the command line
Encode raw text into sentence pieces (tokens).
> echo "白日依山尽，黄河入海流。" | spm_encode --model=/workspace/model/book/hongluomeng-tokenizer.model
▁ 白 日 依 山 尽 , 黄 河 入 海 流 。
Encode the raw text into sentence piece (token) ids. Note: the --output_format parameter defaults to piece.
> echo "The sun sets over the mountains, and the Yellow River flows into the sea." | spm_encode --model=/workspace/model/book/hongluomeng-tokenizer.model --output_format=id 60 254 70 333 468 400 14 733 1476 317 603 510 15
Decode sentence piece (token) ids back into raw text.
> echo "60 254 70 333 468 400 14 733 1476 317 603 510 15" | spm_decode --model=/workspace/model/book/hongluomeng-tokenizer.model --input_format=id
白日依山尽,黄河入海流。
Export the vocabulary from a model file.
# spm_export_vocab --model=<model file> --output=<output file>
spm_export_vocab --model=/workspace/model/book/hongluomeng-tokenizer.model --output=/workspace/output/hongluomeng.vocab
Here, --output specifies the output file, which stores the vocabulary together with the emission log probabilities. A piece's id corresponds to its line number in this file.
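The same piece/id mapping is also available from the Python API; a minimal sketch using the model trained above:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("/workspace/model/book/hongluomeng-tokenizer.model")

# Ids correspond to line numbers of the exported vocabulary.
print(sp.GetPieceSize())     # total vocabulary size, 4000 here
print(sp.IdToPiece(1))       # '<s>'
print(sp.PieceToId("</s>"))  # 2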
The official website also provides an end-to-end example covering training (spm_train), encoding (spm_encode), and decoding (spm_decode), as shown below:
% spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
unigram_model_trainer.cc(494) LOG(INFO) Starts training with:
input: "../data/botchan.txt"
... <snip>
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(272) LOG(INFO) Saving model: m.model
trainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab

% echo "I saw a girl with a telescope." | spm_encode --model=m.model
▁I ▁saw ▁a ▁girl ▁with ▁a ▁te le s c o pe .

% echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id
9 459 11 939 44 11 4 142 82 8 28 21 132 6

# The original input sentence is restored from the vocabulary id sequence
% echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id
I saw a girl with a telescope.
Using the model from the Python library
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor()
>>> text = "这贾雨村原系湖州人氏，也是诗书仕宦之族，因他生于末世，父母祖宗根基已尽，人口衰丧，只剩得他一身一口，在家乡无益，因进京求取功名，再整基业。"
>>> sp.Load("/workspace/model/book/hongluomeng-tokenizer.model")
True
>>> print(sp.EncodeAsPieces(text))
['▁', '这', '贾', '雨', '村', '原', '系', '湖', '州', '人', '氏', ',', '也', '是', '诗', '书', '仕', '宦', '之', '族', ',', '因', '他', '生', '于', '末', '世', ',', '父', '母', '祖', '宗', '根', '基', '已', '尽', ',', '人', '口', '衰', '丧', ',', '只', '剩', '得', '他', '一', '身', '一', '口', ',', '在', '家', '乡', '无', '益', ',', '因', '进', '京', '求', '取', '功', '名', ',', '再', '整', '基', '业', '。']
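The Python API can also convert text to ids and back, mirroring the spm_encode/spm_decode commands above; a minimal sketch reusing the model and the example sentence from the command-line section:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("/workspace/model/book/hongluomeng-tokenizer.model")

text = "白日依山尽，黄河入海流。"

# Encode to ids and decode back; decoding returns the normalized text.
ids = sp.EncodeAsIds(text)
print(ids)
print(sp.DecodeIds(ids))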
In addition, a newly trained vocabulary can be merged with an existing one. For details, refer to the Chinese-LLaMA-Alpaca code, which uses SentencePiece to train a 20K Chinese vocabulary on a general Chinese corpus and merges it with the 32K vocabulary of the original LLaMA model (the HF implementation of LLaMA tokenization uses the BBPE algorithm and is also backed by SentencePiece).
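Below is a minimal sketch of such a merge using SentencePiece's protobuf model definition; it follows the general approach of the Chinese-LLaMA-Alpaca merge script rather than its exact code, and the file paths are placeholders:

import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model

# Load the original (e.g. LLaMA) tokenizer model and the newly trained Chinese model.
base_sp = spm.SentencePieceProcessor()
base_sp.Load("llama/tokenizer.model")   # placeholder path
new_sp = spm.SentencePieceProcessor()
new_sp.Load("chinese_sp.model")         # placeholder path

base_proto = sp_pb2_model.ModelProto()
base_proto.ParseFromString(base_sp.serialized_model_proto())
new_proto = sp_pb2_model.ModelProto()
new_proto.ParseFromString(new_sp.serialized_model_proto())

# Append pieces from the new vocabulary that are not already in the base vocabulary.
existing = {p.piece for p in base_proto.pieces}
for p in new_proto.pieces:
    if p.piece not in existing:
        piece = sp_pb2_model.ModelProto().SentencePiece()
        piece.piece = p.piece
        piece.score = 0
        base_proto.pieces.append(piece)

# Save the merged tokenizer model.
with open("merged_tokenizer.model", "wb") as f:
    f.write(base_proto.SerializeToString())

The resulting .model file can then be loaded like any other SentencePiece model.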
Conclusion
This article explained the basic principles and usage of SentencePiece. When analyzing problems in a particular domain, we can use SentencePiece to train a tokenization model on books and documents from that domain; SentencePiece makes no assumptions about the content being analyzed, and the more training data there is, the better the resulting model.
Reference documentation:
- SentencePieces
- BPE, WordPiece and SentencePiece
- Tokenizer in large models: BPE, WordPiece, Unigram LM, SentencePiece
- sentencepiece principles and practice
- [OpenLLM 008] Tokenizer, the basic component of large models: a 10,000-word deep dive into tokenization algorithms and tokenizers in LLMs (BPE/WordPiece/ULM & beyond)
- Summary of the tokenizers