[Create your own tokenizer] (1) – WordPiece tokenizer

[Create your own tokenizer] WordPiece tokenizer
[Create your own tokenizer] BPE tokenizer
[Create your own tokenizer] Unigram tokenizer

1 Overview of the steps

Tokenization consists of the following steps (sketched in code right after the list):

  1. Normalization (necessary cleaning of the text, such as removing whitespace and accents, Unicode normalization, etc.).
  2. Pre-tokenization (splitting the input into words).
  3. Passing the input to the model (which uses the pre-tokenized words to produce a sequence of tokens).
  4. Post-processing (adding the tokenizer's special tokens, generating the attention mask and token type IDs).
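Each of these steps maps onto a component of a Tokenizer object in the Tokenizers library. A minimal sketch (the concrete choices for our WordPiece tokenizer are made step by step in section 3):

# Sketch: how the four pipeline steps map onto components of a Tokenizer object.
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers

tok = Tokenizer(models.WordPiece(unk_token="[UNK]"))   # step 3: the model
tok.normalizer = normalizers.BertNormalizer()          # step 1: normalization
tok.pre_tokenizer = pre_tokenizers.BertPreTokenizer()  # step 2: pre-tokenization
# step 4: post-processing is configured via tok.post_processor (see section 3.5)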

2 Data preparation

from datasets import load_dataset

dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")


def get_training_corpus():
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]

The get_training_corpus() function is a generator that yields batches of 1,000 texts, which we will use later to train the tokenizer.
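As a quick, purely illustrative sanity check, we can pull one batch from the generator:

# Illustrative check: each batch yielded by the generator is a list of up to 1000 texts.
first_batch = next(get_training_corpus())
print(type(first_batch), len(first_batch))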
Tokenizers can also be trained directly on text files. Here is how to generate a text file containing all the texts from WikiText-2, which we can then use locally:

with open("wikitext-2.txt", "w", encoding="utf-8") as f:
    for i in range(len(dataset)):
        f.write(dataset[i]["text"] + "\n")

3 The whole process

3.1 Imports and initialization

To build a tokenizer using the Tokenizers library, first we instantiate a Tokenizer object, then set its normalizer, pre_tokenizer, post_processor, and decoder properties to the desired values.
In this example we will create a tokenizer with a WordPiece model:

from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

We need to specify unk_token so the model knows what to return when it encounters characters it hasn't seen before. Other parameters we can set here include the model's vocabulary (we are going to train the model, so we don't need to set it) and max_input_chars_per_word, which specifies a maximum length for each word (words longer than this will be split).
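Purely as an illustration, here is the same instantiation with max_input_chars_per_word spelled out (100 is the library's default value, shown only to make the parameter visible; we keep using the tokenizer built above):

# Illustrative only: the same model with the optional word-length cap made explicit.
example_tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]", max_input_chars_per_word=100))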

3.2 Normalization

The first step of the pipeline is normalization, so let's start there. Since BERT is so widely used, there is a ready-made BertNormalizer with the classic options we can set for BERT: lowercase and strip_accents, which convert text to lowercase and remove accents, respectively; clean_text, which removes all control characters and replaces repeated spaces with a single one; and handle_chinese_chars, which places spaces around Chinese characters. To replicate the bert-base-uncased tokenizer, we can simply set this normalizer:

"""
The first step: normalization
For Bert is BertNormalizer
The operations are:
lowercase: convert to lowercase
strip_accents: Strip individuality (accents)
clean_text: removes all control characters and replaces repeated spaces with a single control character
handle_chinese_chars: put spaces around Chinese characters
"""
# Use the off-the-shelf BertNormalizer
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
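We can already peek at its effect (illustrative; the exact spacing in the output may vary):

# Quick look: lowercasing, accent stripping, and spacing around Chinese characters
# are all handled by BertNormalizer.
print(tokenizer.normalizer.normalize_str("Héllò 世界"))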

In general, though, when building a new tokenizer you won't have such a handy normalizer already implemented in the Tokenizers library, so let's see how to create the BERT normalizer by hand. The library provides a Lowercase normalizer and a StripAccents normalizer, and you can compose several normalizers with a Sequence:

# No ready-made normalizer this time:
# manually recreate the BERT normalizer
tokenizer.normalizer = normalizers.Sequence(
    [
        # NFD Unicode normalization
        normalizers.NFD(),
        # convert to lowercase
        normalizers.Lowercase(),
        # remove accents
        normalizers.StripAccents(),
    ]
)

The NFD Unicode normalizer is also used here, otherwise the StripAccents normalizer would not recognize accented characters correctly and thus would not be able to remove them.
As we saw before, we can use the normalize_str() method of a normalizer to see its effect on a given text:

"""
Use the normalize_str() method to verify whether the normalizer has the expected effect
"""
print(tokenizer.normalizer.normalize_str("Héllò h?w are u?"))
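To see why NFD matters, here is a small, illustrative comparison, assuming the input uses precomposed accented characters:

# Without NFD, a precomposed "é" is a single code point, so StripAccents has no
# combining mark to remove; with NFD it is first decomposed into "e" + combining accent.
no_nfd = normalizers.Sequence([normalizers.Lowercase(), normalizers.StripAccents()])
with_nfd = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)
print(no_nfd.normalize_str("Héllò hôw are ü?"))    # accents are likely left in place
print(with_nfd.normalize_str("Héllò hôw are ü?"))  # "hello how are u?"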

3.3 Pre-tokenization

Likewise, here is a pre-built BertPreTokenizer that we can use:

# Rely on the ready-made BertPreTokenizer in the library
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

Or we can build it from scratch:

tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

Note that the Whitespace pre-tokenizer splits on whitespace and on all characters that are not letters, digits, or underscores, so in effect it splits on whitespace and punctuation. For example:

tokenizer.pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")

will output:

[('Let', (0, 3)), ("'", (3, 4)), ('s', (4, 5)), ('test', (6, 10)), ('my', (11, 13)), ('pre', (14, 17)),
 ('-', (17, 18)), ('tokenizer', (18, 27)), ('.', (27, 28))]

If you only want to split on whitespace, you should use the WhitespaceSplit pre-tokenizer:

pre_tokenizer = pre_tokenizers.WhitespaceSplit()
pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")

will output

[("Let's", (0, 5)), ('test', (6, 10)), ('my', (11, 13)), ('pre-tokenizer.', (14 , 28))]

As with the normalizers step, you can also use a sequence to combine multiple different pre-tokenizers:

pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Punctuation()]
)
pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")

This gives the same output as before:

[('Let', (0, 3)), ("'", (3, 4)), ('s', (4, 5)), ('test', (6, 10)), ('my', (11, 13)), ('pre', (14, 17)),
 ('-', (17, 18)), ('tokenizer', (18, 27)), ('.', (27, 28))]

3.4 Model

The next step of the pipeline is running the input through the model. We already specified our model at initialization, but it still needs to be trained, which requires a WordPieceTrainer. The main thing to remember when instantiating a trainer in the Tokenizers library is that you need to pass it all the special tokens you intend to use; otherwise they won't be added to the vocabulary, since they do not appear in the training corpus:

special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)

Besides specifying vocab_size and special_tokens, we can also set min_frequency (the number of times a word must appear to be included in the vocabulary) or change continuing_subword_prefix (if we want to use something other than ##).
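For illustration, here is a trainer with those extra options spelled out (the values are arbitrary examples, not recommendations):

# Illustrative only: customizing the extra WordPieceTrainer options mentioned above.
custom_trainer = trainers.WordPieceTrainer(
    vocab_size=25000,
    special_tokens=special_tokens,
    min_frequency=2,                 # a word must appear at least twice to be considered
    continuing_subword_prefix="##",  # the default prefix, shown here explicitly
)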
To train our model using the iterator we defined earlier, we just need to execute this command:

tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

We can also use text files to train our tokenizer as follows (we need to reinitialize an empty WordPiece model before training):

tokenizer.model = models.WordPiece(unk_token="[UNK]")
tokenizer.train(["wikitext-2.txt"], trainer=trainer)

In both cases, we can then test the tokenizer on a text by calling the encode() method:

encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

will output:

['let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '.']

The encoding we get is an Encoding object, which contains all the outputs of the tokenizer in its various attributes: ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, and overflowing.
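For example (illustrative), here are a few of those attributes for the encoding above:

# A quick look at some of the Encoding attributes listed above.
print(encoding.ids)             # the token IDs in the vocabulary
print(encoding.offsets)         # (start, end) character spans in the original text
print(encoding.attention_mask)  # all 1s here, since nothing was padded or truncated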

3.5 Post-processing

The last step of the pipeline is post-processing. We need to add a [CLS] token at the beginning and a [SEP] token at the end (or after each sentence, if we have a pair of sentences). For this we will use a TemplateProcessing post-processor, but first we need to know the IDs of the [CLS] and [SEP] tokens in our vocabulary:

cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print(cls_token_id, sep_token_id)
out: 2 3

To write the template for TemplateProcessing, we have to specify how to treat a single sentence and a pair of sentences. For both, we write the special tokens we want to use; the first (or single) sentence is represented by $A, while the second sentence (if encoding a pair) is represented by $B. For each of these (special tokens and sentences), we also specify the corresponding token type ID after a colon.
The classic BERT template looks like this:

tokenizer.post_processor = processors.TemplateProcessing(
    single=f"[CLS]:0 $A:0 [SEP]:0",
    pair=f"[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)

Note that we need to pass the IDs of the special tokens so that the tokenizer can properly convert them to their IDs.
With these added, going back to the previous example we get:

encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

output:

['[CLS]', 'let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '.', '[SEP]']

On a pair of sentences, we get the correct result:

encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences.")
print(encoding.tokens)
print(encoding.type_ids)

output:

['[CLS]', 'let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '...', '[SEP]', 'on', 'a', 'pair', 'of', 'sentences', '.', '[SEP]']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

We're almost done building this tokenizer. The last step is to include a decoder:

tokenizer.decoder = decoders.WordPiece(prefix="##")

Test it on the previous encoding:

tokenizer.decode(encoding.ids)
out: "let's test this tokenizer... on a pair of sentences."

3.6 Save and load

Save

We can save the tokenizer in a JSON file as follows:

tokenizer.save("tokenizer.json")

Load

We can then reload that file in the Tokenizer object using the from_file() method:

new_tokenizer = Tokenizer.from_file("tokenizer.json")
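As a quick, illustrative check, the reloaded tokenizer behaves exactly like the original:

# The reloaded tokenizer reproduces the original tokenizer's output.
print(new_tokenizer.encode("Let's test this tokenizer.").tokens)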

4 Using the tokenizer in Transformers

To use this tokenizer in Transformers, we need to wrap it in a PreTrainedTokenizerFast. We can use the generic class, or if our tokenizer corresponds to an existing model, the corresponding class (for example, BertTokenizerFast). If you’re building a brand new tokenizer, you’ll want to use the first option.
To wrap the tokenizer in PreTrainedTokenizerFast, we can either pass our built tokenizer as tokenizer_object, or our saved tokenizer file as tokenizer_file. The key thing to remember is that we have to manually set all the special tokens, because the class cannot deduce from the tokenizer object which tokens are mask tokens, [CLS] tokens, etc.:

from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    # tokenizer_file="tokenizer.json", # Or, load from file
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

If you use a specific tokenizer class (like BertTokenizerFast), you only need to specify the special tokens that differ from the defaults (here, none):

from transformers import BertTokenizerFast

wrapped_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)

You can then use this tokenizer like any other Transformers tokenizer, and save it with the save_pretrained() method.
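A minimal usage sketch (the directory name my-wordpiece-tokenizer is just an example):

# The wrapped tokenizer behaves like any other fast tokenizer in Transformers.
enc = wrapped_tokenizer("Let's test this tokenizer.")
print(enc["input_ids"])

# Persist it in the standard Transformers format (example directory name).
wrapped_tokenizer.save_pretrained("my-wordpiece-tokenizer")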

Reference:
https://huggingface.co/learn/nlp-course/chapter6/8?fw=pt#building-a-wordpiece-tokenizer-from-scratch