Custom Graph Component: 1.1-JiebaTokenizer specific implementation

The JiebaTokenizer class inherits from the Tokenizer class, which inherits from the GraphComponent class, which in turn inherits from ABC (the abstract base class). This article uses the example from "Using ResponseSelector to Implement a Campus Recruitment FAQ Bot" to explain in detail the implementation of the methods in the JiebaTokenizer class. 0. List of […]
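The inheritance chain described above can be sketched with Python's `abc` module. This is a simplified stand-in for Rasa's actual classes, not their real interfaces; the toy `JiebaTokenizer` splits on whitespace so the sketch runs without the jieba dependency (the real component calls `jieba.lcut`):

```python
from abc import ABC, abstractmethod

class GraphComponent(ABC):
    """Base interface for a graph component (simplified sketch)."""
    @classmethod
    def create(cls, config: dict) -> "GraphComponent":
        return cls(config)

class Tokenizer(GraphComponent, ABC):
    """Abstract tokenizer: subclasses supply the actual splitting logic."""
    def __init__(self, config: dict) -> None:
        self.config = config

    @abstractmethod
    def tokenize(self, text: str) -> list:
        ...

class JiebaTokenizer(Tokenizer):
    """Toy stand-in: the real implementation delegates to jieba.lcut."""
    def tokenize(self, text: str) -> list:
        # Whitespace split keeps this sketch dependency-free.
        return text.split()

tok = JiebaTokenizer.create({})
print(tok.tokenize("hello world"))  # → ['hello', 'world']
```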

ChatGLM2 source code analysis: `ChatGLMTokenizer`

```python
import os
import torch
from typing import List, Optional, Union, Dict
from sentencepiece import SentencePieceProcessor
from transformers import PreTrainedTokenizer
from transformers.utils import logging, PaddingStrategy
from transformers.tokenization_utils_base import EncodedInput, BatchEncoding

# The underlying tokenizer: a wrapper around the SentencePiece model
class SPTokenizer:
    def __init__(self, model_path: str):
        # reload tokenizer
        assert os.path.isfile(model_path), model_path
        # Load the […]
```

Java: splitting strings with java.util.StringTokenizer

Java: splitting strings with java.util.StringTokenizer. 1. Preface: the java.util package provides the string-splitting utility class StringTokenizer, and common frameworks such as Spring provide their own string utilities (e.g. Spring's StringUtils), all of which are widely used. For example, Spring's StringUtils offers the method: `public static String[] tokenizeToStringArray(@Nullable String str, String delimiters, boolean trimTokens, […]`
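As a rough analogue in this document's other language, Python, the behaviour of splitting on a set of delimiter characters while trimming tokens and dropping empty ones can be sketched with `re.split`. The helper below is illustrative only, not part of Spring or the JDK:

```python
import re

def tokenize_to_string_array(s, delimiters, trim_tokens=True, ignore_empty=True):
    # Split on any single character from `delimiters`, mirroring the
    # StringTokenizer / tokenizeToStringArray style of splitting.
    parts = re.split("[" + re.escape(delimiters) + "]", s)
    if trim_tokens:
        parts = [p.strip() for p in parts]
    if ignore_empty:
        parts = [p for p in parts if p]
    return parts

print(tokenize_to_string_array("a, b;; c", ",;"))  # → ['a', 'b', 'c']
```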

[Tongyi Qianwen] Explaining and resolving Qwen's "tokenizer class not exist" error when loading the tokenizer locally

Abstract: When downloading the model files from the [Hugging Face Model Hub](https://huggingface.co/Qwen/Qwen-7B-Chat/tree/main) to local disk and loading the tokenizer from it with the `from_pretrained` method, an error was encountered. The error originates from the `tokenization_auto.py` file and reads "Tokenizer class QWenTokenizer does not exist or is not currently imported." This means […]

Pure Python implementation! A purer Tokenizer with a higher compression rate

PaperWeekly original · Author | Su Jianlin · Affiliation | Scientific Spaces · Research direction | NLP, neural networks. Currently the most popular tokenizer (word segmenter) for LLMs is probably Google's SentencePiece [1], because it meets some of the ideal characteristics of a tokenizer, such as being language-independent and data-driven, and because it is written in C++, so Tokenize […]
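The "data-driven" property mentioned above can be illustrated with a single byte-pair-encoding merge step in pure Python. This is a toy sketch of the general BPE idea, not SentencePiece's actual algorithm:

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count adjacent symbol pairs across the token sequence.
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    # Replace every occurrence of `pair` with a single merged symbol.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("abababc")
pair = most_frequent_pair(tokens)   # ('a', 'b') occurs 3 times
print(merge_pair(tokens, pair))     # → ['ab', 'ab', 'ab', 'c']
```

Repeating this merge step until a target vocabulary size is reached is what makes the resulting vocabulary data-driven rather than hand-crafted.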

[Create tokenizer by yourself] (3) – Unigram tokenizer

[Create your own tokenizer] WordPiece tokenizer · [Create your own tokenizer] BPE tokenizer · [Create your own tokenizer] Unigram tokenizer. 1. Overall steps: tokenization includes the following steps: normalization (necessary cleaning of the text, such as removing spaces and accents, Unicode normalization, etc.); pre-tokenization (splitting the input into words); passing the input to the model (the Model, which generates a […]
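The normalization and pre-tokenization steps listed above can be sketched in plain Python using only the standard library. This is a simplified illustration, not the Hugging Face tokenizers API:

```python
import unicodedata

def normalize(text):
    # Unicode NFD normalization, then strip combining accents and
    # trim surrounding whitespace.
    text = unicodedata.normalize("NFD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    return text.strip()

def pre_tokenize(text):
    # Naive pre-tokenization: lowercase and split on whitespace.
    return text.lower().split()

print(pre_tokenize(normalize("  Héllo Wörld  ")))  # → ['hello', 'world']
```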

Upgrading the IK tokenizer: hot-updating the dictionary from MySQL

Can the IK tokenizer hot-update from MySQL? The official IK tokenizer only supports hot updates from remote text files, not from MySQL. This article shows how to update the IK tokenizer's dictionary from MySQL. 1. Create a database table: CREATE TABLE `es_extra_main` ( `id` int(11) […]

[Create tokenizer by yourself] (1) – WordPiece tokenizer

[Create your own tokenizer] WordPiece tokenizer · [Create your own tokenizer] BPE tokenizer · [Create your own tokenizer] Unigram tokenizer. 1. Overall steps: tokenization includes the following steps: normalization (necessary cleaning of the text, such as removing spaces and accents, Unicode normalization, etc.); pre-tokenization (splitting the input into words); passing the input to the model (the Model, which uses pre-tokenized […]
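The model step of WordPiece encoding can be sketched as greedy longest-match-first lookup against a vocabulary, with `##` marking word-internal pieces. The vocabulary below is a toy example, not one learned from data:

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first: repeatedly take the longest prefix
    # of the remaining text that exists in the vocabulary.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # word-internal continuation marker
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(cur)
        start = end
    return pieces

vocab = {"un", "##aff", "##able", "##a", "##ff"}
print(wordpiece_encode("unaffable", vocab))  # → ['un', '##aff', '##able']
```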

[Fun AIGC] Training a Tokenizer with sentencepiece

Table of contents: 1. Introduction; 2. Installation; 3. Training your own tokenizer; 4. Running the model; 5. Extensions; 6. Supplement. 1. Introduction: earlier we introduced a character-encoding method in [How to train a Chinese-English translation model] LSTM machine translation seq2seq character encoding (1). That method encodes characters one by one, and a lot […]