Interpreting Tokens in Large Language Models (LLMs)

When people talk about the size of large language models, the parameter count gives us an idea of how complex the neural network’s structure is, and the token count gives us an idea of how much data was used to train those parameters.

As Dr. Lu Qi said, large language models provide impressive capabilities for tasks ranging from text generation to question answering. They are not only revolutionizing the field of natural language processing (NLP) but also, as foundation models, changing the entire software ecosystem.

A key aspect of these models that is often overlooked is the role of “tokens”, the individual units of information the model processes. Large language models (LLMs) do not operate on raw text. Instead, the text is split into tokens, each token is mapped to a numerical ID, and these IDs are fed to the model for processing.

In the blockchain world, a token is a kind of pass or voucher. So what does a token represent in an LLM?

1. What is a token?

In an LLM, a token is the smallest unit of text that the model can read and generate; it is the model’s basic unit of language. Depending on the tokenization scheme, a token can be a word, part of a word, or even a single character. Each token is assigned a numerical identifier, and sequences of these identifiers form the model’s input and output.

In general, tokens can be viewed as fragments of words. They are not necessarily split exactly at the beginning or end of a word, and they can include trailing spaces, subwords, or even larger linguistic units. Tokens serve as the bridge between raw text and the numerical representation an LLM works with, and the LLM relies on them for every task it handles, such as writing, translating, and answering queries.

Here are some useful rules of thumb to help understand token length:

1 token ~= 4 chars in English
1 token ~= 3/4 words
100 tokens ~= 75 words
or
1-2 sentences ~= 30 tokens
1 paragraph ~= 100 tokens
1,500 words ~= 2048 tokens
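As a rough illustration, these rules of thumb can be turned into a quick estimator. This is only a sketch for English text: the function names are hypothetical, and a real tokenizer will give different counts for any specific input.

```python
import math

# Ratios taken from the rules of thumb listed here (English text only).
CHARS_PER_TOKEN = 4
WORDS_PER_TOKEN = 0.75


def estimate_tokens_from_chars(text: str) -> int:
    """Rough token estimate from character count (1 token ~= 4 chars)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(text: str) -> int:
    """Rough token estimate from word count (1 token ~= 3/4 of a word)."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


sample = "Things are getting crazy."
print("by characters:", estimate_tokens_from_chars(sample))
print("by words:", estimate_tokens_from_words(sample))
```

The two estimates usually disagree slightly, which is a reminder that they are approximations and not a substitute for counting with the model's actual tokenizer.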

In OpenAI’s API, the max_tokens parameter caps the length of the generated response; for example, setting max_tokens to 60 limits the response to at most 60 tokens. You can inspect how a piece of text is tokenized at https://platform.openai.com/tokenizer.

2. Characteristics of tokens

We can first look at an example in the OpenAI playground: “Dec 31,1993. Things are getting crazy.”

The GPT-3 tokenizer converts this text into a sequence of tokens, each with its own numeric ID.

2.1 Mapping from token to numerical representation

The vocabulary maps tokens to unique numerical representations. An LLM takes numeric input, so each token in the vocabulary is given a unique identifier or index. This mapping allows the LLM to process and manipulate text as a sequence of numbers, enabling efficient computation and modeling.
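A minimal sketch of such a mapping is shown below. The vocabulary, the token strings, and the IDs are all made up for illustration and do not correspond to any real tokenizer.

```python
# Toy vocabulary: tokens and ids are invented for this example only.
vocab = {"Things": 0, " are": 1, " getting": 2, " crazy": 3, ".": 4}

# Reverse mapping, used to turn ids back into text.
id_to_token = {token_id: token for token, token_id in vocab.items()}


def encode(tokens):
    """Map a list of token strings to their numeric ids."""
    return [vocab[token] for token in tokens]


def decode(ids):
    """Map numeric ids back to token strings and join them into text."""
    return "".join(id_to_token[token_id] for token_id in ids)


tokens = ["Things", " are", " getting", " crazy", "."]
ids = encode(tokens)
print(ids)
print(decode(ids))
```

Note that the spaces live inside the tokens themselves, so decoding is a plain concatenation.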

To capture the meaning of tokens and the semantic relationships between them, LLMs use token embedding techniques, which convert tokens into dense numerical vectors called embeddings. Embeddings encode semantic and contextual information, which lets the LLM understand and generate coherent, context-appropriate text. Architectures such as the transformer use self-attention to learn the dependencies between tokens and produce high-quality embeddings.
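The first step of this process, looking up a vector for each token ID, can be sketched as a simple table lookup. The sizes and the random values here are assumptions for illustration; in a real model the table is learned during training and is far larger.

```python
import random

random.seed(0)  # fixed seed so the toy table is reproducible

VOCAB_SIZE = 5  # hypothetical vocabulary size
EMBED_DIM = 4   # hypothetical embedding dimension

# One row per token id. A real model learns these values; here they are random.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    """Look up the embedding vector for each token id."""
    return [embedding_table[token_id] for token_id in token_ids]


vectors = embed([0, 3, 3])
print(len(vectors), "vectors of dimension", len(vectors[0]))
```

The same token ID always maps to the same row of the table; it is the later self-attention layers that make the representation depend on context.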

2.2 Token-level operations: precisely manipulating text

Token-level operations enable fine-grained manipulation of text data. An LLM can generate tokens, replace tokens, or mask tokens to modify text in meaningful ways. These token-level operations have applications in various natural language processing tasks, such as machine translation, sentiment analysis, and text summarization.

2.3 Limitations of token design

Text is tokenized before being sent to the LLM for generation. A token is how the model sees its input: a single character, a word, part of a word, or some other fragment of text or code. Each model performs this step differently; for example, GPT models use Byte Pair Encoding (BPE).

Each token is assigned an id in the tokenizer’s vocabulary, a numeric identifier that binds a number to the corresponding string. For example, “Matt” is encoded as token number [13448] in GPT, while “Rickard” is encoded as two tokens, “Rick” and “ard”, with ids [8759, 446]. The GPT-3 tokenizer has a vocabulary of 50,257 tokens.

The design of tokens has the following limitations:

Case sensitivity. Words with different capitalization are treated as different tokens: “hello” is token [31373], “Hello” is [15496], and “HELLO” becomes three tokens [13909, 3069, 46].
Inconsistent number chunking. The value “380” is encoded in GPT as a single token, “380”, but “381” is represented as two tokens, [“38”, “1”]. “382” is also two tokens, while “383” is a single token, [“383”]. Some four-digit numbers are tokenized as [“3000”], [“3”, “100”], [“35”, “00”], and [“4”, “500”]. This may be one reason GPT-based models are not always good at math.
Trailing spaces. Some tokens include a space, which can lead to interesting behavior in prompts and completions. For example, “once upon a ” with a trailing space is encoded as [“once”, “ upon”, “ a”, “ ”], whereas “once upon a time” is encoded as [“once”, “ upon”, “ a”, “ time”]. Because “ time” is a single token that includes its leading space, adding a trailing space to the prompt changes the probability that “ time” is the next token.

3. The impact of tokens on LLMs

A common point of confusion is how the number of tokens affects the model’s responses: do more tokens make the model more detailed and specific? In my view, the impact of tokens on large models falls into two areas:

Context window: This is the maximum number of tokens the model can handle at one time. If a task needs more tokens than fit in the context window, the work has to be split into chunks, and consistency between chunks may be lost.

Training data tokens: The number of tokens in a model’s training data is a measure of how much information the model has learned from. However, whether a model is more “general” or “detailed” is not directly related to these token counts.

The concept of tokens is fundamental to understanding how large language models work and how to use them effectively. While the number of tokens a model can handle or has been trained on does affect its performance, the generality or specificity of its responses depends more on its training data, its fine-tuning, and the decoding strategy used to generate the response.

Models trained on diverse data tend to produce general responses, while models trained on specific data tend to produce more detailed, situation-specific responses. For example, a model fine-tuned on medical text might produce more detailed responses to medical prompts.

Decoding strategies also play an important role. Modifying the “temperature” of the softmax function used in the model’s output layer can make the output more diverse (higher temperature) or more deterministic (lower temperature). Setting the temperature value in the OpenAI API adjusts the balance between deterministic and diverse outputs.
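The effect of temperature can be seen in a small sketch of temperature-scaled softmax. The logits are made-up numbers and the function name is hypothetical; this illustrates the general formula, not the internals of any particular API. The sketch rejects a temperature of 0 to avoid dividing by zero.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, dividing by temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability; the result is unchanged.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At lower temperatures the probability mass concentrates on the highest-scoring token; at higher temperatures the distribution flattens, so sampling picks less likely tokens more often.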

It is important to remember that every language model, regardless of its size or the amount of data it was trained on, is only as effective as the data it was trained on, the fine-tuning it received, and the decoding strategy used at inference time.

To push the limits of LLMs, one can try different training and fine-tuning methods and use different decoding strategies. Be aware of the strengths and weaknesses of these models, and always make sure that your use case matches the capabilities of the model you are using.

4. Token application mechanism: tokenization

The formal process of dividing text into tokens is called tokenization. To capture the meaning and grammatical structure of the text, tokenization has to segment it into meaningful components.

Tokenization divides input and output text into smaller units that the LLM can process. It helps models handle different languages, vocabularies, and formats, and it reduces computational and memory costs. It can also affect the quality and diversity of generated text, because it determines the meaning and context each token carries. Depending on the complexity and variability of the text, different methods can be used for tokenization, such as rule-based methods, statistical methods, or neural methods.

OpenAI and Azure OpenAI use a subword tokenization method called “Byte-Pair Encoding (BPE)” for their GPT-based models. BPE repeatedly merges the most frequently occurring pair of characters or bytes into a single token until a target number of tokens, or vocabulary size, is reached. BPE helps models handle rare or unseen words and creates more compact and consistent text representations. It also allows the model to generate new words by combining existing tokens. The larger the vocabulary, the more diverse and expressive the text generated by the model can be. However, a larger vocabulary also requires more memory and computing resources. The choice of vocabulary size therefore depends on the trade-off between the quality and the efficiency of the model.
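The merge loop at the heart of BPE can be sketched in a few lines. This is a simplified, character-level toy under stated assumptions (whitespace-split words, a tiny corpus, hypothetical function names); production tokenizers such as the ones used for GPT models work on bytes and add many practical details.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Start with each word represented as a tuple of single characters.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break  # every word is already a single symbol
        best = max(counts, key=counts.get)  # most frequent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


merges, words = train_bpe("low low low lower lowest", num_merges=2)
print("learned merges:", merges)
print("segmented words:", words)
```

Each learned merge adds one entry to the vocabulary, which is why the number of merges directly controls the vocabulary size discussed here.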

The cost of using a large model can vary widely depending on the number of tokens exchanged with the model and the rates charged for different models. For example, as of February 2023, the rate for using Davinci is $0.06 per 1,000 tokens, while the rate for using Ada is $0.0008 per 1,000 tokens. The rate also varies with the type of use, such as playground and search. Tokenization is therefore an important factor in the cost and performance of running large models.
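A back-of-the-envelope cost estimate follows directly from these rates. The sketch below uses only the two February 2023 figures quoted in this section; the function name and the example token counts are hypothetical, and current prices will differ.

```python
# Rates per 1,000 tokens as quoted in the text (February 2023); not current prices.
PRICE_PER_1K_TOKENS = {"davinci": 0.06, "ada": 0.0008}


def estimate_cost(model, prompt_tokens, completion_tokens):
    """Estimate the cost in USD of one request from its token counts."""
    total_tokens = prompt_tokens + completion_tokens
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS[model]


# Example: a 1,500-token prompt with a 500-token completion.
for model in PRICE_PER_1K_TOKENS:
    print(f"{model}: ${estimate_cost(model, 1500, 500):.4f}")
```

Both the prompt and the completion count toward the total, so long prompts are billed even when the response is short.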

4.1 Seven types of tokenization

Tokenization involves segmenting text into meaningful units to capture its semantic and syntactic structure. Various tokenization techniques can be employed, such as word level, sub-word level (e.g., using byte pair encoding or WordPiece), or character level. Each technique has its own advantages and trade-offs, depending on the needs of a specific language and a specific task.

Byte Pair Encoding (BPE): builds a subword vocabulary for the model by repeatedly merging frequently occurring character or subword pairs.
Subword-level tokenization: splits words into smaller units, which is important for morphologically complex languages and large vocabularies.
Word-level tokenization: the basic approach, in which each word is a separate token; simple but limited.
SentencePiece: segments text using learned subword pieces.
WordPiece: builds subword units using a different merging criterion from BPE.
Byte-level tokenization: treats each byte as a token, which handles text diversity and is very useful for multilingual tasks.
Hybrid tokenization: combines word-level and subword-level tokenization to balance fine detail and interpretability.

LLMs have been extended to handle multilingual and multimodal inputs. To accommodate the diversity of these data, specialized tokenization methods have been developed. Multilingual tokenization handles multiple languages in a single model by using language-specific tokens or subword techniques. Multimodal tokenization combines text with other modalities such as images or audio, using techniques such as fusion or concatenation to represent disparate data sources effectively.

4.2 The importance of tokenization

Tokenization plays a crucial role in the efficiency, flexibility, and generalization capabilities of LLMs. By breaking text into smaller, manageable units, an LLM can process and generate text more efficiently, reducing computational complexity and memory requirements. Additionally, tokenization provides flexibility by adapting to different languages, domain-specific terminology, and even emerging textual forms such as internet slang or emoticons. This flexibility allows LLMs to handle a wide range of text inputs, enhancing their applicability in different domains and user contexts.

The choice of tokenization technique involves a trade-off between granularity and semantic understanding. Word-level tokens capture the meaning of individual words, but they struggle with out-of-vocabulary (OOV) terms and morphologically rich languages. Subword-level tokenization provides greater flexibility and handles OOV terms by breaking words into subword units. However, correctly understanding the meaning of subword tokens in the context of the entire sentence is a challenge. The choice of tokenization technique depends on the specific task, language characteristics, and available computing resources.

4.3 Challenges faced by tokenization: processing noisy or irregular text data

Real-world text data often contains noise, irregularities, or inconsistencies. Tokenization faces challenges when dealing with sentences that contain misspellings, abbreviations, slang, or grammatical errors. Handling this noisy data requires robust preprocessing techniques and domain-specific adjustments to tokenization rules. Furthermore, tokenization can run into difficulties with languages that have complex writing systems, such as logographic scripts or languages without clear word boundaries. Solving these challenges often involves specialized tokenization methods or adaptations of existing tokenizers.

Tokenization is model specific. Depending on the model’s vocabulary and tokenization scheme, tokens may have different sizes and meanings. For example, words like “running” and “ran” can be represented by different tokens, which affects the model’s understanding of tense or verb form. Different models train their own tokenizers: although LLaMA also uses BPE, its tokens differ from ChatGPT’s, which makes preprocessing and multi-modal modeling more complex.

5. Use of tokens in LLM applications

We first need to know how many tokens the current task uses. Then, when we run into the token length limits of large models, we can try some workarounds.

5.1 Token usage status

Here we use OpenAI’s API and the LangChain application framework to build a simple application that reports the token usage for a given text input.

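from langchain.llms import OpenAI
from langchain.callbacks import get_openai_callback

# Requires the OPENAI_API_KEY environment variable to be set.
llm = OpenAI(model_name="text-davinci-002", n=2, best_of=2)

# The callback accumulates token usage for every call made inside the block.
with get_openai_callback() as cb:
    result = llm("Tell me a joke")
    print(cb)  # summary of token usage and cost for the calls above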

For agent-type applications, a similar approach reports the token statistics for the whole agent run.

from langchain.agents import load_tools
from langchain.agents import initialize_agent
from langchain.agents import AgentType
from langchain.callbacks import get_openai_callback
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)
# The "serpapi" tool needs a SerpAPI key in the SERPAPI_API_KEY environment variable.
tools = load_tools(["serpapi", "llm-math"], llm=llm)
agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)
# Token counts are summed over every LLM call the agent makes while answering.
with get_openai_callback() as cb:
    response = agent.run(
        "Who is Olivia Wilde's boyfriend? What is his current age raised to the 0.23 power?"
    )
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Total Cost (USD): ${cb.total_cost}")

5.2 Token length limits in LLMs and how to respond

Large models such as GPT-3/4 and LLaMA have a maximum number of tokens, beyond which they cannot accept input or generate output.

Generally, we can try the following methods to work around the token length limit:

Truncation

Truncation involves removing part of the input text to fit within token constraints. This can be done by removing the beginning or end of the text, or a combination of both. However, truncation may result in the loss of important information and may affect the quality and consistency of the output produced.
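A minimal sketch of these three truncation options is shown below. It works on any list of tokens; the function name and the `keep` parameter are hypothetical and chosen only for this example.

```python
def truncate(tokens, max_tokens, keep="head"):
    """Shorten a token list to at most max_tokens items.

    keep="head" drops the end, keep="tail" drops the beginning,
    keep="both" keeps the start and the end and drops the middle.
    """
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    if len(tokens) <= max_tokens:
        return list(tokens)
    if keep == "head":
        return list(tokens[:max_tokens])
    if keep == "tail":
        return list(tokens[len(tokens) - max_tokens:])
    if keep == "both":
        head = (max_tokens + 1) // 2
        tail = max_tokens - head
        return list(tokens[:head]) + list(tokens[len(tokens) - tail:])
    raise ValueError(f"unknown keep mode: {keep!r}")


tokens = list(range(10))  # stand-in for real token ids
print(truncate(tokens, 4, keep="head"))
print(truncate(tokens, 4, keep="tail"))
print(truncate(tokens, 4, keep="both"))
```

Which end to keep depends on the task: for chat history the most recent tokens usually matter most, while for documents the opening often carries the key context.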

Sampling

Sampling is a technique that randomly selects a subset of tokens from the input text. This allows you to retain some diversity in the input and can help generate different outputs. However, this approach (similar to truncation) may result in the loss of contextual information and reduce the quality of the generated output.
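One possible reading of this idea is sketched below: pick a random subset of positions and keep the surviving tokens in their original order. The function name and the `seed` parameter are assumptions made for the example.

```python
import random


def sample_tokens(tokens, max_tokens, seed=None):
    """Randomly keep max_tokens tokens while preserving their original order."""
    if len(tokens) <= max_tokens:
        return list(tokens)
    rng = random.Random(seed)  # local generator so the global state is untouched
    kept_positions = sorted(rng.sample(range(len(tokens)), max_tokens))
    return [tokens[i] for i in kept_positions]


tokens = list(range(20))  # stand-in for real token ids
print(sample_tokens(tokens, 5, seed=1))
```

Because the dropped tokens are chosen at random, the result can break sentences apart, which is the loss of context mentioned here.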

Reorganization

Another approach is to split the input text into smaller chunks or segments that fit within the token limit and process them sequentially. This way each chunk can be processed independently, and the outputs can be concatenated to obtain the final result.
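The splitting step can be sketched as follows. The optional `overlap` parameter is an addition of this example, not something described in the text; it repeats a few tokens between neighboring chunks to soften the loss of context at chunk boundaries.

```python
def chunk_tokens(tokens, max_tokens, overlap=0):
    """Split a token list into consecutive chunks of at most max_tokens items."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in the range [0, max_tokens)")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last chunk already reaches the end of the input
    return chunks


tokens = list(range(10))  # stand-in for real token ids
print(chunk_tokens(tokens, 4))
print(chunk_tokens(tokens, 4, overlap=1))
```

Each chunk would then be sent to the model separately, and the per-chunk outputs combined afterwards.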

Encoding

Encoding and decoding are common natural language processing techniques that convert text into a numerical representation and back. These techniques can be used to compress, decompress, truncate, or expand text to fit the token constraints of the language model. This approach requires additional preprocessing steps that may affect the readability of the generated output.

Fine-tuning

Fine-tuning adapts a pre-trained language model to a specific task or domain using a smaller amount of task-specific data. It can be used to work around token limits by training the model to predict the next token in text that has been divided into smaller parts, each of which falls within the model’s token limit.

6. Prospects of token-related technologies

Although tokens have traditionally represented units of text, the concept of a token is moving beyond linguistic elements. Recent advances have explored tokenizing other modalities such as images, audio, or video, allowing LLMs to process and generate text alongside these modalities. This multimodal approach provides new opportunities for understanding and generating text in the context of rich and diverse data sources. It enables LLMs to analyze image captions, generate text descriptions, and even provide detailed audio transcriptions.

Tokenization is a dynamic and evolving field of research. Future advances may focus on addressing the limitations of tokenization, improving OOV handling, and adapting to the needs of emerging languages and text formats. Tokenization techniques will also continue to improve by incorporating domain-specific knowledge and using contextual information to enhance semantic understanding. The continued development of tokenization will further enable LLMs to process and generate text with higher accuracy, efficiency, and adaptability.

7. Summary

Tokens are the basic building blocks behind an LLM’s language processing capabilities. Understanding the role of tokens in LLMs, as well as the challenges and advances in tokenization, helps us realize the full potential of these models. As we continue to explore the world of tokens, we will change the way machines understand and generate text, push the boundaries of natural language processing, and promote innovative applications in various fields.

PS. One more thing: for some numbers worth knowing when developing large model applications, see the Anyscale post listed in the reference materials.

[Reference materials and related reading]

https://python.langchain.com/docs
https://blog.langchain.dev/
OpenAI: What are tokens and how to count them?, https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
Predicting Million-byte Sequences with Multiscale Transformers, https://arxiv.org/pdf/2305.07185.pdf
https://learn.microsoft.com/en-us/semantic-kernel/prompt-engineering/tokens
https://www.anyscale.com/blog/num-every-llm-developer-should-know