Revealed: Deploy the InternLM-20B large model easily and efficiently with a single 3090!

On September 20, Shanghai AI Laboratory released InternLM-20B, the 20-billion-parameter version of the InternLM large model. Its overall performance is excellent: with less than one third of the parameters, its evaluation results reach the level of Llama2-70B.

Twenty billion parameters take up approximately 40 GB of GPU memory. Does that mean you have to rent an expensive A100 server just to run inference with InternLM-20B? Is there a more economical way? The answer is yes: low-bit quantization and inference.

LMDeploy supports 4-bit quantization of LLMs, which cuts the memory overhead of the weights to 1/4. Only 16 GB of GPU memory is needed to run inference with a 20B model, and the inference speed is more than 2x that of FP16. A single 3090 graphics card can comfortably run InternLM-20B inference. Isn't that amazing?
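As a rough sanity check on those numbers, here is a small back-of-the-envelope calculation (round figures only; it ignores quantization scales, the KV cache, and activations):

# Approximate weight storage for a 20B-parameter model at different precisions.
n_params = 20e9
gib = 1024 ** 3

fp16_bytes = n_params * 2    # 2 bytes per FP16 weight
int4_bytes = n_params * 0.5  # 0.5 bytes per 4-bit weight

print(f"FP16 weights: {fp16_bytes / gib:.1f} GiB")  # ~37 GiB, beyond a 24 GiB 3090
print(f"INT4 weights: {int4_bytes / gib:.1f} GiB")  # ~9 GiB, leaving headroom for the KV cache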

Next, follow this article for a walkthrough of the whole process of quantizing the InternLM-20B model and running inference with LMDeploy on a 3090.

https://github.com/InternLM/lmdeploy


4-bit weight quantization

How to use

LMDeploy has uploaded the quantized internlm-chat-20b model to HuggingFace and ModelScope. You can download it directly and refer to the "4-bit model inference" section below to deploy it.


Alternatively, use the following commands to quantize the InternLM-20B model weights to 4-bit from scratch.

# Download the internlm-chat-20b model locally
git lfs install
git clone --depth=1 https://huggingface.co/internlm/internlm-chat-20b
# Install lmdeploy
pip install 'lmdeploy>=0.0.13'
# Calibrate
lmdeploy lite calibrate --model-path ./internlm-chat-20b --work-dir ./internlm-chat-20b-4bit
# Quantize the weights
lmdeploy lite auto_awq --model-path ./internlm-chat-20b --work-dir ./internlm-chat-20b-4bit

Quantization principle

Before introducing LMDeploy's quantization scheme, two concepts need to be defined:

  • Compute-bound: most of the inference time is spent on numerical computation. In compute-bound scenarios, speed can be improved with faster hardware compute units, for example W8A8 quantization, which uses INT8 Tensor Cores to accelerate the computation.

  • Memory-bound: most of the inference time is spent reading data from memory. In memory-bound scenarios, optimization usually means reducing the number of memory accesses, increasing the compute-to-memory-access ratio, or reducing the total amount of data read.

Because common LLMs use a decoder-only architecture, most of the actual inference time is spent in the token-by-token generation stage (the decoding stage), which is a typical memory-bound scenario. As shown in the figure below, the FP16 peak compute of an A100 is 312 TFLOPS; only when the batch size reaches the level of 128 does computation become the inference bottleneck. However, because the LLM itself is very large, the KV cache generated during inference also occupies a lot of GPU memory, and there are other factors such as dynamic batching, it is hard to reach a batch size as large as 128 in practice.

[Figure: compute-bound vs. memory-bound regions on an A100 as batch size grows]
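The figure's conclusion can also be checked with a rough roofline estimate. The sketch below assumes an A100's 312 TFLOPS FP16 peak and roughly 2 TB/s of HBM bandwidth (both approximations, not LMDeploy measurements); during decoding, each FP16 weight (2 bytes) read from memory is used for roughly 2*B FLOPs, i.e. one multiply-add per sequence in a batch of size B.

# Back-of-the-envelope roofline check: why decoding is memory-bound.
peak_flops = 312e12  # A100 FP16 Tensor Core peak, FLOPs/s
mem_bw = 2.0e12      # approximate HBM bandwidth, bytes/s

def arithmetic_intensity(batch_size: int) -> float:
    # 2*B FLOPs per FP16 weight element, 2 bytes per element
    return 2 * batch_size / 2

ridge = peak_flops / mem_bw  # intensity where compute becomes the bottleneck (~156 FLOPs/byte)
for b in (1, 8, 32, 128, 256):
    bound = "compute-bound" if arithmetic_intensity(b) >= ridge else "memory-bound"
    print(f"batch {b:>3}: {bound}")

With these assumptions the crossover sits in the low hundreds of sequences per batch, consistent with the "batch size around 128" observation above.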

So, how can quantization mitigate the memory-bound problem in LLM inference? The answer is 4-bit weight-only quantization (W4A16).

4-bit weight quantization turns the FP16 model weights into INT4. During kernel computation, the amount of weight data read from memory drops to 1/4 of the FP16 model, which greatly reduces the memory-access cost and speeds up decoding. It also saves GPU memory: the same device can host a larger model and longer conversation lengths, killing two birds with one stone.

That covers the rationale for 4-bit quantization, but what does "weight only" mean? As mentioned above, compute-bound scenarios can use W8A8 quantization and INT8 Tensor Cores to accelerate numerical computation. Weight-only quantization leaves the numerical computation untouched: during computation, the 4-bit weights are first dequantized back to FP16 (this happens inside the kernel; the weights are still read from global memory as 4-bit), and the FP16 compute units are still used.
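To make "weight only" concrete, here is a minimal NumPy sketch of group-wise 4-bit quantization and the on-the-fly dequantization back to FP16. It only illustrates the idea (asymmetric INT4 with group_size=128); LMDeploy's real kernel does the dequantization inside the GPU kernel while keeping the packed 4-bit weights in global memory.

import numpy as np

def quantize_w4(w_fp16, group_size=128):
    # Quantize a weight vector to unsigned 4-bit values with a per-group scale and zero point.
    w = w_fp16.reshape(-1, group_size).astype(np.float32)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0            # 4 bits -> 16 levels
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale) + zero, 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize_w4(q, scale, zero):
    # What a W4A16 kernel does on the fly: INT4 -> FP16, then ordinary FP16 math.
    return ((q.astype(np.float32) - zero) * scale).astype(np.float16)

w = np.random.randn(4096).astype(np.float16)
x = np.random.randn(4096).astype(np.float16)
q, s, z = quantize_w4(w)
w_hat = dequantize_w4(q, s, z).reshape(-1)
print("mean reconstruction error:", float(np.abs(w - w_hat).mean()))
print("FP16 dot product on dequantized weights:", float(x @ w_hat))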

LMDeploy uses the AWQ algorithm for 4-bit weight-only quantization. AWQ, proposed by Song Han's team at MIT, is designed around the activation "outliers" that frequently appear in LLM inference. Compared with the popular GPTQ algorithm, AWQ produces a quantized model better and faster, and the AWQ kernel is also faster at inference than the GPTQ kernel. PS: the AWQ kernel in LMDeploy is faster than the official open-source kernel.

Readers interested in how the AWQ or GPTQ algorithms work are welcome to discuss with the developers in LMDeploy Discussions. We will also explain the AWQ and GPTQ algorithms in detail in a future article.

https://github.com/InternLM/lmdeploy/discussions

4-bit model inference

Whether you downloaded the quantized model directly or quantized it from scratch, at this point you should have the quantized internlm-chat-20b model at hand, in the folder internlm-chat-20b-4bit.

Model conversion

Before inference, the model must first be converted into the format required by LMDeploy with the lmdeploy convert command. Let's get to know this command: lmdeploy convert --help prints its detailed description. The more important parameters are:

SYNOPSIS
    lmdeploy convert MODEL_NAME MODEL_PATH <flags>


POSITIONAL ARGUMENTS
    MODEL_NAME
        The name of the to-be-deployed model, such as llama-7b, llama-13b, vicuna-7b and etc.
    MODEL_PATH
        The directory path of the model


FLAGS
    -m, --model_format=MODEL_FORMAT
        Type: Optional[str]
        Default: None
        The format of the model, fb or hf. 'fb' stands for META's llama format, and 'hf' means huggingface format.
    -d, --dst_path=DST_PATH
        Type: str
        Default: './workspace'
        The destination path that saves outputs.
    --tp=TP
        Type: int
        Default: 1
        The number of GPUs used for tensor parallelism, which should be 2^n.
    -g, --group_size=GROUP_SIZE
        Type: int
        Default: 0
        A parameter used in AWQ to quantize fp16 weights to 4 bits.

Where:

  • model_name is the short name of the model and is used to look up the model's chat template. For an introduction to chat templates, please refer to the LMDeploy documentation on GitHub.

  • model_format is the format of the input model. For quantized models, use awq.

  • group_size is the group size used when quantizing the model. LMDeploy's inference engine TurboMind currently only supports group_size=128, so set it to 128.

  • tp is the number of GPUs used for inference. LMDeploy splits the model weights according to this parameter and loads them onto different GPUs during inference for tensor-parallel computation.

Then, the conversion command for internlm-chat-20b-4bit is:

lmdeploy convert \
    --model-name internlm-chat \
    --model-path ./internlm-chat-20b-4bit \
    --model-format awq \
    --group-size 128 \
    --tp 1

After completion, the conversion results are stored in the default path ./workspace.

Talk to the model

After converting the model, you can run a command in the console and chat with the model directly to check that the dialogue behaves correctly:

lmdeploy chat turbomind ./workspace

You can also start Gradio and chat with the model through a friendlier WebUI:

lmdeploy serve gradio ./workspace

Open a browser, enter the Gradio server address, and you're good to go!

Inference performance

LMDeploy provides scripts to measure inference performance in two ways:

Static inference performance: with the batch size and the numbers of input and output tokens fixed, measure the token generation speed (tokens/s).

Run the profile_generation.py script to get the token generation throughput and GPU memory overhead under different input combinations. For example:

python profile_generation.py \
    --model-path ./workspace \
    --concurrency 1 2 \
    --prompt-tokens 64 512 \
    --completion-tokens 512 512

The test results on a 3090 are:

[Figure: token generation throughput and GPU memory usage measured on a 3090]

Dynamic inference performance: measure the request throughput (req/s) when requests have variable lengths. This metric can be obtained with the profile_throughput.py script:

# Download conversation data
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
# Execute test script
python3 profile_throughput.py ShareGPT_V3_unfiltered_cleaned_split.json ./workspace --concurrency 4

The test results on a 3090 are:

Throughput requests: 0.92 req/s, 55.17 req/min
Throughput tokens: 396.52 tokens/s
Throughput tokens (output only): 189.46 tokens/s

We noticed that a frequently asked question in LMDeploy's GitHub issues and in the community is how to adjust the inference batch size. The answer lies in the converted model's configuration file, workspace/triton_models/weights/config.ini.

There are three important parameters in config.ini that are closely tied to inference performance and GPU memory overhead:

max_batch_size = 32
cache_max_entry_count = 48
session_len = 2056

  • max_batch_size controls the maximum batch size LMDeploy uses during inference. It must be less than or equal to cache_max_entry_count.

  • cache_max_entry_count is the number of cached conversation sequences. A conversation sequence is one user's ongoing chat record and may contain multiple rounds of dialogue.

  • session_len is the maximum length of a conversation sequence, i.e., the size of the context window. The larger it is, the more GPU memory the sequence's K/V cache occupies.

When GPU memory is insufficient, these three parameters can be lowered to reduce the memory overhead of the inference phase, as in the sketch below.
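For example, here is a minimal sketch that lowers the first two values with Python's standard configparser (the section name inside config.ini may differ between LMDeploy versions, so the snippet searches for the keys instead of hard-coding one):

import configparser

cfg_path = "./workspace/triton_models/weights/config.ini"
cfg = configparser.ConfigParser()
cfg.read(cfg_path)

# Trade some throughput for lower GPU memory use; keep max_batch_size <= cache_max_entry_count.
new_values = {"max_batch_size": "16", "cache_max_entry_count": "16"}

for section in cfg.sections():
    for key, value in new_values.items():
        if key in cfg[section]:
            cfg[section][key] = value

with open(cfg_path, "w") as f:
    cfg.write(f)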

Build an inference service

The inference service can be started with just one simple command.

lmdeploy serve api_server ./workspace

api_server provides a RESTful API compatible with the OpenAI interface. Enter http://{server_ip}:23333 in a browser and you can read and test each API through the Swagger UI. As shown in the figure below, the first three APIs are consistent with the OpenAI interface, while the last one is LMDeploy's own interactive-mode inference interface. It caches the user's conversation history on the server side, so the history does not have to be re-decoded as context on every turn, which improves inference efficiency.

[Figure: Swagger UI listing the api_server endpoints]
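As a quick sanity check of the service, the minimal Python sketch below calls the OpenAI-style chat completion route. The model name is a placeholder; the name the server actually expects can be listed through the /v1/models endpoint.

import requests

url = "http://localhost:23333/v1/chat/completions"  # replace localhost with your server_ip

payload = {
    "model": "internlm-chat-20b",  # placeholder; query GET /v1/models for the served name
    "messages": [{"role": "user", "content": "Hello! Please introduce yourself."}],
    "temperature": 0.7,
}

resp = requests.post(url, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])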

We will explain the usage and internals of api_server in more detail in a future article, so stay tuned.

Conclusion

LMDeploy's 4-bit inference currently supports the Ampere and Ada Lovelace GPU architectures, i.e., 30-series and 40-series cards, A10, A100, and similar GPUs.
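If you are not sure which architecture your card uses, the small PyTorch check below (a convenience sketch, not part of LMDeploy) maps the CUDA compute capability to these architectures:

import torch

# sm_75 = Turing, sm_80/sm_86 = Ampere, sm_89 = Ada Lovelace
major, minor = torch.cuda.get_device_capability(0)
arch = {
    (7, 5): "Turing (20 series, T4)",
    (8, 0): "Ampere (A100)",
    (8, 6): "Ampere (30 series, A10)",
    (8, 9): "Ada Lovelace (40 series)",
}.get((major, minor), f"sm_{major}{minor}")
print(f"{torch.cuda.get_device_name(0)}: {arch}")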

In the upcoming v0.1 release, LMDeploy's inference engine TurboMind will support 4-bit model inference on Turing GPUs, so everyone can happily play with InternLM-20B on their 20-series gaming cards.

In addition, the v0.1 release brings more hard-core features. Follow https://github.com/InternLM/lmdeploy to get the latest updates as soon as possible!
