Large Model Deployment Plans: Throwing Out a Brick to Attract Jade

Author | Oldpan Editor | oldpan Blog


Riding the current wave of interest, let me briefly talk about large model deployment schemes. As an algorithm engineer who has only done CV deployment, and with LLMs gradually changing daily life, I suddenly realized that LLM deployment is also very important. Large models are very popular and genuinely useful (many vertical scenarios can be targeted with fine-tuning), and there are more and more large models combined with vision. How to deploy a large model is therefore a very important engineering problem, and many companies are working on it.

At present, the open-source implementation with the best results and the most discussion is LLAMA, so what I discuss here is also based on deploying modified versions of LLAMA.

There are many fine-tuned models based on LLAMA, such as vicuna-13b, which currently has the best open-source results, and alpaca-13b, one of the earlier experiments built on llama. You can see:

  • https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard the open-source LLM leaderboard

  • https://lmsys.org/blog/2023-05-03-arena/ Open source comparison based on llama

  • https://github.com/camenduru/text-generation-webui-colab some open source LLM notebooks

As for why LLAMA: there are currently many models based on it, and the base llama itself performs well. Of course, RWKV is also very good, and I have been keeping an eye on it.


See here for details: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

Most of the implementations on GitHub are direct Python inference plus a gradio-based web page for display, which cannot really be called a service (or at least not an elegant one). Generally speaking, a service should satisfy:

  • Fast response: the more tokens per second, the better

  • Stability: a core dump should not break inference, or the service should recover in time

  • Support for various protocols: HTTP, SSE, gRPC, and so on

In fact, the most important point is speed. Second, quantization is a must: the weights of large models are often huge, and quantization effectively shrinks them (saving GPU memory), which is very necessary. Turning all of this into a service takes some effort and hacking. Fortunately, there are already good demo implementations online for us to try. This article summarizes LLM deployment schemes using LLAMA as an example, hoping to spark some ideas.

PS: the ChatGPT we commonly use is said to be built on a Python service. Compared with the time spent in the model itself, the overhead of the Python environment is negligible.

Take LLAMA as an example

LLMs are very large. Take GPT3-175B as an example: about 700 GB of parameters plus roughly 600 GB of activations at runtime, about 1.3 TB in total. Normally it takes 32 A100-40GB cards to hold it.

In practice, consumer-grade graphics cards are much cheaper than professional ones (for example, a 3090 has 24 GB of memory, versus an A10), so deploying LLMs on consumer cards is very cost-effective. If one card is not enough, use two; and if you don't have NVLink, PCIe will do just fine.

Going back to the LLAMA model, there are 7B, 13B, 30B, and 65B versions. The larger versions work better and correspondingly require more video memory (weights can also be offloaded to CPU memory or SSD, but that is much slower than keeping everything in video memory, and speed matters more).

There are many implementations of LLAMA. Here are a few that I have seen, for reference:

  • https://github.com/juncongmoo/pyllama an implementation of the original llama

  • https://github.com/qwopqwop200/GPTQ-for-LLaMa supports quantization: llama quantized to INT4 and INT8

  • https://github.com/tpoisonooo/llama.onnx.git runs llama via ONNX

Quantization and precision

For consumer-grade graphics cards, FP32 simply does not fit. The most basic option is FP16 (llama-7B in FP16 needs 14 GB of video memory, which already rules out most consumer cards), so INT8 and INT4 quantization are very useful. A few examples (a rough memory estimate follows this list):

  • For a 3080 with 10 GB of video memory, 13B INT4 is very cost-effective, and its accuracy is much higher than 7B-FP16

  • For a 3090 with 24 GB of video memory, 30B INT4 can be deployed on a single card, with higher accuracy
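
As a back-of-the-envelope check on these numbers, the weights alone take roughly parameters × bits / 8 bytes. A tiny sketch (it only counts the weights, ignoring activations, the kv-cache, and framework overhead, so real usage is higher):

def weight_memory_gb(n_params_billion, bits_per_weight):
    # rough size of the weights alone: parameters * bits / 8, converted to GiB
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1024**3

print(round(weight_memory_gb(7, 16), 1))   # ~13.0 GiB: llama-7B in FP16
print(round(weight_memory_gb(13, 4), 1))   # ~6.1 GiB: llama-13B in INT4, fits a 10 GB 3080
print(round(weight_memory_gb(30, 4), 1))   # ~14.0 GiB: llama-30B in INT4, fits a 24 GB 3090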

The following figure lists the scores of various open-source pre-trained models on several datasets under different quantization precisions:

I also ran tests myself, on an A6000 with 48 GB of video memory. Based on GPTQ-for-LLaMa, I measured the PPL and required video memory of models of various sizes at different precisions, using the following command:

CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --benchmark 2048 --check

The results for the different sizes and quantization precisions are:

# 7B-FP16
Median: 0.03220057487487793
PPL: 5.227280139923096
max memory(MiB): 13948.7333984375

# 7B-INT8
Median: 0.13523507118225098
PPL: 5.235021114349365
max memory(MiB): 7875.92529296875

# 7B-INT4
Median: 0.038548946380615234
PPL: 5.268043041229248
max memory(MiB): 4850.73095703125

# 13B-FP16
Median: 0.039263248443603516
PPL: 4.999974727630615
max memory(MiB): 26634.0205078125

# 13B-INT8
Median: 0.18153250217437744
PPL: 5.039003849029541
max memory(MiB): 14491.73095703125

# 13B-INT4
Median: 0.06513667106628418
PPL: 5.046994209289551
max memory(MiB): 8677.134765625

# 30B-FP16
OOM

# 30B-INT8
Median: 0.2696110010147095
PPL: 4.5508503913879395
max memory(MiB): 34745.9384765625

# 30B-INT4
Median: 0.1333252191543579
PPL: 4.526902675628662
max memory(MiB): 20070.197265625

The 30B model in FP16 and the 65B model blow up the video memory (OOM). I have not set up my own multi-card environment yet; I will add those results here later.

For now you can look at test results from others to get a rough expectation. For the 65B model, INT4 can fit on two 3090s:


Speed test

I also tested the speed of the Hugging Face Python implementation of llama, as well as the GPTQ-based quantized versions and multi-card setups. Stream mode refers to the Stream implementation in text-generation-webui.
Here, FP16 means directly calling the unquantized FP16 inference in GPTQ-for-LLaMa/llama_inference.py via model.generate, where the model comes from:

from transformers import LlamaForCausalLM

# `model` starts as the local model path; uncomment the kwargs for the Hugging Face INT8 path
model = LlamaForCausalLM.from_pretrained(model, torch_dtype='auto',
                                         # load_in_8bit=True, device_map='auto'
                                         )

And for the quantized versions of the model:

  • The INT8 model is called the same way as FP16, with load_in_8bit set to True, directly using the INT8 implementation in Hugging Face's transformers library

  • The INT4 model is quantized with GPTQ, and its kernels are written in the triton language (not to be confused with the Triton Inference Server mentioned earlier)

| Model | Platform | Precision | Video memory | Speed | Remarks |
| --- | --- | --- | --- | --- | --- |
| llama-7B | A4000 | FP16 | 13.6 GB | Output generated in 1.74 seconds (28.74 tokens/s, 50 tokens); Output generated in 9.63 seconds (25.97 tokens/s, 250 tokens) | GPTQ-for-LLaMa/llama_inference.py; test includes model pre- and post-processing; 99% utilization |
| llama-7B | A4000 | 4-bit | 5 GB | Output generated in 2.89 seconds (17.51 tokens/s, 50 tokens) | |
| llama-7B | A4000 | 4-bit | 5 GB | Output generated in 2.93 seconds (17.11 tokens/s, 50 tokens) | stream mode, one token output at a time |
| llama-7B | dual A4000 | 4-bit | 3.4 + 2.3 GB | Output generated in 2.91 seconds (17.31 tokens/s, 50 tokens) | multi-card test on two A4000s; utilization of both cards around 20-30% |
| llama-7B | A4000 | INT8 | 8.3 GB | Output generated in 10.20 seconds (5.8 tokens/s, 50 tokens) | Hugging Face transformers INT8 implementation; utilization around 25% |

I tested with my A4000. Timing starts after the tokenizer encodes the prompt and ends after the tokenizer decodes the output.
Rough conclusions (a minimal sketch of this kind of measurement follows the list):

  • FP16 is the fastest here, because the INT4 and INT8 kernels are not well optimized (in theory INT8 and INT4 should be much faster than FP16); the triton-based INT4 is clearly better than the Hugging Face INT8 implementation, so INT4 quantization is recommended

  • Stream mode and the normal mode run at about the same speed
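
For reference, here is a minimal sketch of how such a tokens/s measurement can be done with the Hugging Face generate API. The model loaded above is reused; model_dir, the prompt, and the generation settings are illustrative assumptions:

import time
import torch
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained(model_dir)  # model_dir: local llama path (assumption)
input_ids = tokenizer("Tell me about alpacas.", return_tensors="pt").input_ids.cuda()

start = time.time()  # timing starts after encoding, as described above
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=50, do_sample=True, top_p=0.95)
text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
elapsed = time.time() - start  # and ends after decoding

new_tokens = output_ids.shape[1] - input_ids.shape[1]
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.2f} tokens/s")
print(text)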

I was too lazy to test the A6000, so I will add a number found online. The performance gap between the A6000 and the A4000 is roughly like that between a 3090 and a 3070…


Differences between LLMs and ordinary small models in deployment

In my past work I mostly dealt with CV models: detection, classification, recognition, key points. The largest might be 2-3 GB; a typical detection model is around 300 MB, and it gets smaller after FP16 or INT8 quantization, so generally there is no memory anxiety (there are exceptions, of course: sometimes many models have to share one card, as in autonomous driving or other industrial scenarios, and the memory usage of each model must be budgeted carefully).

Deploying a large model like an LLM is different. Even a 6-7B model has more than 20 GB of weights in FP32, and a 65B or 175B model has hundreds of GB. Because the model becomes so large, the original deployment approach has to change in several ways.

Differences in models

  • First, the model is very large, and exporting it to ONNX runs into problems: ONNX (protobuf) restricts how much weight data can be stored in a single file, so large weights need special handling (see the sketch after this list)

  • LLM models generally contain many if-else branches, such as whether to use the kv-cache, which is not very friendly to converters that expect a fixed graph structure (such as TensorRT)

  • We used to run on a single GPU. With multiple GPUs, many commonly used runtimes do not help: onnxruntime and TensorRT (TensorRT has multi-GPU support in internal testing) do not support multi-GPU execution by default

  • For large models, quantization is necessary, because FP16 or FP32 models need too much video memory, which is money. Quantization itself is not easy: QAT is too expensive, PTQ calibration also requires a lot of memory and video memory, and INT8 and INT4 are what get used

  • There are not many accelerated kernels for this type of model available online and few references, so a lot has to be written by yourself.
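
On the ONNX size restriction mentioned above, a common workaround is to store the weights as external data next to the graph. A minimal sketch (the file names are illustrative):

import onnx

# load an exported llama graph (the file name is an assumption)
model = onnx.load("llama-7b.onnx")

# move the weights out of the protobuf into a side file to get around the 2 GB protobuf limit
onnx.save_model(model, "llama-7b-external.onnx",
                save_as_external_data=True,
                all_tensors_to_one_file=True,
                location="llama-7b-external.data")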

Differences in service methods

For small models, inference is generally not too slow, mostly within 500 ms, so the client just waits a moment and gets the result: an ordinary HTTP request, one result per request.

LLMs, however, produce tokens one by one. If you wait for all the tokens before returning, the user has to wait a long time and the experience is poor, so stream mode is generally used: send a little, return a little, similar to a typewriter effect.
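
As a rough illustration of what stream mode looks like on the model side, here is a minimal sketch using the TextIteratorStreamer utility from recent transformers versions; the model, tokenizer, and input_ids from the earlier snippets are assumed, and an HTTP/SSE or gRPC layer would sit on top of this loop:

from threading import Thread
from transformers import TextIteratorStreamer

# skip_prompt avoids re-sending the prompt; decoded text pieces arrive as generation progresses
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

generation_kwargs = dict(input_ids=input_ids, max_new_tokens=200, streamer=streamer)
Thread(target=model.generate, kwargs=generation_kwargs).start()

for text_piece in streamer:
    # in a real service each piece would be flushed to the client (e.g. one SSE event)
    print(text_piece, end="", flush=True)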

Discussion on deployment plan

This part is what the article mainly wants to talk about. I also want to discuss with you and think through plans together: this is me throwing out a brick to attract jade.

I have no prior experience deploying a large language model like LLAMA. After browsing some open-source repositories and materials, I have a rough idea. In brief, there are several options:

Python-dependent solutions

As with ordinary CV models, a pure Python implementation is definitely the simplest and has the most existing code. We can wrap a service layer directly around an existing open-source implementation, similar to a flask service, but it also needs a certain level of reliability.

So here you can choose the Python backend of triton-inference-server, wrap your own pre- and post-processing around it, and support stream mode (via gRPC).

This is relatively simple to implement; just pay attention to the inputs and outputs. Unlike CV, the input can be text or input_ids, which is different from the uint8 image tensors we usually deal with. The acceleration part is limited to whatever the Python implementation already does, and everything depends on the Python environment.
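
A minimal sketch of what such a model.py for the Triton python backend could look like; the tensor names, generation settings, and model path are assumptions, and stream mode (decoupled responses) is left out for brevity:

import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import LlamaForCausalLM, LlamaTokenizer

class TritonPythonModel:
    def initialize(self, args):
        # model path is an assumption; weights are loaded once per model instance
        path = "/models/llama-7b"
        self.tokenizer = LlamaTokenizer.from_pretrained(path)
        self.model = LlamaForCausalLM.from_pretrained(path, torch_dtype="auto").cuda()

    def execute(self, requests):
        responses = []
        for request in requests:
            # "PROMPT"/"COMPLETION" must match the names declared in config.pbtxt (assumption)
            prompt_tensor = pb_utils.get_input_tensor_by_name(request, "PROMPT")
            prompt = prompt_tensor.as_numpy()[0].decode("utf-8")
            input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids.cuda()
            output_ids = self.model.generate(input_ids, max_new_tokens=200)
            text = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
            out = pb_utils.Tensor("COMPLETION",
                                  np.array([text.encode("utf-8")], dtype=np.object_))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses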


fastertransformer_backend scheme

For production deployment you can use Triton Inference Server, and on top of it there is fastertransformer_backend. fastertransformer_backend supports multiple LLM models; it hand-implements many high-performance operators for transformers, with each model built from handwritten CUDA, so its performance is higher than library-based implementations. The price is that supporting a new model architecture is harder and requires modifying a lot of source code.

NVIDIA Triton introduces multi-GPU, multi-node inference. It uses the model-parallelism techniques below to split a large model across multiple GPUs and nodes:

  • Pipeline (inter-layer) parallelism, which splits contiguous sets of layers across multiple GPUs. This maximizes GPU utilization on a single node.

  • Tensor (intra-layer) parallelism, which splits individual layers across multiple GPUs. This minimizes latency on a single node. (A toy illustration of the intra-layer split follows.)
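
To make the intra-layer split concrete, here is a toy PyTorch sketch of a column-split linear layer. It only shows the math; FasterTransformer implements this with custom CUDA kernels and NCCL communication, not like this:

import torch

def column_parallel_linear(x, weight, devices):
    # y = x @ W, with the columns of W (the output features) split across devices
    shards = torch.chunk(weight, len(devices), dim=1)
    partial = [x.to(d) @ w.to(d) for d, w in zip(devices, shards)]
    # gather the partial outputs back onto one device and concatenate along the feature dim
    return torch.cat([p.to(devices[0]) for p in partial], dim=1)

x = torch.randn(1, 4096)
w = torch.randn(4096, 4096)
devs = ["cuda:0", "cuda:1"] if torch.cuda.device_count() >= 2 else ["cpu", "cpu"]
print(column_parallel_linear(x, w, devs).shape)  # torch.Size([1, 4096])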

Fortunately, there are many capable people in the open-source community, and a recent unofficial PR adds LLAMA support. I tried running it myself, and it is about 20% faster than the Hugging Face implementation in the first scheme. It runs in FP16 and supports multiple cards; INT8 and INT4 are not supported yet.

Divide-and-conquer solutions using acceleration libraries

We know that LLAMA's 7B model contains 32 structurally identical decoder layers:

# transformers/src/transformers/models/llama/modeling_llama.py
self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])

Therefore, we can also split the model and deploy these 32 sub-models with our usual acceleration libraries. For example, the 7B model can be split into sub-models of a bit over 300 MB each, which TensorRT can convert easily, and some people are already doing this:

  • https://github.com/tpoisonooo/llama.onnx

Converting the 7B llama model to ONNX yields roughly the following sub-models:

  • decoder-merge-0.onnx

  • embed.onnx

  • head.onnx

  • norm.onnx
Chained together, these sub-models can also be placed on different graphics cards. For example, with two cards, the first card holds 15 sub-models and the second holds the remaining 17, forming a pipeline parallelism.
There are a few points to note:

  • The acceleration library is not limited to TensorRT; TVM, AITemplate, and others also work

  • A backend is needed to string all the sub-models together, preferably implemented in C++

  • How to store the kv-cache needs to be considered

You can use triton-inference-server to organize the pipeline, and instances of different sub-models can be placed on different GPUs. A rough sketch of the idea is below.
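
As a sketch of chaining the exported sub-models with onnxruntime across two GPUs (the decoder file names follow the list above, while the input/output tensor names, the hidden size, and the missing kv-cache handling are simplifying assumptions):

import numpy as np
import onnxruntime as ort

def make_session(path, device_id):
    # pin each sub-model to a specific GPU through the CUDA execution provider options
    providers = [("CUDAExecutionProvider", {"device_id": device_id})]
    return ort.InferenceSession(path, providers=providers)

# first 15 decoders on GPU 0, the remaining 17 on GPU 1 (file names are assumptions)
decoders = [make_session(f"decoder-merge-{i}.onnx", 0 if i < 15 else 1) for i in range(32)]

hidden = np.random.randn(1, 8, 4096).astype(np.float16)  # [batch, seq_len, hidden]
for sess in decoders:
    # each decoder consumes and produces hidden states; masks, positions, and kv-cache omitted
    hidden = sess.run(None, {"hidden_in": hidden})[0]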

Postscript

Let's leave it here for now; this article will be updated later. For the moment I have simply listed my own ideas. If you have good ideas, you are welcome to discuss them with Lao Pan.

There are too many new things and new technologies released every day; I can't keep up, and this article has been delayed for a long time. I have tried all three of the above solutions myself and they are all feasible; if you are interested, give them a try. Word is that TensorRT already quietly supports multi-card inference, and it may come out as early as next month (June), probably as an external release. I wonder what a large-model version of TensorRT will look like?


Reference links

  • https://github.com/huggingface/text-generation-inference

  • https://github.com/huggingface/chat-ui/issues

  • https://github.com/ELS-RD/transformer-deploy
