GPT Practical Series: ChatGLM3 Local Deployment with CUDA 11 + 1080 Ti + 24 GB GPU Memory

Table of Contents

1. The ChatGLM3 Model

2. Resource Requirements

3. Deployment and Installation

Configuring the Environment

Installation Process

Low-Cost Configuration and Deployment

4. Starting ChatGLM3

5. Functional Testing


Hot off the press: the domestic GPT scene keeps iterating. The Tsinghua team has just released ChatGLM3, and shortly before the Yunqi Conference, Baichuan released Baichuan2-192K, setting off another wave of domestic large AI models.

With funding in place for both companies, domestic large-model research and open-source activity are in full swing. More and more companies and research institutions are open-sourcing their large models; well-known domestic examples include Alibaba's Tongyi series, Huawei's Pangu series, and Tencent's Hunyuan series. However, because these open-source models have enormous parameter counts and compute demands, requiring massive data and computing power, only a few large technology companies and research institutions can develop and maintain them.

Of course, how large models can be deployed and used within small GPU memory budgets also matters, so that the value of AI can extend to a wider range of fields. This practical column continues to focus on small-but-capable large models and will evaluate a series of open-source models, especially domestic ones, with an emphasis on private, lightweight deployment: for example, keeping the GPU resources required for inference within 24 GB of video memory, such as two Titan XP cards or 1080 Ti cards.

1. The ChatGLM3 Model

On October 27, Zhipu AI released its self-developed third-generation dialogue model, ChatGLM3, at the 2023 China National Computer Congress (CNCC). This is the third iteration of the ChatGLM base model Zhipu AI has released this year. To further promote the open-source ecosystem, Zhipu AI announced it will open source multiple models, including ChatGLM3-6B, ChatGLM3-6B-32K, CogVLM-17B, and AgentLM.

ChatGLM3-6B is the third-generation version of the open-source bilingual (Chinese-English) dialogue model ChatGLM-6B, and it drew industry attention as soon as it launched. Built on the latest efficient dynamic inference and memory optimization technology, ChatGLM3-6B offers the following:

  1. More powerful: Compared with the second-generation ChatGLM model, ChatGLM3 performs strongly across 44 public Chinese and English benchmark datasets, ranking first among domestic models of the same size. Evaluations show significant gains on benchmarks such as MMLU, C-Eval, GSM8K, and BBH, with improvements of 36%, 33%, 179%, and 126% respectively, thanks to a newly designed multi-stage enhanced pre-training method, richer training data, and a better training scheme.

  2. More complete: ChatGLM3-6B adopts a newly designed prompt format that, beyond normal multi-turn conversation, natively supports complex scenarios such as tool calling (Function Call), code execution (Code Interpreter), and agent tasks.

  3. More efficient: ChatGLM3 can also be deployed on mobile phones via the end-side models ChatGLM3-1.5B and ChatGLM3-3B. These support a variety of phones and in-vehicle platforms, including vivo, Xiaomi, and Samsung devices, and even support inference on mobile CPU chips at speeds up to 20 tokens/s. In terms of accuracy, the 1.5B and 3B models perform close to ChatGLM2-6B on public benchmarks, offering a convenient path for deploying natural language processing applications on mobile devices and further expanding the reach of NLP technology.

ChatGLM3’s semantic and logical capabilities have been greatly enhanced:

  • The most eye-catching addition is the multi-modal understanding of CogVLM (image recognition plus semantic understanding), which has achieved SOTA on more than 10 international standard image-text evaluation datasets.

  • The code enhancement module, Code Interpreter, generates and executes code according to user needs, automatically completing complex tasks such as data analysis and file processing.

  • The web search enhancement, WebGLM, automatically searches the Internet for information relevant to a question and, when answering, provides references with links to the relevant literature or articles.

  • ChatGLM3 adds new agent capabilities, integrating the self-developed AgentTuning technology to activate the model's intelligent-agent abilities.

  • In intelligent planning and execution, ChatGLM3 improves on the second-generation ChatGLM by 1000%. This unlocks new agent capabilities, letting ChatGLM3 handle more complex scenarios: it natively supports tool invocation, code execution, games, database operations, knowledge-graph search and reasoning, operating-system interaction, and other complex scenarios.

  • The CogVLM model improves Zhipu Qingyan's Chinese image-text understanding, achieving image understanding close to GPT-4V. It can answer various kinds of visual questions and perform complex object detection and labeling to automate data annotation.

  • With search enhancement, it can gather links to online literature or articles relevant to a question and organize the answers, meaning Zhipu Qingyan can offer users better natural language processing services.

ChatGLM has also done a lot of work to support domestic chips. Since early 2022, the ChatGLM series has supported large-scale pre-training and inference on Ascend, Sunway supercomputers, and the Hygon DCU architecture. To date, more than 10 kinds of domestic hardware are supported, an ecosystem that includes Ascend, Sunway, Hygon DCU, Hexaflake, MetaX, Sophgo, Iluvatar CoreX, Cambricon, Moore Threads, Baidu Kunlunxin, Lynxi Technology, Great Wall SuperCloud, and others. Through joint innovation with domestic chip companies, the performance of the ChatGLM series continues to be optimized, and the domestic hardware ecosystem has formed a closed loop for localized large models.

2. Resource Requirements

Code download:

GitHub: git clone https://github.com/THUDM/ChatGLM3

Model files and how to obtain them

ChatGLM3's 6B bilingual (Chinese-English) dialogue model comes in three variants: ChatGLM3-6B and ChatGLM3-6B-Base, both with an 8K context length, and ChatGLM3-6B-32K with a 32K context length. Choose according to your situation; see the table below for where to obtain each model.

Model              Seq Length   Download
ChatGLM3-6B        8K           HuggingFace | ModelScope
ChatGLM3-6B-Base   8K           HuggingFace | ModelScope
ChatGLM3-6B-32K    32K          HuggingFace | ModelScope

GPU resource requirements for inference

Measured on simple test examples:

Quantization Level   Minimum VRAM (generating at 8192 length)
FP16                 15.9 GB
INT8                 11.1 GB
INT4                 8.5 GB

Open-source project to accelerate ChatGLM3:

  • chatglm.cpp: a quantized inference acceleration solution similar to llama.cpp, enabling real-time conversation on a laptop

3. Deployment and Installation

Configuring the Environment

This walkthrough deploys the non-quantized version on ordinary local hardware.

GPU: dual NVIDIA Titan XP, 12 GB each, 24 GB in total

Operating system: Ubuntu 20.04

Python: 3.10 (3.8 or above recommended)

PyTorch: 2.0.1 (2.0 or above recommended)

CUDA: 12.0 with driver 525 (CUDA 11.4 or above recommended)
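
Once the environment is set up, you can confirm that PyTorch actually sees both GPUs. A minimal check using only standard PyTorch calls:

import torch

print(torch.__version__)            # expect 2.0 or above
print(torch.version.cuda)           # CUDA version PyTorch was built against
print(torch.cuda.is_available())    # should be True
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB")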

Installation Process

Create a virtual environment:

conda create -n chatglm3 python==3.10.6 -y
conda activate chatglm3

Clone ChatGLM3 and install its dependencies:

git clone https://github.com/THUDM/ChatGLM3
cd ChatGLM3
pip install -r requirements.txt -i https://mirror.sjtu.edu.cn/pypi/web/simple

The recommended version of the transformers library is 4.30.2, and torch 2.0 or above is recommended for the best inference performance.

Download the model files

The model weights, source code, and configuration required for inference are published on ModelScope and Hugging Face; see the download links above.

By default, the code automatically downloads the model implementation and parameters via transformers; the complete model lives on the Hugging Face Hub.

The model weights are large, so on a poor network connection the download may take a long time or even fail. If access to Hugging Face is slow, download from the domestic ModelScope mirror or manually from Tsinghua Cloud, and place the files in a local directory such as THUDM/chatglm3-6b. To load the model from disk, replace "THUDM/chatglm3-6b" in the code with the local path of your ChatGLM3 model folder.
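
For example, here is a minimal loading sketch; the local path is illustrative, and the pattern otherwise mirrors the standard transformers usage from the ChatGLM3 README:

from transformers import AutoTokenizer, AutoModel

MODEL_PATH = "/data/models/chatglm3-6b"  # hypothetical local path; use "THUDM/chatglm3-6b" to pull from the Hub

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True).half().cuda()  # FP16 weights, ~16 GB of VRAM
model = model.eval()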

Low-Cost Configuration and Deployment

Quantized model loading

By default, the model is loaded at FP16 precision, which requires approximately 16 GB of video memory to run the above code.

If video memory is insufficient, you will see an error like:

torch.cuda.OutOfMemoryError: CUDA out of memory.

If GPU memory is limited, you can try loading the model in a quantized form, e.g. INT8. This path goes through bitsandbytes, which handles device placement itself, so do not call .cuda() afterwards:

model = AutoModel.from_pretrained("THUDM/chatglm3-6b", load_in_8bit=True, trust_remote_code=True, device_map="auto")

Model quantization brings some performance loss, but testing shows that ChatGLM3-6B can still generate naturally and fluently under 8-bit quantization.
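
Alternatively, the ChatGLM model code ships its own quantization helper; the ChatGLM3 README shows loading with on-the-fly 4-bit quantization (the FP16 checkpoint is downloaded and quantized at load time):

model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).quantize(4).cuda()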

Multi-GPU deployment and loading

If you have multiple GPUs but no single GPU has enough memory for the complete model, the model can be split across GPUs. First install accelerate (pip install accelerate), then load the model as follows:

from utils import load_model_on_gpus
model = load_model_on_gpus("THUDM/chatglm3-6b", num_gpus=2)

This deploys the model across two GPUs for inference. Change num_gpus to the number of GPUs you wish to use. The default is an even split; you can also pass a device_map parameter to control the placement yourself, as in the sketch below.
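
Putting it together, a quick end-to-end sanity check on two GPUs; this assumes you run it from the repository root, since load_model_on_gpus lives in the repo's utils.py:

from transformers import AutoTokenizer
from utils import load_model_on_gpus  # utils.py ships with the ChatGLM3 repository

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
model = load_model_on_gpus("THUDM/chatglm3-6b", num_gpus=2)
model = model.eval()

response, history = model.chat(tokenizer, "你好", history=[])
print(response)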

CPU deployment

Without GPU hardware, inference can also run on the CPU, though much more slowly. Usage is as follows (requires roughly 32 GB of RAM or more):

model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).float()

If RAM is insufficient, you can load a quantized model. Note, however, that 8-bit loading via bitsandbytes targets CUDA GPUs, so on a pure-CPU machine the chatglm.cpp route below is usually the more practical option:

model = AutoModel.from_pretrained("THUDM/chatglm3-6b", load_in_8bit=True, trust_remote_code=True).float()

The open-source project chatglm.cpp can accelerate ChatGLM3 on the CPU as well:

chatglm.cpp: a quantized inference acceleration solution similar to llama.cpp, enabling real-time conversation on a laptop

Running the quantized model on the CPU requires gcc and OpenMP, which are installed by default on most Linux distributions. On Windows, check the OpenMP option when installing TDM-GCC. The tested Windows environment used TDM-GCC 10.3.0; the Linux environment used gcc 11.3.0.
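
As a sketch of that route: per the chatglm.cpp README at the time of writing, you convert the Hugging Face checkpoint into a quantized GGML file and then chat against it from the built binary (treat the exact flags as subject to change upstream):

python3 chatglm_cpp/convert.py -i THUDM/chatglm3-6b -t q4_0 -o chatglm3-ggml.bin
./build/bin/main -m chatglm3-ggml.bin -i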

4. Starting ChatGLM3

Command-line dialog interface

python cli_demo.py

The program runs an interactive conversation in the terminal. Type an instruction and press Enter to generate a reply; type clear to clear the conversation history, and stop to exit the program.

Web demo

python web_demo.py

This command starts a Gradio-based web demo. A web service starts locally; paste the address printed to the console into your browser to access it.

You can also start a Streamlit-based web demo with the following command:

streamlit run web_demo2.py

The demo runs a web server and prints its address; open that address in a browser to use it. In testing, the Streamlit-based demo runs more smoothly.
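
If you need to reach the demo from another machine, Streamlit's own standard flags can bind the listen address and port (these flags belong to Streamlit itself, not the ChatGLM3 repo):

streamlit run web_demo2.py --server.address 0.0.0.0 --server.port 8501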

API Deployment

The repository implements an OpenAI-format streaming API that can serve as the backend for any ChatGPT-based application. Deploy it by running openai_api.py from the repository:

python openai_api.py

Sample code for calling the API (written against the pre-1.0 openai Python SDK):

import openai
if __name__ == "__main__":
    openai.api_base = "http://localhost:8000/v1"
    openai.api_key = "none"
    for chunk in openai.ChatCompletion.create(
        model="chatglm3-6b",
        messages=[
            {"role": "user", "content": "Hello"}
        ],
        stream=True
    ):
        if hasattr(chunk.choices[0].delta, "content"):
            print(chunk.choices[0].delta.content, end="", flush=True)
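
For a non-streaming call against the same endpoint (same pre-1.0 SDK; the response follows the OpenAI chat-completions layout):

import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"  # placeholder; the local demo server is assumed not to check it

response = openai.ChatCompletion.create(
    model="chatglm3-6b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)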

5. Functional Testing

Self-awareness question: Who are you?

User: Hello

ChatGLM: Hello! I'm the artificial intelligence assistant ChatGLM3-6B. Nice to meet you; feel free to ask me any questions.

User: What are the fun places in Europe?

ChatGLM: Europe is a vast country with many interesting places. Here are some recommended European tourist attractions for you to visit:

1. France: Paris, Eiffel Tower, Louvre Museum, Lyon, Marseille, Champagne wine producing areas, etc.

2. Italy: Rome, Colosseum, Vatican, Florence, Venice, Naples, etc.

3. United Kingdom: London, Big Ben, Tower Bridge, Thames River, Manchester, Edinburgh, etc.

4. Germany: Berlin, Berlin Wall, Berlin Cathedral, Cologne, Munich, etc.

5. Spain: Barcelona, Seine River, Catalonia Plains, Madrid, Granada, etc.

6. Portugal: Lisbon, Belem Air Castle, Sintra, etc.

7. Netherlands: Amsterdam, windmills, Haarlem, canals in the center of Amsterdam, etc.

8. Norway: Oslo, Brittvik, Tromso, Niedroder, etc.

9. Sweden: Stockholm, Karolinska AI, Gothenburg, Sweden, etc.

10. Finland: Helsinki, Arctic Circle, Lake Finland, Kemi and other places.

The above are just some of the many tourist attractions in Europe. You can also choose other places to travel according to your interests and time.

User: Check what the weather is like in Paris

ChatGLM: I cannot check the weather in real time. I suggest you check the local weather forecast or use a weather app to get the latest Paris weather information.

Now for some fun: let's have ChatGLM at different quantization levels write poems and see what they can do.

Poem written by the default version

By default, no quantization parameters are added.

User: Help write a prose poem with the theme of taking root. One day, we will become towering trees.

ChatGLM: Take root in the earth and absorb nutrients,
In the baptism of years, we grow.
From buds to branches, from weak to tough,
We slowly become a towering tree.

Our roots are deep in the soil,
The veins are vertical and horizontal, absorbing nutrients from the earth.
Our branches are lush,
The greenery adds color to the earth.

The storm is coming, we hold on,
The earthquake comes, but we remain standing.
We are like a mountain,
Standing firm, supporting the weight of the earth.

One day, we will become towering trees,
Our branches are strong, our leaves are leafy and our roots are deep.
We will become the guardians of the earth,
Provide shelter for the land we love.

Taking root is a kind of faith,
It is a kind of persistence and a kind of responsibility.
We will continue to grow,
As time goes by, it becomes stronger.

One day, we will become towering trees,
Our influence and value,
Will be remembered by future generations.
We will always be rooted in the earth,
Bringing hope of life to the earth.

Poem written by the BF16 version

User: Help write a prose poem with the theme of taking root. One day, we will become towering trees.

ChatGLM: Take root in the earth and absorb nutrients,
The vicissitudes of time pass by,
Our bodies gradually grow stronger,
Just like the earth resists the wind and sand, it is tenacious in life.

From childhood to adulthood, we grew up with you,
From buds to branches, from innocence to maturity,
In the storm of life, we move forward with determination,
No matter how hard it is, we never give in.

One day, we will become towering trees,
Our roots go deep into the soil,
Our branches and leaves are lush and prosperous,
Our strength supports the sky.

We will accompany the next generation to grow up,
Our seeds inherit hope,
Our roots are always rooted in the earth,
Our soul is always connected to the sky.

Taking root is a kind of faith and a kind of persistence.
It is a kind of strength and a kind of courage,
We will become towering trees, firm and majestic,
Our lives will always be rooted in the earth.

Poem written by the INT8 quantized version

User: Help write a prose poem with the theme of taking root. One day, we will become towering trees.

ChatGLM: Take root in the earth and absorb nutrients,
In the long river of time, we grow.
From childhood to adulthood, we have experienced ups and downs,
Never flinch, never fear, because we know,
One day, we will become towering trees.

Our roots are deep in the earth,
Absorbing sunlight and rain,
In fertile soil, thrive.
We work hard to grow,
Until it grows into a big tree that reaches into the sky,
Add a touch of green vitality to the earth.

Towering trees, we are more than just a miracle of nature;
It is also a great testimony of life.
Our roots are deep in the earth,
Carrying countless stories and hopes,
Like the guardian of the earth,
Forever standing.

We will become towering trees,
For the land we love,
For the life we cherish,
For the goal we strive for,
Provide unwavering support and shelter.

One day, we will become towering trees,
Become the most dazzling scenery on earth,
Our roots, our branches,
We will build a better future together.

Poem written by the INT4 quantized version

User: Help write a prose poem with the theme of taking root. One day, we will become towering trees.

ChatGLM: Take root deep in the earth,
Our root system is strong and steady.
Like a big tree, we also need to grow,
The branches and leaves are luxuriant and become the pillar material.

In the soil of life, we absorb nutrients,
In the sunshine of the years, we thrive.
In the wind and rain, we stand firm,
No matter how high the sky is, we can reach it.

Our roots go deep into the earth,
Absorbing the nutrients of life, it grows stronger day by day.
One day, we will become towering trees,
Provide shelter and support for the world and mankind.

Like a big tree, we also need to grow,
Become a witness to life and a symbol of beauty.
Our roots go deep into the earth,
Absorbing the nutrients of life, it grows stronger day by day.

One day, we will become towering trees,
Provide shelter and support for the world and mankind.

Writing poetry is a relatively easy task, but you can still see some semantic distortion in the sentences after quantization.

Which version of the poem do you like best?


End

Column articles:

GPT Practical Series: P-Tuning Localized Training of ChatGLM2 and Other LLM Models, What Exactly Does It Do? (Part 1)

GPT Practical Series: Interpreting the Fine-Tuning Training Parameters of the ChatGLM2 Model

GPT Practical Series: How to Fine-Tune the ChatGLM2 Model with Your Own Data

GPT Practical Series: ChatGLM2 Deployment with Ubuntu + CUDA 11 + 24 GB Video Memory

GPT Practical Series: Baichuan2 Localized Deployment Plan

GPT Practical Series: ChatGLM3 Local Deployment with CUDA 11 + 1080 Ti + 24 GB GPU Memory