ChatGLM-6B installation guide (NVIDIA graphics card under Windows; comparable to GPT-3.5), worth bookmarking!

Contents

1. Foreword

2. Download (recommended: let the Python program download the files directly, which is the latest and most stable)

3. Deployment

3.1 Configuring the environment

3.2 Start the demo program

3.2.1 Start cli_demo.py

3.2.2 Start web_demo.py

3.2.3 Start web_demo2.py

4. [Latest] ChatGLM-6B-int4 version tutorial

4.1 Download

4.2 Configuring the environment (NVIDIA graphics cards under Windows)

4.3 Start the demo program

5. Summary


Reprinted from: A Simple Deployment Tutorial for Tsinghua's ChatGLM-6B Chinese Dialogue Model, by —Olive— (CSDN Blog)

1. Foreword

Recently, Tsinghua University open-sourced ChatGLM-6B, a small-parameter version of its Chinese dialogue model (GitHub address: https://github.com/THUDM/ChatGLM-6B). It can be deployed on a single-GPU personal computer, and with INT4 quantization it can run on machines with as little as 6 GB of VRAM; of course, it can also run on a CPU.
As general-purpose dialogue based on large language models has boomed, the models' enormous parameter counts have meant that they can only be served online, or through API interfaces, on big companies' platforms. The open-sourcing of ChatGLM-6B and its deployability on personal computers are therefore significant.
After testing, the blogger found that compared with other models of the same parameter count on Hugging Face, ChatGLM-6B already performs very well, not to mention that there is also a 130B version which, according to the official blog (https://chatglm.cn/blog), is better than GPT-3.5 (the 130B version is in closed beta, and the blogger has not obtained access, so this cannot be confirmed). So it is still great fun to deploy ChatGLM-6B on a personal computer or server; at this parameter count, what more could you ask for?

[Latest update] ChatGLM-6B added a quantized INT4 model in its 2023/03/19 update, and the officially quantized model is available for direct download. Compared with quantizing the original weights yourself, the result is better, and the model is only about 4 GB, which greatly speeds up downloading. If you only have a CPU, or only 6 GB of VRAM, you can download and deploy the quantized model directly. Chapter 4 of this article is a standalone deployment tutorial for the ChatGLM-6B-int4 version; if that is what you need, jump straight to Chapter 4 and ignore the earlier content. Hugging Face address: https://huggingface.co/THUDM/chatglm-6b-int4

2. Download (recommended: let the Python program download the files directly, which is the latest and most stable)

1. The model files need to be downloaded from Hugging Face: THUDM/chatglm-6b at main.
Click [Files and versions] and download the files. It is recommended to download them into a fresh folder: for example, make the top-level folder ChatGLM and put the model files in a model subfolder, so the overall structure is …/ChatGLM/model.
2. If the larger model files (over 1 GB) download slowly, you can download them separately from a domestic source, the Tsinghua University cloud disk (files not available there still need to be downloaded from Hugging Face).
3. After the download is complete, make sure that all of the model files from the repository are in the model folder (for example, stored in …/ChatGLM/model).

4. Download the environment-configuration files and demo code from GitHub (GitHub – THUDM/ChatGLM-6B: An Open Bilingual Dialogue Language Model) into the directory …/ChatGLM/.
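
As the section title suggests, you can also let a small Python program fetch the model files for you. A minimal sketch using huggingface_hub (this assumes a recent huggingface_hub release in which snapshot_download accepts local_dir):

    # Download the model repo straight into .../ChatGLM/model
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="THUDM/chatglm-6b",   # use "THUDM/chatglm-6b-int4" for the quantized model
        local_dir="model",            # run from .../ChatGLM/ so this is .../ChatGLM/model
    )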

3. Deployment

To deploy the model locally, you need to install the required libraries into your Python environment; for GPU use, you also need a matching version of CUDA and the corresponding PyTorch build. After editing the demo files, the demos can be started.

3.1 Configuring the environment

1. Install the CUDA version matching your GPU. There are many tutorials online, so they are not repeated here. (Skip this step if you only have a CPU.)
2. Based on the CUDA version installed in the previous step, download and install the corresponding PyTorch build. Again, there are many tutorials online. (If you only have a CPU, install the CPU build of PyTorch.)
3. Once the two steps above are done, open a command-line terminal in the …/ChatGLM/ directory and enter:

pip install -r requirements.txt

After pressing Enter, pip will automatically download and install the required dependencies.
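
Before moving on, it is worth confirming that the PyTorch build you installed actually sees your GPU. A quick check using standard torch calls:

    import torch

    print(torch.__version__)
    print(torch.cuda.is_available())          # should print True on a working NVIDIA setup
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))  # the name of your graphics card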

3.2 Start the demo program

There are three demo programs in the …/ChatGLM/ directory: (1) cli_demo.py, which does question-and-answer directly in the command line; (2) web_demo.py, which uses the gradio library to serve a question-and-answer web page; and (3) web_demo2.py, a streamlit-based web demo.

The first demo is convenient and can clear the history, but it is easy to type stray characters into the command line (especially on Linux), which can make the program stop unexpectedly.

The second demo has a simple interface, but the record cannot be cleared, and if it is used on a Linux server without a graphical interface, the port has to be mapped to your local machine before opening it in a browser. My personal suggestion: if you are able to, combine the two and write your own. For example, Jupyter combines them well, and you can render the output as Markdown so code and formulas look better, as in the sketch below.
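
A rough sketch of that idea in Jupyter, assuming tokenizer and model are already loaded as in cli_demo.py (model.chat is the method the official demos call):

    from IPython.display import Markdown, display

    history = []
    while True:
        query = input("You: ")
        if query.strip().lower() in ("exit", "quit"):
            break
        # chat() returns the reply plus the updated conversation history
        response, history = model.chat(tokenizer, query, history=history)
        display(Markdown(response))  # render code and formulas nicely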

3.2.1 Start cli_demo.py

  1. Modify the model path. Edit cli_demo.py and change the model folder path on lines 5 and 6, replacing the original "THUDM/ChatGLM-6B" with "model". (Here model is the model folder created in Chapter 2.)
  2. Enable quantization if needed. If you have more than 14 GB of VRAM, skip this step and run unquantized. If you only have 6 GB or 10 GB of VRAM, add quantize(4) or quantize(8) respectively to line 6, as follows:
    # 4-bit quantization: runs in about 6 GB of VRAM
    model = AutoModel.from_pretrained("model", trust_remote_code=True).half().quantize(4).cuda()

    # 8-bit quantization: runs in about 10 GB of VRAM
    model = AutoModel.from_pretrained("model", trust_remote_code=True).half().quantize(8).cuda()
  3. Run the Python file: in a command-line terminal, enter
    python cli_demo.py
    and the demo starts, ready to use!
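
For reference, after step 1 the two loading lines in cli_demo.py look roughly like this (a sketch; the exact line numbers may drift between versions of the script):

    from transformers import AutoTokenizer, AutoModel

    # Lines 5-6 of cli_demo.py, now pointing at the local "model" folder
    tokenizer = AutoTokenizer.from_pretrained("model", trust_remote_code=True)
    model = AutoModel.from_pretrained("model", trust_remote_code=True).half().cuda()
    model = model.eval()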

3.2.2 Start web_demo.py

  1. Install the gradio library. Open a command-line terminal in the ChatGLM directory and enter:
    pip install gradio
    This installs the library the demo needs.
  2. Modify the model path. Edit web_demo.py and change the model folder path on lines 4 and 5, replacing the original "THUDM/ChatGLM-6B" with "model".
  3. Enable quantization if needed. If you have more than 14 GB of VRAM, skip this step and run unquantized. If you only have 6 GB or 10 GB of VRAM, add quantize(4) or quantize(8) respectively to line 5, as follows:
    # 4-bit quantization: runs in about 6 GB of VRAM
    model = AutoModel.from_pretrained("model", trust_remote_code=True).half().quantize(4).cuda()

    # 8-bit quantization: runs in about 10 GB of VRAM
    model = AutoModel.from_pretrained("model", trust_remote_code=True).half().quantize(8).cuda()
  4. Run the Python file: in a command-line terminal, enter
    python web_demo.py
    and the demo starts, ready to use!
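
If you run web_demo.py on a remote server, an alternative to port mapping is to bind gradio to all network interfaces. A sketch of the launch call at the bottom of web_demo.py (assuming the gradio object is named demo as in recent versions of the script; server_name and server_port are standard gradio launch parameters):

    # Bind to all interfaces so the page is reachable from another machine,
    # then browse to http://<server-ip>:7860
    demo.queue().launch(server_name="0.0.0.0", server_port=7860, share=False)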

3.2.3 Start web_demo2.py

# web_demo2.py adds a streamlit-based demo with a better UI.
# Install streamlit and the streamlit-chat component first:
pip install streamlit
pip install streamlit-chat
# then run it with the following command:
streamlit run web_demo2.py --server.port 6006

Everything else (model path, quantization) is the same as in the sections above.

4. [Latest] ChatGLM-6B-int4 version tutorial

ChatGLM-6B-INT4 is a quantized version of the ChatGLM-6B weights. Specifically, the 28 GLM blocks in ChatGLM-6B are quantized to INT4, while the embedding and LM head are left unquantized. In theory, the quantized model can run inference within 6 GB of VRAM (or on the CPU, in RAM), and it may even be possible to run it on embedded devices such as a Raspberry Pi.

4.1 Download

  1. Open the ChatGLM-6B GitHub page (GitHub – THUDM/ChatGLM-6B: An Open Bilingual Dialogue Language Model) and download all of its files into the folder …/ChatGLM/.
  2. Create a new folder …/ChatGLM/model under …/ChatGLM/. Open the Hugging Face page (THUDM/chatglm-6b-int4 at main), and download all of the ChatGLM-6B-int4 model files into the …/ChatGLM/model directory.
    Note: click the download arrow next to each file to download it.

  3. At this point all files are downloaded: the top-level folder …/ChatGLM/ contains the demo and environment-configuration code, and its subfolder …/ChatGLM/model contains the model files.
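
A quick sanity check that everything landed where the demos expect it (a sketch; the exact file list follows the Hugging Face repo and may change between revisions):

    import os

    # Run from the .../ChatGLM/ directory; prints the files downloaded into model/
    print(sorted(os.listdir("model")))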

4.2 Configuring the environment (NVIDIA graphics cards under Windows)

  1. If you do not have 6 GB of VRAM, you will run on the CPU. The model automatically compiles a CPU kernel for your hardware, so make sure GCC and OpenMP are installed (Linux generally has them already; on Windows you need to install them manually) for the best parallel performance. You can follow either of the guides below to install GCC and OpenMP (the second is recommended as faster); a small check sketch follows this list:
    Installation and use of GCC under Windows (Maruko loves to learn!, CSDN Blog)
    OpenMP Windows/macOS Configuration Guide (Dreamer_bLue, CSDN Blog)
  2. If you do have 6 GB of VRAM, install the CUDA version matching your graphics card, then install the PyTorch build matching that CUDA version. There are many tutorials online for installing CUDA and PyTorch, so they are not repeated here. (Skip this step if you are running on CPU.)
  3. To install the dependencies, open a command-line terminal in the …/ChatGLM/ directory, enter
    pip install -r requirements.txt

    and press Enter; pip will automatically download and install the required libraries.

  4. My computer has an Intel graphics card rather than an NVIDIA one. Can I use CUDA?

    No. CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA. It is based on NVIDIA’s GPU architecture and can only run on NVIDIA graphics cards. Therefore, if your computer has an Intel graphics card instead of an NVIDIA graphics card, you cannot use CUDA.

    However, if you need parallel computing on a computer with an Intel graphics card, you can consider other parallel computing platforms and programming models, such as OpenCL, OpenACC, or OpenMP. These run on computers with Intel graphics and can also be used for parallel computing to accelerate various applications.

    NVIDIA graphics driver download: https://www.nvidia.com/Download/index.aspx
    NVIDIA CUDA download: CUDA Toolkit 12.1 Downloads | NVIDIA Developer
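
As mentioned in step 1, here is a tiny check that a C compiler is on the PATH before the int4 model tries to compile its CPU kernel at load time (a sketch; the compilation itself is handled by the model code):

    import shutil

    # The int4 model compiles a CPU kernel when it runs without a GPU,
    # so gcc (with OpenMP support) must be discoverable on the PATH.
    print(shutil.which("gcc") or "gcc not found - install GCC/OpenMP first")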

4.3 Start the demo program

The content of this section is essentially the same as Section 3.2, so refer there for the details. One difference to note:
Unlike Section 3.2, you can skip step 2 of 3.2.1 and step 3 of 3.2.2 entirely, because the model is already quantized and must not be quantized again; see the sketch below.
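
For reference, a sketch of the resulting loading line, mirroring the official demos (note: no .quantize() call):

    from transformers import AutoModel

    # GPU, about 6 GB of VRAM -- the weights are already INT4
    model = AutoModel.from_pretrained("model", trust_remote_code=True).half().cuda()
    # CPU-only alternative:
    # model = AutoModel.from_pretrained("model", trust_remote_code=True).float()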

5. Summary

After using ChatGLM-6B, Wenxin Yiyan, and ChatGPT for a while, I find that the gap between the first two and ChatGPT in text dialogue is not large; in code generation there is still a certain gap, though both are better than GPT-3.5.
Wenxin Yiyan is better than ChatGLM-6B in most cases, but keep in mind that ChatGLM-6B has only 6 billion parameters and can be deployed on a single GPU, which makes it very promising; the team has also said that beyond INT4 quantization the model will be compressed further.
Overall, ChatGLM-6B outperforms other dialogue models at the same parameter scale, and can be deployed on a personal computer or with Huawei's free GPU offering. After a few days of use, ChatGLM-6B is a very pleasant surprise among dialogue models, so I recommend deploying it and playing with it. You could even consider deploying it on embedded devices; I look forward to further extreme compression from the team.
Finally, I hope ChatGLM and Wenxin Yiyan keep improving. My recent experience suggests the teams are updating and refining daily, which shows a very positive attitude.

