1 Preface
Recently I have been busy with "alchemy" (AIGC work), and suddenly found that business code is boring…
Last time I published an article on AI drawing. Although ChatGPT itself cannot be self-hosted, there are many open-source LLMs, and as long as you have a suitable graphics card, deploying an LLM locally is no problem.
This article introduces the local deployment of the following two domestic open-source LLMs:
- ChatGLM-6B
- MOSS
This article just gets the models running; I will keep publishing my recent exploration notes in the LLM direction~
2 Concepts
Before we start, let’s look at some basic concepts.
AIGC
Quoting from MBAlib:
AIGC (AI Generated Content) is content generated by artificial intelligence, also known as "Generative AI". It is considered a new content creation method following Professionally Generated Content (PGC) and User Generated Content (UGC).
Internet content production has gone through the PGC-UGC-AIGC progression. PGC (Professionally Generated Content) is professionally produced content, such as text and video produced by professionals in the Web 1.0 and broadcasting industries, characterized by professionalism and guaranteed content quality. UGC (User Generated Content) is user-generated content, born with the Web 2.0 concept, characterized by users freely uploading content and by its richness. AIGC (AI Generated Content) is content generated by AI, characterized by automated production and high efficiency. With the maturing of natural language generation (NLG) technology and AI models, AIGC has gradually attracted attention; it can now automatically generate text, images, audio, video, and even 3D models and code.
The recently popular ChatGPT and AI drawing tools both fall into this field.
LLM
Quoting from Wikipedia:
A large language model (LLM) is a language model consisting of a neural network with many parameters (typically billions of weights or more), trained on large quantities of unlabeled text using self-supervised learning or semi-supervised learning. LLMs emerged around 2018 and perform well at a wide variety of tasks. This has shifted the focus of natural language processing research away from the previous paradigm of training specialized supervised models for specific tasks.
In Chinese it is called a "large language model" (大语言模型). The now wildly popular ChatGPT is the flagship example of an LLM. Large models have one key attribute: the parameter count. The number of parameters determines the capability of a large model (not absolutely, but it is definitely positively correlated).
The following are the parameter counts of some common LLMs:
| LLM name | Parameter count |
| --- | --- |
| ChatGPT 3.5 | 175B |
| ChatGLM | 6B |
| MOSS | 16B |
| LLaMA | 7B/13B/33B/65B |
For reasons of space only these few are listed; more can be found in the reference materials at the end of the article.
3 Build environment
Hardware
First, you need a Linux server or computer equipped with an NVIDIA graphics card.
The card needs 8GB of video memory or more, otherwise it won't work~
The latest Ubuntu (22.04) or one of its derivatives is recommended as the system. Below are the two server configurations I used during testing.
Server 1
- CPU: Intel(R) Core(TM) i9-10940X CPU @ 3.30GHz
- Memory: 64G
- Graphics card: NVIDIA GeForce RTX 2080 Ti
Server 2
- CPU: Intel(R) Xeon(R) Gold 5318Y CPU @ 2.10GHz x2
- Memory: 128G
- Graphics card: Tesla T4 x4
Software
After talking about the hardware, let’s look at the software.
Driver
First you need the graphics card driver. Installing one on Ubuntu-based distributions is easier than drinking water, which is why Ubuntu is recommended for alchemy.
PS: It really is one click. There is no need to follow those blog posts copied hundreds of times around the Internet that have you download a pile of things, compile them, and blacklist nouveau~
On the Ubuntu desktop edition, you can install the graphics card driver with one click directly from the "Software Updater" app.
On the Ubuntu server edition, use the `nvidia-detector` command to detect which driver version should be installed, for example:

```
$ nvidia-detector
nvidia-driver-530
```
Use `ubuntu-drivers list` to get the list of installable drivers, for example:

```
$ ubuntu-drivers list
nvidia-driver-418-server, (kernel modules provided by nvidia-dkms-418-server)
nvidia-driver-530, (kernel modules provided by linux-modules-nvidia-530-generic-hwe-22.04)
nvidia-driver-450-server, (kernel modules provided by linux-modules-nvidia-450-server-generic-hwe-22.04)
nvidia-driver-515, (kernel modules provided by linux-modules-nvidia-515-generic-hwe-22.04)
nvidia-driver-470-server, (kernel modules provided by linux-modules-nvidia-470-server-generic-hwe-22.04)
nvidia-driver-525-server, (kernel modules provided by linux-modules-nvidia-525-server-generic-hwe-22.04)
nvidia-driver-515-server, (kernel modules provided by linux-modules-nvidia-515-server-generic-hwe-22.04)
nvidia-driver-510, (kernel modules provided by linux-modules-nvidia-510-generic-hwe-22.04)
nvidia-driver-525, (kernel modules provided by linux-modules-nvidia-525-generic-hwe-22.04)
nvidia-driver-470, (kernel modules provided by linux-modules-nvidia-470-generic-hwe-22.04)
```
Then install the driver with `ubuntu-drivers install`, for example:

```
$ ubuntu-drivers install nvidia-driver-530
All the available drivers are already installed.
```
It’s that simple
PS: You can of course also download the driver yourself from the NVIDIA official website; see the reference materials for details.
Python
For AI work Python is a must, but we do not use the system's Python environment directly; instead we manage environments with conda.
It is recommended to use miniconda3 which is lighter than anaconda.
After installing miniconda3 following the instructions on its official website, you only need the following command to create a Python environment of a specific version:

```
conda create -n <env-name> python=3.10
```
If you run into network problems, you can refer to my previous article on configuring domestic mirrors: configure pip domestic mirrors to speed up installing Python third-party libraries~
4 ChatGLM-6B
Introduction
This is an open-source LLM from Tsinghua University and Zhipu AI. As of this writing, it can be regarded as the ceiling among domestic open-source LLMs~
ChatGLM-6B is an open-source, Chinese-English bilingual conversational language model based on the General Language Model (GLM) architecture with 6.2 billion parameters. Combined with model quantization, it can be deployed locally on consumer-grade graphics cards (as little as 6GB of video memory at the INT4 quantization level). ChatGLM-6B uses technology similar to ChatGPT, optimized for Chinese Q&A and dialogue. Trained on about 1T tokens of Chinese and English corpus, and supplemented with supervised fine-tuning, feedback bootstrapping, and reinforcement learning from human feedback, the 6.2-billion-parameter ChatGLM-6B can already generate answers quite in line with human preferences.
Hardware requirements
| Quantization level | Min GPU memory (inference) | Min GPU memory (parameter-efficient fine-tuning) |
| --- | --- | --- |
| FP16 (no quantization) | 13 GB | 14 GB |
| INT8 | 8 GB | 9 GB |
| INT4 | 6 GB | 7 GB |
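The inference numbers in the table can be sanity-checked with simple arithmetic: the weights alone take parameter count × bits per weight ÷ 8 bytes. A small sketch (the gap between these raw numbers and the table is my own explanation, not from the ChatGLM docs):

```python
# Rough lower bound on the GPU memory needed just to hold the weights
# of a 6.2B-parameter model at each quantization level. Actual usage is
# higher (activations, KV cache, CUDA context), which is why the table
# above lists 13/8/6 GB rather than these raw numbers.

def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """GiB needed to store n_params weights at bits_per_param each."""
    return n_params * bits_per_param / 8 / 2**30

N_PARAMS = 6.2e9  # ChatGLM-6B parameter count
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_memory_gib(N_PARAMS, bits):.1f} GiB for weights")
```

This gives roughly 11.5 / 5.8 / 2.9 GiB for the weights, consistent with the table once runtime overhead is added on top.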
Local deployment
Download project code
```
git clone https://github.com/THUDM/ChatGLM-6B.git
```
PS: You can also use my fork, which mainly makes the following modifications:
- Multi-card acceleration enabled by default for deployment and model fine-tuning
- A rewritten, more intuitive API interface

Just clone it with the following command instead:

```
git clone https://github.com/Deali-Axy/ChatGLM-6B.git
```
Create a virtual environment
conda management is recommended:

```
conda create -n chatglm python==3.8
```
Install dependencies
```
cd ChatGLM-6B
conda activate chatglm
pip install -r requirements.txt
conda install cudatoolkit=11.7 -c nvidia
```
PS: If `cudatoolkit` is not installed, a `RuntimeError: Library cudart is not initialized` error will be reported. However, some people in the issues say this can also be solved by exporting the quantized model on the CPU and calling it directly; I have not tried that yet.
Issues address: https://github.com/THUDM/ChatGLM-6B/issues/115
Download model and launch
There are two kinds of demos in the project code: command line and web interface. Choose any one to run, and the program will automatically download the pre-trained model from huggingface.
PS: In theory the model can be downloaded from huggingface directly; if you run into network problems, use a proxy or download the model from the official cloud drive.
```
# command line demo
python cli_demo.py

# simple web interface implemented with Gradio
python web_demo.py
```
Gradio's default port is 7860; you can customize it by passing the `server_port` parameter to the `launch()` method.
Using quantized models
If you have less than 13GB of video memory, the FP16 model cannot run; you can only run a quantized model, which requires modifying the code.
Open `cli_demo.py` or `web_demo.py` and find the following line that loads the model:
```python
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
```
Modify it as follows to use the quantized model:
```python
# Modify as needed; currently only 4/8-bit quantization is supported
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).quantize(4).half().cuda()
```
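To make the 4-bit idea concrete, here is a toy sketch of the storage trick behind INT4 quantization: each weight is mapped to a 4-bit integer, and two weights share a single byte, which is where the roughly 4x saving over FP16 comes from. This only illustrates the packing; ChatGLM's `quantize()` uses its own kernels and scaling scheme.

```python
def pack_int4(values):
    """Pack ints in [0, 15] into bytes, two values per byte (high/low nibble)."""
    assert len(values) % 2 == 0 and all(0 <= v <= 15 for v in values)
    return bytes((values[i] << 4) | values[i + 1]
                 for i in range(0, len(values), 2))

def unpack_int4(data):
    """Inverse of pack_int4: recover the original 4-bit values."""
    out = []
    for b in data:
        out.extend((b >> 4, b & 0x0F))
    return out

quantized = [3, 15, 0, 7]      # four 4-bit "weights"
packed = pack_int4(quantized)  # 2 bytes, versus 8 bytes at FP16
assert len(packed) == 2
assert unpack_int4(packed) == quantized
```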
Running results
Multi-card acceleration
If you have multiple graphics cards, you can use them together to speed up inference.
Again open the `cli_demo.py` or `web_demo.py` code.
Find the following line that loads the model:

```python
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
```
and change it to:
```python
from utils import load_model_on_gpus
model = load_model_on_gpus("THUDM/chatglm-6b", num_gpus=4)
```
The `num_gpus` parameter is the number of graphics cards to use.
I read the code of the `load_model_on_gpus` method: its `auto_configure_device_map` helper divides the transformer into 30 layers and distributes them evenly across the specified number of cards. Unlike the `CUDA_VISIBLE_DEVICES` environment variable, it cannot pick cards by index; they can only be assigned in order.
If you want to run other models on the same machine at the same time, consider starting ChatGLM first and then the others, or rewrite the `auto_configure_device_map` method so that it can target specific cards flexibly.
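The assignment logic described above can be sketched like this (simplified from the repo's `auto_configure_device_map`: the even-split policy and the 30-unit count follow that description, while the function signature and dict output are my own illustration):

```python
# Distribute 30 transformer "units" across num_gpus cards in order.
# This simplified version only computes the unit -> GPU index mapping;
# the real method maps module names for dispatching the model.

def configure_device_map(num_gpus: int, num_units: int = 30) -> dict:
    per_gpu = num_units // num_gpus        # units per card
    return {unit: min(unit // per_gpu, num_gpus - 1)
            for unit in range(num_units)}  # remainder lands on the last card

device_map = configure_device_map(num_gpus=4)
assert device_map[0] == 0 and device_map[29] == 3
assert sorted(set(device_map.values())) == [0, 1, 2, 3]
```

Note that the cards are always filled in index order 0, 1, 2, …, which is exactly why, unlike `CUDA_VISIBLE_DEVICES`, you cannot choose which physical cards get used.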
Authorization
The model cannot be used commercially out of the box. Reportedly, a commercial license costs 1.8 million RMB per year.
5 MOSS
Introduction
This is an open-source large model from Fudan University. The biggest difference from ChatGLM in practice is that inference is very slow.
MOSS is an open-source dialogue language model that supports Chinese-English bilingual interaction and a variety of plugins. The `moss-moon` series models have 16 billion parameters; at FP16 precision they can run on a single A100/A800 or two 3090 graphics cards, and at INT4/8 precision on a single 3090. The MOSS base language model was pre-trained on about 700 billion Chinese, English, and code tokens. After dialogue instruction fine-tuning, plugin-augmented learning, and human preference training, it is capable of multi-turn dialogue and of using several kinds of plugins.
Hardware requirements
| Quantization level | Loading model | One round of dialogue (estimated) | Maximum dialogue length of 2048 |
| --- | --- | --- | --- |
| FP16 | 31GB | 42GB | 81GB |
| INT8 | 16GB | 24GB | 46GB |
| INT4 | 7.8GB | 12GB | 26GB |
Local deployment
Download Code
```
git clone https://github.com/OpenLMLab/MOSS.git
```
Create a virtual environment
conda management is recommended:

```
conda create -n moss python==3.8
```
Install dependencies
```
cd MOSS
conda activate moss
pip install -r requirements.txt
conda install cudatoolkit=11.7 -c nvidia
```
Download model and launch
There are two kinds of demos in the project code: command line and web interface. Choose any one to run, and the program will automatically download the pre-trained model from huggingface.
```
# command line demo
python moss_cli_demo.py

# simple web interface implemented with Gradio
python moss_web_demo_gradio.py
```
Modify the default model and multi-card acceleration
Because MOSS demands a lot of video memory, the 4-bit quantized model is used by default. I am deploying on a server with four T4 cards here, so I use the FP16 model directly.
Open `moss_web_demo_gradio.py` and find the following code:

```python
parser.add_argument("--model_name", default="fnlp/moss-moon-003-sft-int4", ...)
```
Change the `default` parameter to `fnlp/moss-moon-003-sft`.
Then configure multi-card acceleration by setting the `--gpu` parameter to the indices of the four graphics cards:
```python
parser.add_argument("--gpu", default="0,1,2,3", type=str)
```
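A minimal sketch of how such a comma-separated `--gpu` argument can be parsed and applied (the argument name mirrors the demo script above; applying it via `CUDA_VISIBLE_DEVICES` is my assumption of the usual pattern, not code from the MOSS repo):

```python
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--gpu", default="0,1,2,3", type=str)
args = parser.parse_args([])  # use the default here; a real script reads sys.argv

# Split the comma-separated string into GPU indices.
gpu_ids = [int(x) for x in args.gpu.split(",")]

# Restrict the process to these cards; this must be set before any CUDA
# library (e.g. torch) initializes, or it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = args.gpu
```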
Then start it, and you can see all four graphics cards being fully used.
The biggest impression after using it: it is slow. Generating an answer often takes one or two minutes.
I looked through the GitHub issues and many people are asking the same question. Even with two A100s it takes about 10s before generation starts, and producing a response takes around 100s. There seems to be no solution in the short term; we can only wait for official optimizations~
See:
- https://github.com/OpenLMLab/MOSS/issues/87
Authorization
The model is licensed under the `GNU AFFERO GENERAL PUBLIC LICENSE` and is free for commercial use.
6 References
- https://wiki.mbalib.com/wiki/AIGC
- https://en.wikipedia.org/wiki/Large_language_model
- https://gitee.com/oschina/awesome-llm
- https://github.com/Hannibal046/Awesome-LLM
- Install NVIDIA Graphics Driver – https://www.zhihu.com/tardis/zm/art/59618999?source_id=1003