LLM Exploration: Environment Construction and Model Local Deployment

1 Preface

Recently I have been doing alchemy (AIGC), and suddenly found that writing business code is boring…

Last time I published an article on AI drawing. Although ChatGPT itself cannot be self-hosted, there are still many open-source LLMs; as long as you have a decent graphics card, deploying an LLM locally is no problem.

This article will introduce the local deployment of the following two domestic open-source LLMs:

  • ChatGLM-6B

  • MOSS

This article focuses on simply getting the models running; I will continue to publish my recent exploration notes in the LLM direction afterwards~

2 Concepts

Before we start, let’s look at some basic concepts.

AIGC

Quoting the content of mbalib below

AIGC (AI Generated Content) is content generated by artificial intelligence, also known as "generative AI". It is considered a new content creation method following Professionally Generated Content (PGC) and User Generated Content (UGC).

Internet content production has gone through the progression PGC → UGC → AIGC. PGC (Professionally Generated Content) is professionally produced content, such as the text and video produced by professionals in the Web 1.0 era and the broadcasting industry; it is characterized by professionalism and guaranteed content quality. UGC (User Generated Content) is user-generated content, born with the concept of Web 2.0; it is characterized by users freely uploading content and by the richness of that content. AIGC (AI Generated Content) is content generated by AI, characterized by automated production and high efficiency. With the maturing of natural language generation (NLG) technology and AI models, AIGC has gradually attracted wide attention; it can now automatically generate text, images, audio, video, and even 3D models and code.

The recently popular ChatGPT and AI drawing tools both belong to this field.

LLM

Quote from wikipedia below

A large language model (LLM) is a language model consisting of a neural network with many parameters (typically billions of weights or more), trained on large quantities of unlabeled text using self-supervised learning or semi-supervised learning. LLMs emerged around 2018 and perform well at a wide variety of tasks. This has shifted the focus of natural language processing research away from the previous paradigm of training specialized supervised models for specific tasks.

In Chinese, it is known as 大语言模型 ("large language model"). The now extremely popular ChatGPT is the representative LLM. A large model has one key attribute: the number of parameters. Parameter count determines a large model's capability (not absolutely, but the two are positively correlated).

The following are the parameter quantities of common LLMs:

LLM name        Parameter count
ChatGPT 3.5     175B
ChatGLM         6B
MOSS            16B
LLaMA           7B/13B/33B/65B

Due to space constraints, only these few are listed; more can be found in the reference materials at the end of the article.

3 Build environment

Hardware

First, there must be a Linux system server/computer equipped with an NVIDIA graphics card.

The video memory needs to be at least 8GB, otherwise it won’t work~

For the OS, the latest Ubuntu (22.04) or one of its derivatives is recommended. Below are the two server configurations I used during testing.

Server 1

  • CPU: Intel(R) Core(TM) i9-10940X CPU @ 3.30GHz

  • Memory: 64G

  • Graphics card: NVIDIA GeForce RTX 2080 Ti

Server 2

  • CPU: Intel(R) Xeon(R) Gold 5318Y CPU @ 2.10GHz x2

  • Memory: 128G

  • Graphics card: Tesla T4 x4

Software

After talking about the hardware, let’s look at the software.

Driver

First of all, you need the graphics card driver. Installing the NVIDIA driver on Ubuntu-based distributions is easier than drinking water, which is why Ubuntu is recommended for alchemy.

PS: It really can be done with one click. There is no need to follow those blog posts copied hundreds of times around the Internet, downloading a pile of things, compiling, and blacklisting nouveau~

The Ubuntu desktop version can directly use the “Software Update” App to install the graphics card driver with one click.

On the Ubuntu server version, use the nvidia-detector command to detect which driver version should be installed, for example:

$ nvidia-detector
nvidia-driver-530

Use ubuntu-drivers list to get the list of installable drivers, example:

$ ubuntu-drivers list
nvidia-driver-418-server, (kernel modules provided by nvidia-dkms-418-server)
nvidia-driver-530, (kernel modules provided by linux-modules-nvidia-530-generic-hwe-22.04)
nvidia-driver-450-server, (kernel modules provided by linux-modules-nvidia-450-server-generic-hwe-22.04)
nvidia-driver-515, (kernel modules provided by linux-modules-nvidia-515-generic-hwe-22.04)
nvidia-driver-470-server, (kernel modules provided by linux-modules-nvidia-470-server-generic-hwe-22.04)
nvidia-driver-525-server, (kernel modules provided by linux-modules-nvidia-525-server-generic-hwe-22.04)
nvidia-driver-515-server, (kernel modules provided by linux-modules-nvidia-515-server-generic-hwe-22.04)
nvidia-driver-510, (kernel modules provided by linux-modules-nvidia-510-generic-hwe-22.04)
nvidia-driver-525, (kernel modules provided by linux-modules-nvidia-525-generic-hwe-22.04)
nvidia-driver-470, (kernel modules provided by linux-modules-nvidia-470-generic-hwe-22.04)

Then use ubuntu-drivers install nvidia-driver-530 to install the driver, example:

$ ubuntu-drivers install nvidia-driver-530

All the available drivers are already installed.

It’s that simple.
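After installation (a reboot may be needed to load the new kernel module), you can verify that the driver works with nvidia-smi, which should list the card along with the driver and CUDA versions:

$ nvidia-smi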

PS: Of course, you can also download the driver yourself from the NVIDIA official website; for details, see the reference materials.

Python

For AI work, Python is a must, but we do not use the system’s Python environment directly; instead we manage environments with conda.

It is recommended to use miniconda3 which is lighter than anaconda.

After installing miniconda3 according to the instructions on its official website, you only need the following command to create a Python environment of a specified version:

conda create -n <env-name> python=3.10
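For example, to create and then activate an environment named demo (the name is arbitrary):

conda create -n demo python=3.10
conda activate demo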

If you run into network problems, you can refer to my previous article on configuring a domestic pip mirror to speed up installing third-party Python libraries~
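As a minimal example (the Tsinghua mirror below is just one common choice), pip can be switched to a domestic index like this:

pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple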

4 ChatGLM-6B

Introduction

This is an open-source LLM developed by Tsinghua University and Zhipu AI. As of this writing, it is regarded as the ceiling among domestic open-source LLMs~

ChatGLM-6B is an open-source, Chinese-English bilingual conversational language model based on the General Language Model (GLM) architecture, with 6.2 billion parameters. Combined with model quantization technology, it can be deployed locally on consumer-grade graphics cards (as little as 6GB of video memory at the INT4 quantization level). ChatGLM-6B uses technology similar to ChatGPT and is optimized for Chinese Q&A and dialogue. After training on about 1T tokens of Chinese-English bilingual data, supplemented by supervised fine-tuning, feedback bootstrapping, and reinforcement learning with human feedback, the 6.2-billion-parameter ChatGLM-6B is able to generate answers quite in line with human preferences.

Hardware requirements

Quantization level       Min GPU memory (inference)   Min GPU memory (parameter-efficient fine-tuning)
FP16 (no quantization)   13 GB                        14 GB
INT8                     8 GB                         9 GB
INT4                     6 GB                         7 GB

Local deployment

Download project code

git clone https://github.com/THUDM/ChatGLM-6B.git

PS: You can also use my forked version, which mainly makes the following modifications:

  • Multi-card acceleration enabled by default for deployment and fine-tuning

  • A rewritten, more intuitive API interface

To use it, just clone with the following command instead:

git clone https://github.com/Deali-Axy/ChatGLM-6B.git

Create a virtual environment

It is recommended to manage it with conda:

conda create -n chatglm python=3.8

Install dependencies

cd ChatGLM-6B
conda activate chatglm
pip install -r requirements.txt
conda install cudatoolkit=11.7 -c nvidia

PS: If cudatoolkit is not installed, you will get RuntimeError: Library cudart is not initialized

However, some people in the issues also said this can be worked around by quantizing the model on the CPU and loading the quantized weights directly; I have not tried this yet.

Issues address: https://github.com/THUDM/ChatGLM-6B/issues/115
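Before downloading the model, it is also worth a quick sanity check that PyTorch can actually see the GPU(s) from inside the chatglm environment:

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

If this prints False, fix the driver/cudatoolkit setup first.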

Download model and launch

There are two demos in the project code: a command-line one and a web interface. Run either one, and the program will automatically download the pre-trained model from Hugging Face.

PS: In theory, the model can be downloaded from Hugging Face directly; if you run into network problems, use a proxy or download the model from the official cloud drive.
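If you do download the weights manually, you can point from_pretrained at the local directory instead of the Hugging Face model name (the path below is just an example):

from transformers import AutoModel
# load from a local directory instead of downloading from Hugging Face
model = AutoModel.from_pretrained("/data/models/chatglm-6b", trust_remote_code=True).half().cuda()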

# command line demo
python cli_demo.py
# Simple web interface implemented using Gradio
python web_demo.py

Gradio’s default port is 7860; you can customize it by passing the server_port parameter to the launch() method.
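For example, assuming the Gradio app object in web_demo.py is named demo (adjust to the actual variable in the file), the final launch call could look like this:

# hypothetical variable name; adapt to the actual launch() call in web_demo.py
demo.queue().launch(server_name="0.0.0.0", server_port=8080)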

Using quantized models

If you have less than 13GB of video memory, the FP16 model cannot run; you can only run a quantized model, which requires a small code change.

Open cli_demo.py or web_demo.py from above, and find the following model-loading code:

model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()

Modify it as follows to use the quantized model:

# Modify as needed, currently only supports 4/8 bit quantization
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).quantize(4).half().cuda()

Running result

(screenshot: the ChatGLM-6B demo running)

Multi-card acceleration

If you have multiple graphics cards, you can use them to speed up inference.

Again open cli_demo.py or web_demo.py, and find the model-loading code:

model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()

and change it to:

from utils import load_model_on_gpus
model = load_model_on_gpus("THUDM/chatglm-6b", num_gpus=4)

The num_gpus parameter is the number of graphics cards to use.

I read the code of the load_model_on_gpus method: it splits the transformer into 30 layers via the auto_configure_device_map method and then distributes them across the specified number of graphics cards. Unlike the CUDA_VISIBLE_DEVICES environment variable, it cannot pick cards by device number; cards can only be assigned in order.

If you want to run other models on the same machine at the same time, consider starting ChatGLM first and then the others, or rewrite the auto_configure_device_map method so that the graphics cards can be specified flexibly, as sketched below.
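A simpler workaround, untested here but standard PyTorch behavior, is to combine the two mechanisms: restrict which physical cards the process can see via CUDA_VISIBLE_DEVICES, and let load_model_on_gpus assign layers in order among them:

# expose only physical cards 2 and 3; inside the process they become cuda:0 and cuda:1
CUDA_VISIBLE_DEVICES=2,3 python web_demo.py

with num_gpus=2 passed to load_model_on_gpus in the code.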

Authorization

The model cannot be used directly for commercial purposes. It is said that commercial use requires purchasing a license costing 1.8 million RMB per year.

5 MOSS

Introduction

This is a large open-source model from Fudan University. In use, the biggest difference from ChatGLM is that its inference speed is very slow.

MOSS is an open-source dialogue language model that supports Chinese-English bilingual conversation and various plugins. The moss-moon series models have 16 billion parameters and can run on a single A100/A800 or two 3090 graphics cards at FP16 precision, or on a single 3090 graphics card at INT4/8 precision. The MOSS base language model was pre-trained on roughly 700 billion Chinese, English, and code tokens; after supervised fine-tuning on dialogue instructions, plugin-augmented learning, and human-preference training, it can hold multi-turn dialogues and use multiple plugins.

Hardware requirements

Quantization level   Model loaded   After one round of dialogue (est.)   At max dialogue length (2048)
FP16                 31GB           42GB                                 81GB
INT8                 16GB           24GB                                 46GB
INT4                 7.8GB          12GB                                 26GB
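Before choosing a precision, it is handy to check how much free video memory each card currently has, e.g. with nvidia-smi’s query options:

$ nvidia-smi --query-gpu=index,memory.free --format=csv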

Local deployment

Download Code

git clone https://github.com/OpenLMLab/MOSS.git

Create a virtual environment

It is recommended to manage it with conda:

conda create -n moss python=3.8

Install dependencies

cd MOSS
conda activate moss
pip install -r requirements.txt
conda install cudatoolkit=11.7 -c nvidia

Download model and launch

There are two demos in the project code: a command-line one and a web interface. Run either one, and the program will automatically download the pre-trained model from Hugging Face.

# command line demo
python moss_cli_demo.py
# Simple web interface implemented using Gradio
python moss_web_demo_gradio.py

Modify the default model and multi-card acceleration

Because MOSS demands quite a lot of video memory, it uses the 4-bit quantized model by default. Since I am deploying on a server with four T4 cards here, I use the FP16 model directly.

Modify moss_web_demo_gradio.py and find the following code:

parser.add_argument("--model_name", default="fnlp/moss-moon-003-sft-int4",
                    ...)

Change the default value to fnlp/moss-moon-003-sft.

Then configure multi-card acceleration by setting the --gpu parameter to the numbers of the four graphics cards:

parser.add_argument("--gpu", default="0,1,2,3", type=str)

Then start it, and you can see all four graphics cards under full load:

(screenshot: all four graphics cards under full load)

The biggest impression after using it is that it is slow: generating an answer often takes a minute or two.

I looked through the GitHub issues and many people are asking the same question. Even with two A100s it takes about 10s to start generating, and generation takes around 100s. It seems there will be no fix in the short term; we can only wait for official optimization~

See:

  • https://github.com/OpenLMLab/MOSS/issues/87

Authorization

The model is licensed under the GNU AFFERO GENERAL PUBLIC LICENSE and is free for commercial use.

6 References

  • https://wiki.mbalib.com/wiki/AIGC

  • https://en.wikipedia.org/wiki/Large_language_model

  • https://gitee.com/oschina/awesome-llm

  • https://github.com/Hannibal046/Awesome-LLM

  • Install NVIDIA Graphics Driver – https://www.zhihu.com/tardis/zm/art/59618999?source_id=1003
