Llama2 local deployment on Windows & Linux via llama.cpp model quantization

What are LLaMA 1 and 2?

LLaMA is a family of foundation language models ranging from 7B to 65B parameters. The models were trained on trillions of tokens and show that state-of-the-art models can be trained exclusively on publicly available datasets, without resorting to proprietary and inaccessible data. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B.

Llama2 is Meta's successor to LLaMA: a series of models (7B, 13B, 70B) that are open source and licensed for commercial use. Llama2 surpasses Llama1 on every benchmark list and also beats all previous open-source models.

However, the hardware requirements for running such large models locally are still fairly high, so this guide uses the open-source llama.cpp project to quantize the models. The CPU quantized build was tested on the Windows platform, and the GPU quantized build was tested on the Linux platform.

Note: All of the download steps below require unrestricted Internet access (in mainland China that means a VPN/proxy); otherwise the process will be very painful.

Experimental equipment details (for reference)

Windows platform

Laptop: Lenovo Legion Y9000P

  • CPU: 13th Gen Intel i9-13900HX × 1

  • GPU: NVIDIA GeForce RTX 4060 (8GB) × 1

  • Memory: 32GB

Operation status: the CPU build runs the 8-bit quantized llama2-13B-chat smoothly, while the 16-bit version lags. The GPU build is much faster, comparable to the generation speed of Wenxin Yiyan (ERNIE Bot) or ChatGPT.

Linux platform

Lab server

  • CPU: 9th Gen Intel Core i9-9940X @ 3.30GHz × 14

  • GPU: NVIDIA GeForce RTX 2080 Ti (11GB) × 4

  • Memory: 64GB

Operation status: both 13B and 7B run smoothly; 70B suddenly could not be downloaded for some reason, so it was not tested.

Detailed model deployment steps

Download and configure the llama library

  • Download llama

    git clone https://github.com/facebookresearch/llama.git
    
  • Configure the environment

    Create a virtual environment to prevent conflicts caused by packages previously installed in other environments

    conda create -n llama python=3.10
    

    Enter virtual environment

    conda activate llama
    

    Enter the project directory

    cd llama
    

    Install environment dependencies

    pip install -e .
    
  • Apply for model download link

    Open this link: the Meta website, apply to download the models, and fill in the form truthfully. To get approved as quickly as possible, you can list a US institution or school, which seems to go faster; at the time I did not dare try a Chinese one for fear of being rejected (OpenAI had already scared me).

    A confirmation email will arrive shortly afterwards; copy the download URL it contains.

  • Download model

    • Windows platform

      sh download.sh
      
    • Linux platform

      bash download.sh
      

    Then follow the prompts: paste the link you copied earlier and select the model(s) you want to download. You can look up the differences between the variants yourself; the chat version is recommended here. In terms of parameters, most machines can run 7B; I use the 13B version and it also runs fine, so choose according to your needs. (A quick integrity check for the downloaded files is sketched at the end of this section.)

    • Note: when downloading on the Windows platform you may hit the error wget: command not found. Just follow the link below.

      Solution to the error wget: command not found when running a .sh file on Windows 10
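
    To double-check the downloaded weights before moving on, you can verify them against the bundled checksum file. A minimal sketch (assuming the coreutils md5sum command is available, e.g. via Git Bash on Windows; adjust the folder name to whichever model you downloaded):

    cd llama-2-13b-chat        # folder created by download.sh
    md5sum -c checklist.chk    # every file should report "OK"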

Download and configure the llama.cpp library

  • Download llama.cpp

    git clone https://github.com/ggerganov/llama.cpp.git
    
    cd llama.cpp
    
  • Compile and build

    • Linux platform

      In the project directory, simply run make:

      make
      

      I have tested this on an AutoDL server and on the lab server without any problems.

    • Windows platform

      The Windows platform requires cmake and gcc. I had already installed them on my machine; if you don't have them, look up how to install them first.
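
      A quick way to confirm both tools are installed and on PATH (a simple check, not part of the original guide):

      cmake --version
      gcc --version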

      Compile:

      mkdir build
      
      cd build
      
      cmake ..
      
      cmake --build . --config Release
      
  • CUDA-accelerated build: just add an extra flag to the build commands

    • Linux platform

      make LLAMA_CUBLAS=1
      
    • Windows platform

      mkdir build
      cd build
      cmake .. -DLLAMA_CUBLAS=ON
      cmake --build . --config Release
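
If the CUDA build fails, or GPU offloading does not seem to take effect later, it is worth first confirming that the NVIDIA driver and CUDA toolkit are visible (assuming nvidia-smi and nvcc are on PATH):

    nvidia-smi        # driver version and available GPUs
    nvcc --version    # CUDA compiler used for the cuBLAS build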
      

Model quantization

  • Prepare data

    Copy the downloaded model data (e.g. llama-2-7B-chat) from the llama directory into ./models in llama.cpp. At the same time, copy tokenizer_checklist.chk and tokenizer.model from llama's main directory into ./models. (A copy-command sketch follows the directory listing below.)

    Refer to the following:

    G:.
    │ .editorconfig
    │ ggml-vocab-aquila.gguf
    │ ggml-vocab-baichuan.gguf
    │ ggml-vocab-falcon.gguf
    │ ggml-vocab-gpt-neox.gguf
    │ ggml-vocab-llama.gguf
    │ ggml-vocab-mpt.gguf
    │ ggml-vocab-refact.gguf
    │ ggml-vocab-starcoder.gguf
    │ tokenizer.model
    │ tokenizer_checklist.chk
    │
    └─13B
            checklist.chk
            consolidated.00.pth
            consolidated.01.pth
            params.json
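
    On Linux, the copy step might look like the sketch below (the source folder name depends on which model you downloaded, and the layout assumes llama and llama.cpp were cloned side by side):

    mkdir -p ./models/13B
    cp ../llama/llama-2-13b-chat/* ./models/13B/
    cp ../llama/tokenizer.model ../llama/tokenizer_checklist.chk ./models/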
    
  • Quantize the model

    Enter the llama.cpp directory and activate the virtual environment

    cd llama.cpp
    
    conda activate llama
    

    Install dependencies

    pip install -r requirements.txt
    

    Perform the 16-bit conversion

    python convert.py models/13B/
    

    If you get an error at this step, edit ./models/(model folder)/params.json and change the value of the last "vocab_size" entry to 32000.
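
    For example, one way to make that edit (a small sketch that assumes the 13B folder; back the file up first):

    # set "vocab_size" to 32000 (it is often -1 in the original params.json); adjust the path to your model folder
    python -c "import json; p='models/13B/params.json'; d=json.load(open(p)); d['vocab_size']=32000; json.dump(d, open(p,'w'), indent=2)"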

    • Linux: 4-bit or 8-bit quantization

      ./quantize ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q4_0.gguf q4_0
      

      Adjust the paths to match your own setup. For 8-bit quantization, change q4_0 in the command to q8_0:

      ./quantize ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q8_0.gguf q8_0
      

      8-bit quality is definitely better than 4-bit, but which to choose depends on your hardware.

    • Windows: 4-bit or 8-bit quantization

      .\build\bin\Release\quantize.exe .\models\13B\ggml-model-f16.gguf .\models\13B\ggml-model-q4_0.gguf q4_0
      

      To change the bit width, refer to the Linux section above.
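
When choosing between 4-bit and 8-bit, a rough size estimate can help; the figures below are back-of-the-envelope approximations for the llama.cpp block formats of that time, not measurements:

    # rough file size ≈ parameter count × bits per weight ÷ 8
    # q4_0 ≈ 4.5 bits/weight → 13B ≈ 7.3 GB
    # q8_0 ≈ 8.5 bits/weight → 13B ≈ 13.8 GB
    # f16  = 16 bits/weight  → 13B ≈ 26 GB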

Load and start the model

CPU version

  • Windows platform

    .\build\bin\Release\main.exe -m .\models\13B\ggml-model-q4_0.gguf -n 256 -t 18 --repeat_penalty 1.0 --color -i -r "User:" -f .\prompts\chat-with-bob.txt
    
  • Linux platform

    ./main -m ./models/13B/ggml-model-q8_0.gguf -n 256 -t 18 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt
    

GPU acceleration

Just add the -ngl option (the number of layers to offload to the GPU) to the command, e.g. -ngl 1

The number can be adjusted (the maximum here is 35); on my RTX 4060, a value of 20 gave the best results

  • Windows platform

    .\build\bin\Release\main.exe -m .\models\13B\ggml-model-q4_0.gguf -n 256 -t 18 --repeat_penalty 1.0 --color -i -r "User:" -f .\prompts\chat-with-bob.txt -ngl 20
    
  • Linux platform

    ./main -m ./models/13B/ggml-model-q8_0.gguf -n 256 -t 18 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt -ngl 20
    

Enter your prompt after the > prompt; Cmd/Ctrl + C interrupts the output, and multi-line input ends each line with \. To view the help and parameter descriptions, run ./main -h. Here are some commonly used parameters:

-c controls the context length; the larger the value, the longer the conversation history that can be referenced (default: 512)
-ins starts an instruction mode for ChatGPT-style dialogue
-f specifies a prompt template; for the alpaca model, load prompts/alpaca.txt
-n controls the maximum length of the generated reply (default: 128)
-b controls the batch size (default: 8), which can be increased moderately
-t controls the number of threads (default: 4), which can be increased moderately
--repeat_penalty controls the penalty for repeated text in generated replies
--temp is the temperature coefficient; the lower the value, the less random the reply, and vice versa
--top_p, top_k control decoding-sampling parameters
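
For example, a minimal interactive run that combines several of these options might look like this (paths and values here are illustrative, not taken from the original guide):

    ./main -m ./models/13B/ggml-model-q4_0.gguf -ins -c 2048 -n 256 -t 8 --temp 0.7 --repeat_penalty 1.1 --color -ngl 20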

For specific information, refer to: https://github.com/ggerganov/llama.cpp/tree/master/examples/main