Llama2 local deployment on Windows & Linux with llama.cpp model quantization
What are LLaMA 1 and 2?
LLaMA is a family of foundation language models ranging from 7B to 65B parameters. The models were trained on trillions of tokens and show that state-of-the-art models can be trained exclusively on publicly available datasets, without resorting to proprietary and inaccessible data. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B.
Llama2 is Meta's successor to LLaMA: a series of models (7B, 13B, 70B) that are open source and licensed for commercial use. Llama2 beats Llama1 on every benchmark and also outperforms all previous open-source models.
However, the hardware requirements for running these large models locally are still fairly high, so this walkthrough uses the open-source llama.cpp project for model quantization. The CPU-quantized build was tested on Windows and the GPU-accelerated build on Linux.
Note: all of the download steps below require unrestricted Internet access (from mainland China this means a VPN); otherwise the process will be very painful.
Experimental equipment details (for reference)
Windows platform
A Lenovo Legion Y9000P laptop
- CPU: 13th-gen Intel i9-13900HX × 1
- GPU: NVIDIA GeForce RTX 4060 (8GB) × 1
- Memory: 32GB
Operation status: the CPU runs the 8-bit quantized llama2-13B-chat smoothly, while the 16-bit quantized version runs with noticeable lag. The GPU-accelerated build is very fast, comparable to the generation speed of Wenxin Yiyan (ERNIE Bot) or ChatGPT.
Linux platform
lab server
- CPU: 9th-gen Intel® Core™ i9-9940X @ 3.30GHz × 14
- GPU: NVIDIA GeForce RTX 2080 Ti (11GB) × 4
- Memory: 64GB
Operation status: both 13B and 7B run smoothly; 70B could not be downloaded for some reason and therefore was not tested.
Detailed model deployment steps
Download and configure the llama library
- Download llama
git clone https://github.com/facebookresearch/llama.git
- Configure the environment
Create a virtual environment to avoid conflicts with packages installed in other environments
conda create -n llama python=3.10
Enter virtual environment
conda activate llama
Enter the project directory
cd llama
Install environment dependencies
pip install -e .
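As a quick sanity check (my addition, not part of the original steps), you can confirm the editable install worked by importing the package:
python -c "import llama; print(llama.__file__)"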
- Apply for a model download link
Open this link: Meta website, apply to download the model, and fill in the form truthfully. To get approved as quickly as possible, you can list a US institution or school, which should be faster; at the time I did not dare try a Chinese one for fear of being rejected (OpenAI had scared me).
A short while later you will receive an email like the one below; copy the URL from the blurred-out part:
- Download the model
- Windows platform
sh download.sh
- Linux platform
bash download.sh
Then follow the prompts: paste the link you copied earlier and select the models you need to download. You can search for the differences between the models yourself; the chat versions are recommended here. In terms of size, most devices can run 7B; the 13B version that I use also runs fine, so choose according to your needs.
- Note: when downloading on Windows, you may hit the error wget: command not found. Just follow the link below:
Solution to the error wget: command not found when running .sh file in Windows 10 environment
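A common fix (my summary; check the linked post for details) is to download a Windows build of wget.exe and place it somewhere Git Bash can find it, for example (paths are only illustrative, adjust for your install):
cp ~/Downloads/wget.exe "/c/Program Files/Git/mingw64/bin/"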
Download and configure the llama.cpp library
- Download llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
- Compile and build
- Linux platform
Just run make in the project directory:
make
I have tested this on both an autodl server and the lab server with no problems.
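Optionally (my addition), you can confirm the build succeeded by printing the help text of the generated binary:
./main -h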
- Windows platform
The Windows platform needs cmake and gcc installed. I already had them on my machine; if you don't, search for installation instructions and install them first.
Compile:
mkdir build
cd build
cmake ..
cmake --build . --config Release
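As on Linux, a quick way to confirm the Windows build worked (my addition) is to run the generated executable's help:
.\build\bin\Release\main.exe -h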
- CUDA-accelerated build, which only needs a few extra options
- Linux platform
make LLAMA_CUBLAS=1
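If the cuBLAS build fails, first make sure the CUDA toolkit and driver are visible on the machine (these are standard NVIDIA tools, not part of llama.cpp):
nvcc --version
nvidia-smi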
- Windows platform
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release
Model quantization
- Prepare the data
Copy the downloaded model data (llama-2-7B-chat) from llama into ./models in llama.cpp. Also copy tokenizer_checklist.chk and tokenizer.model from the llama root directory into ./models (example copy commands follow the listing below).
The result should look like this:
G:.
│   .editorconfig
│   ggml-vocab-aquila.gguf
│   ggml-vocab-baichuan.gguf
│   ggml-vocab-falcon.gguf
│   ggml-vocab-gpt-neox.gguf
│   ggml-vocab-llama.gguf
│   ggml-vocab-mpt.gguf
│   ggml-vocab-refact.gguf
│   ggml-vocab-starcoder.gguf
│   tokenizer.model
│   tokenizer_checklist.chk
│
└─13B
        checklist.chk
        consolidated.00.pth
        consolidated.01.pth
        params.json
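A sketch of the copy commands on Linux, assuming llama and llama.cpp are sibling directories and the 13B chat weights were downloaded into ../llama/llama-2-13b-chat (your folder names may differ):
mkdir -p ./models/13B
cp ../llama/llama-2-13b-chat/* ./models/13B/
cp ../llama/tokenizer.model ../llama/tokenizer_checklist.chk ./models/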
- Quantize
Enter the virtual environment and install dependencies
cd llama.cpp
conda activate llama
Install dependencies
pip install -r requirements.txt
Perform the 16-bit conversion
python convert.py models/13B/
If you get an error in this step, modify ./models/(model storage folder)/params.json: just change the value of the last "vocab_size" field to 32000.
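For example, on Linux the change can be made with a one-liner, assuming the field currently reads -1 (check the file first; on Windows just edit it in a text editor):
sed -i 's/"vocab_size": -1/"vocab_size": 32000/' ./models/13B/params.json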
- Linux 4-bit or 8-bit quantization
./quantize ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q4_0.gguf q4_0
Adjust the paths to match your own setup. For 8-bit quantization, change q4_0 in the command to q8_0:
./quantize ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q8_0.gguf q8_0
8-bit is definitely better than 4-bit, but it depends on your hardware.
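If you want to see which other quantization type names your build supports, running quantize with no arguments should print a usage message listing them (behaviour may vary slightly between llama.cpp versions):
./quantize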
- Windows 4-bit or 8-bit quantization
.\build\bin\Release\quantize.exe .\models\13B\ggml-model-f16.gguf .\models\13B\ggml-model-q4_0.gguf q4_0
To change the bit width, refer to the Linux section above.
Load and start the model
CPU version
- Windows platform
.\build\bin\Release\main.exe -m .\models\13B\ggml-model-q4_0.gguf -n 256 -t 18 --repeat_penalty 1.0 --color -i -r "User:" -f .\prompts\chat-with-bob.txt
- Linux platform
./main -m ./models/13B/ggml-model-q8_0.gguf -n 256 -t 18 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt
GPU acceleration
Just add -ngl 1 to the command. The number can be changed (the maximum is 35); on my RTX 4060 a value of 20 gave the best results.
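To find a good value for your own card, you can watch GPU memory usage while the model runs and raise -ngl until the VRAM is nearly full (nvidia-smi is a standard NVIDIA tool, not part of llama.cpp):
nvidia-smi -l 1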
- Windows platform
.\build\bin\Release\main.exe -m .\models\13B\ggml-model-q4_0.gguf -n 256 -t 18 --repeat_penalty 1.0 --color -i -r "User:" -f .\prompts\chat-with-bob.txt -ngl 20
- Linux platform
./main -m ./models/13B/ggml-model-q8_0.gguf -n 256 -t 18 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt -ngl 20
Enter your prompt after the > prompt; press Ctrl+C to interrupt the output, and end multi-line input with \. To view help and parameter descriptions, run ./main -h. Here are some commonly used parameters:
- -c controls the context length; the larger the value, the longer the conversation history that can be referenced (default: 512)
- -ins starts the instruction mode for ChatGPT-style dialogue
- -f specifies a prompt template; load prompts/alpaca.txt for alpaca models
- -n controls the maximum length of the generated reply (default: 128)
- -b controls the batch size (default: 8), which can be increased appropriately
- -t controls the number of threads (default: 4), which can be increased appropriately
- --repeat_penalty controls the penalty for repeated text in generated replies
- --temp temperature coefficient; the lower the value, the less random the reply, and vice versa
- --top_p, top_k control decoding-sampling parameters
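As an illustration only (my own example, not from the original steps; adjust paths and values for your setup), an instruction-mode run of the 4-bit 13B model combining several of these parameters might look like:
./main -m ./models/13B/ggml-model-q4_0.gguf -ins -c 2048 -n 512 -t 8 --temp 0.7 --top_p 0.9 --repeat_penalty 1.1 -ngl 20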
For specific information, refer to: https://github.com/ggerganov/llama.cpp/tree/master/examples/main