Llama2 local deployment on Windows & Linux via llama.cpp model quantization

What are LLaMA 1 and 2?

LLaMA is a family of foundation language models ranging from 7B to 65B parameters. The models were trained on trillions of tokens and show that state-of-the-art models can be trained exclusively on publicly available datasets, without resorting to proprietary and inaccessible data. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B.

Llama2 is Meta's successor to LLaMA: a series of models (7B, 13B, 70B) that are open source and licensed for commercial use. Llama2 surpasses Llama1 on every benchmark list and also beats all previous open-source models.

However, the hardware requirements for running such large models locally are still fairly high, so this guide uses the open-source llama.cpp project to quantize the models. The CPU quantized build was tested on the Windows platform, and the GPU quantized build was tested on the Linux platform.

Note: All of the download steps below require unrestricted Internet access (in mainland China that means a VPN/proxy); otherwise the process will be very painful.

Experimental equipment details (for reference)

Windows platform

Laptop: Lenovo Legion Y9000P

  • CPU: 13th Gen Intel i9-13900HX × 1

  • GPU: NVIDIA GeForce RTX 4060 (8GB) × 1

  • Memory: 32GB

Operation status: the CPU build runs the 8-bit quantized llama2-13B-chat smoothly, while the 16-bit version lags. The GPU build is much faster, comparable to the generation speed of Wenxin Yiyan (ERNIE Bot) or ChatGPT.

Linux platform

Lab server

  • CPU: 9th Gen Intel Core i9-9940X @ 3.30GHz × 14

  • GPU: NVIDIA GeForce RTX 2080 Ti (11GB) × 4

  • Memory: 64GB

Operation status: both 13B and 7B run smoothly; 70B suddenly could not be downloaded for some reason, so it was not tested.

Detailed model deployment steps

Download and configure the llama library

  • Download llama

    git clone https://github.com/facebookresearch/llama.git
    
  • Configure the environment

    Create a virtual environment to prevent conflicts caused by packages previously installed in other environments

    conda create -n llama python=3.10
    

    Enter virtual environment

    conda activate llama
    

    Enter the project directory

    cd llama
    

    Install environment dependencies

    pip install -e .
    
  • Apply for model download link

    Open this link: the Meta website, apply to download the models, and fill in the form truthfully. To get approved as quickly as possible, you can list a US institution or school, which seems to go faster; at the time I did not dare try a Chinese one for fear of being rejected (OpenAI had already scared me).

    A confirmation email will arrive shortly afterwards; copy the download URL it contains.

  • Download model

    • Windows platform

      sh download.sh
      
    • Linux platform

      bash download.sh
      

    Then follow the prompts: paste the link you copied earlier and select the model(s) you want to download. You can look up the differences between the variants yourself; the chat version is recommended here. In terms of parameters, most machines can run 7B; I use the 13B version and it also runs fine, so choose according to your needs. (A quick integrity check for the downloaded files is sketched at the end of this section.)

    • Note: when downloading on the Windows platform you may hit the error wget: command not found. Just follow the link below.

      Solution to the error wget: command not found when running a .sh file on Windows 10
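
    To double-check the downloaded weights before moving on, you can verify them against the bundled checksum file. A minimal sketch (assuming the coreutils md5sum command is available, e.g. via Git Bash on Windows; adjust the folder name to whichever model you downloaded):

    cd llama-2-13b-chat        # folder created by download.sh
    md5sum -c checklist.chk    # every file should report "OK"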

Download and configure the llama.cpp library

  • Download llama.cpp

    git clone https://github.com/ggerganov/llama.cpp.git
    
    cd llama.cpp
    
  • Compile and build

    • Linux platform

      In the project directory, simply run make:

      make
      

      I have tested this on an AutoDL server and on the lab server without any problems.

    • Windows platform

      The Windows platform requires cmake and gcc. I had already installed them on my machine; if you don't have them, look up how to install them first.
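
      A quick way to confirm both tools are installed and on PATH (a simple check, not part of the original guide):

      cmake --version
      gcc --version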

      Compile:

      mkdir build
      
      cd build
      
      cmake ..
      
      cmake --build . --config Release
      
  • CUDA-accelerated build: just add an extra flag to the build commands

    • Linux platform

      make LLAMA_CUBLAS=1
      
    • Windows platform

      mkdir build
      cd build
      cmake .. -DLLAMA_CUBLAS=ON
      cmake --build . --config Release
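
If the CUDA build fails, or GPU offloading does not seem to take effect later, it is worth first confirming that the NVIDIA driver and CUDA toolkit are visible (assuming nvidia-smi and nvcc are on PATH):

    nvidia-smi        # driver version and available GPUs
    nvcc --version    # CUDA compiler used for the cuBLAS build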
      

Model quantization

  • Prepare data

    Copy the downloaded model data (e.g. llama-2-7B-chat) from the llama directory into ./models in llama.cpp. At the same time, copy tokenizer_checklist.chk and tokenizer.model from llama's main directory into ./models. (A copy-command sketch follows the directory listing below.)

    Refer to the following:

    G:.
    │ .editorconfig
    │ ggml-vocab-aquila.gguf
    │ ggml-vocab-baichuan.gguf
    │ ggml-vocab-falcon.gguf
    │ ggml-vocab-gpt-neox.gguf
    │ ggml-vocab-llama.gguf
    │ ggml-vocab-mpt.gguf
    │ ggml-vocab-refact.gguf
    │ ggml-vocab-starcoder.gguf
    │ tokenizer.model
    │ tokenizer_checklist.chk
    │
    └─13B
            checklist.chk
            consolidated.00.pth
            consolidated.01.pth
            params.json
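
    On Linux, the copy step might look like the sketch below (the source folder name depends on which model you downloaded, and the layout assumes llama and llama.cpp were cloned side by side):

    mkdir -p ./models/13B
    cp ../llama/llama-2-13b-chat/* ./models/13B/
    cp ../llama/tokenizer.model ../llama/tokenizer_checklist.chk ./models/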
    
  • Quantize the model

    Enter the llama.cpp directory and activate the virtual environment

    cd llama.cpp
    
    conda activate llama
    

    Install dependencies

    pip install -r requirements.txt
    

    Perform the 16-bit conversion

    python convert.py models/13B/
    

    If you get an error at this step, edit ./models/(model folder)/params.json and change the value of the last "vocab_size" entry to 32000.
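
    For example, one way to make that edit (a small sketch that assumes the 13B folder; back the file up first):

    # set "vocab_size" to 32000 (it is often -1 in the original params.json); adjust the path to your model folder
    python -c "import json; p='models/13B/params.json'; d=json.load(open(p)); d['vocab_size']=32000; json.dump(d, open(p,'w'), indent=2)"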

    • Linux: 4-bit or 8-bit quantization

      ./quantize ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q4_0.gguf q4_0
      

      Adjust the paths to match your own setup. For 8-bit quantization, change q4_0 in the command to q8_0:

      ./quantize ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q8_0.gguf q8_0
      

      8-bit quality is definitely better than 4-bit, but which to choose depends on your hardware.

    • Windows: 4-bit or 8-bit quantization

      .\build\bin\Release\quantize.exe .\models\13B\ggml-model-f16.gguf .\models\13B\ggml-model-q4_0.gguf q4_0
      

      To change the bit width, refer to the Linux section above.
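
When choosing between 4-bit and 8-bit, a rough size estimate can help; the figures below are back-of-the-envelope approximations for the llama.cpp block formats of that time, not measurements:

    # rough file size ≈ parameter count × bits per weight ÷ 8
    # q4_0 ≈ 4.5 bits/weight → 13B ≈ 7.3 GB
    # q8_0 ≈ 8.5 bits/weight → 13B ≈ 13.8 GB
    # f16  = 16 bits/weight  → 13B ≈ 26 GB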

Load and start the model

CPU version

  • Windows platform

    .\build\bin\Release\main.exe -m .\models\13B\ggml-model-q4_0.gguf -n 256 -t 18 --repeat_penalty 1.0 --color -i -r "User:" -f .\prompts\chat-with-bob.txt
    
  • Linux platform

    ./main -m ./models/13B/ggml-model-q8_0.gguf -n 256 -t 18 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt
    

GPU acceleration

Just add the -ngl option (the number of layers to offload to the GPU) to the command, e.g. -ngl 1

The number can be adjusted (the maximum here is 35); on my RTX 4060, a value of 20 gave the best results

  • Windows platform

    .\build\bin\Release\main.exe -m .\models\13B\ggml-model-q4_0.gguf -n 256 -t 18 --repeat_penalty 1.0 --color -i -r "User:" -f .\prompts\chat-with-bob.txt -ngl 20
    
  • Linux platform

    ./main -m ./models/13B/ggml-model-q8_0.gguf -n 256 -t 18 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt -ngl 20
    

Enter your prompt after the > prompt; Cmd/Ctrl + C interrupts the output, and multi-line input ends each line with \. To view the help and parameter descriptions, run ./main -h. Here are some commonly used parameters:

-c controls the context length; the larger the value, the longer the conversation history that can be referenced (default: 512)
-ins starts an instruction mode for ChatGPT-style dialogue
-f specifies a prompt template; for the alpaca model, load prompts/alpaca.txt
-n controls the maximum length of the generated reply (default: 128)
-b controls the batch size (default: 8), which can be increased moderately
-t controls the number of threads (default: 4), which can be increased moderately
--repeat_penalty controls the penalty for repeated text in generated replies
--temp is the temperature coefficient; the lower the value, the less random the reply, and vice versa
--top_p, top_k control decoding-sampling parameters
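
For example, a minimal interactive run that combines several of these options might look like this (paths and values here are illustrative, not taken from the original guide):

    ./main -m ./models/13B/ggml-model-q4_0.gguf -ins -c 2048 -n 256 -t 8 --temp 0.7 --repeat_penalty 1.1 --color -ngl 20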

For specific information, refer to: https://github.com/ggerganov/llama.cpp/tree/master/examples/main