Fine-Tune LLaMA 65B Large Models Using Docker and Alpaca LoRA

In this article, let’s talk about how to fine-tune the LLaMA 65B large model with two graphics cards, and how to fine-tune the 7B model in just a few hours on an ordinary 4090 home graphics card.

Foreword

In previous articles, we covered three ways to run the 7B and 13B versions of Meta’s open-source LLaMA model:

  • “Model Talk: Using INT8 Quantized Inference to Run Meta’s “Open-Source Leaked” Large Model (LLaMA)”
  • “Model Talk: Quickly Getting Started with the Metaverse Giant Meta’s “Open-Source Leaked” Large Model (LLaMA)”

However, from those earlier attempts it is not hard to see that, without further fine-tuning on our own (admittedly limited) data, the model’s results are not particularly good, especially for the 7B model with its relatively small parameter count. It also made us all the more curious about the 65B model.

Of course, it used to be very difficult to “improve” a model’s capabilities (training, fine-tuning) on a graphics card with very limited resources. However, with the arrival of several projects, this has become much easier:

First of all, two weeks ago, some clever students at Stanford released their “Stanford Alpaca” project, tatsu-lab/stanford_alpaca. They used OpenAI’s API to generate 52,000 instruction data entries, and then, with a single server equipped with four 80GB A100 GPUs, completed the fine-tuning of the 7B LLaMA model. The results were impressive, with evaluations comparable to text-davinci-003. This verified that, with a small amount of data, a single server can fine-tune a large language model and achieve good results, which greatly inspired the community.

Then, another community member, tloen, stepped up and used the LoRA (Low-Rank Adaptation) method to do something even more exciting: the compute the Stanford team needed for fine-tuning dropped from four 80GB A100s to a single 4090 graphics card, and the fine-tuning job could be finished in about 5 hours. It is even possible to run the resulting model on a Raspberry Pi!

Of course, this was made possible not only by tloen’s willingness to experiment, but also by the Hugging Face community’s open-source PEFT project and the TimDettmers/bitsandbytes CUDA 8-bit model quantization project. That said, these community projects still have some rough edges, such as missing multi-GPU support and incompatibility with newer CUDA environments.
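To make the combination concrete, here is a minimal sketch of my own (not code from any of the projects above) showing how PEFT and bitsandbytes fit together: the base model is loaded in 8-bit, then small LoRA adapters are attached, so only a tiny fraction of the weights needs training. The hyperparameters shown (r=8, alpha=16, q_proj/v_proj) roughly mirror what tloen’s script uses at the time of writing; verify them against the repository.

# Minimal sketch: 8-bit base model (bitsandbytes) + LoRA adapters (PEFT).
# Paths and hyperparameters are illustrative; check tloen/alpaca-lora for
# the exact values it uses.
from transformers import LlamaForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

# Load the converted HF-format LLaMA weights in 8-bit so they fit in 24GB VRAM.
model = LlamaForCausalLM.from_pretrained(
    "weights",            # directory produced by the conversion step below
    load_in_8bit=True,
    device_map="auto",
)
model = prepare_model_for_int8_training(model)

# Attach low-rank adapters to the attention projections; only these are trained.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints a tiny trainable-parameter ratio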

Well, now you know which “heroes of the open-source community” we have to thank.

Now, let’s start with fine-tuning the 7B model. Once we have mastered the 7B, we can easily move on to the largest 65B model.

To make it easy to use and to verify the results, the scheme used in this article has also been merged into the “LLaMA Playground” open-source project mentioned earlier. Project address: soulteary/llama-docker-playground

Multiple ways to play with the LLaMA large model

I covered downloading the model files and verifying their integrity in the first article, so I won’t repeat that here. Likewise, I won’t expand on the official inference scheme or the community-provided pyllama inference scheme mentioned above; if you are interested, you can read the earlier articles yourself.

Using the LLaMA Docker Playground project

As before, find a suitable directory and download the “LLaMA Playground” project code locally, either with git clone or by grabbing the Zip archive:

git clone https://github.com/soulteary/llama-docker-playground.git

# or

curl -sL -o llama.zip https://github.com/soulteary/llama-docker-playground/archive/refs/heads/main.zip

Then, enter the project directory and use Nvidia’s original PyTorch Docker base image to build the basic environment. Compared with pulling a ready-made image directly from DockerHub, building it yourself saves a lot of time.

We can build a Docker environment suitable for large-model fine-tuning by executing the following command in the project directory:

docker build -t soulteary/llama:alpaca-lora-finetune . -f docker/Dockerfile.lora-finetune

Wait a while; once the image is built, you can start playing.

Fine-tune the LLaMA 7B large model

Fine-tuning LLaMA on a single graphics card takes four steps.

Prepare model files

For convenient fine-tuning, make sure your model directory matches the layout below:

├── 7B
│ ├── checklist.chk
│ ├── consolidated.00.pth
│ └── params.json
├── tokenizer.model
└── tokenizer_checklist.chk

Prepare the container environment

In the earlier article “Docker-based Deep Learning Environment: Getting Started”, we covered how to configure Docker to work with the graphics card, so I won’t repeat it here. You can run a single command to create a “clean and tidy” container environment for large-model fine-tuning:

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    --rm -it \
    -v /home/soulteary/project/llama-docker-playground/models:/app/alpaca-lora/original-weights \
    -v `pwd`/weights:/app/alpaca-lora/weights \
    soulteary/llama:alpaca-lora-finetune bash

In the above command, we mount the original model files to the container’s /app/alpaca-lora/original-weights directory, which will be used shortly. We also mount the weights folder in the project’s current directory to /app/alpaca-lora/weights in the container, to store the HF-format model we are about to produce.

Convert model format

Then, execute the following command in the container to convert Meta’s LLaMA 7B model into the format we need (Hugging Face format):

python -m transformers.models.llama.convert_llama_weights_to_hf \
  --input_dir original-weights \
  --model_size 7B \
  --output_dir weights

The conversion will not take long (about 6 seconds here); just wait a moment:

# python -m transformers.models.llama.convert_llama_weights_to_hf \
# > --input_dir original-weights \
# > --model_size 7B \
# > --output_dir weights

Fetching all parameters from the checkpoint at original-weights/7B.
Loading the checkpoint in a Llama model.
Loading checkpoint shards: 100%|██████████| 33/33 [00:06<00:00, 5.40it/s]
Saving in the Transformers format.
Fetching the tokenizer from original-weights/tokenizer.model.

Then check the weights directory, and you can see that the new model files are ready:

# du -hs weights/*

4.0K weights/config.json
4.0K weights/generation_config.json
9.3G weights/pytorch_model-00001-of-00002.bin
3.3G weights/pytorch_model-00002-of-00002.bin
28K weights/pytorch_model.bin.index.json
4.0K weights/special_tokens_map.json
492K weights/tokenizer.model
4.0K weights/tokenizer_config.json
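Before kicking off training, you can optionally sanity-check that the converted weights and tokenizer load correctly. This is my own snippet rather than part of the project; the class names are the LLaMA classes in recent transformers, and a 7B model in fp16 fits on a 24GB card:

# Optional quick check: confirm the converted weights load with transformers
# before spending hours on fine-tuning.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("weights")
model = LlamaForCausalLM.from_pretrained(
    "weights", torch_dtype=torch.float16, device_map="auto"
)
prompt = "The first step of fine-tuning a model is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))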

Run the model fine-tuning program

Then, execute the finetune.py program for model fine-tuning:

python finetune.py

Once the command starts running, you will see log output similar to the following:

# python finetune.py

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.9
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Overriding torch_dtype=None with `torch_dtype=torch.float16` due to requirements of `bitsandbytes` to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.
Loading checkpoint shards: 100%|██████████| 2/2 [00:06<00:00, 3.03s/it]
Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-8d30498d25a7aa2b/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f...
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 15709.00it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 2291.97it/s]
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-8d30498d25a7aa2b/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.
100%|██████████| 1/1 [00:00<00:00, 101.94it/s]
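The log above shows the script preparing its dataset, the 52K Alpaca instruction data. For reference, here is roughly what one record looks like and how it becomes a training prompt; this is a hedged sketch with the template wording paraphrased from the Stanford Alpaca project, so check finetune.py for the exact strings it uses.

# One record from alpaca_data.json (the first entry, output truncated here),
# and a rough reconstruction of how the script turns it into a prompt.
record = {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet ...",
}

def generate_prompt(example):
    """Build an instruction-following prompt (wording paraphrased)."""
    if example["input"]:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )

print(generate_prompt(record))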

This will be a long process, about the length of three “Conan” movies, so just wait patiently~

During fine-tuning, if we check the graphics card with nvidia-smi, we can see that only a little over 8GB of video memory is actually in use:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05    Driver Version: 525.85.05    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  Off |
| 31%   53C    P2   336W / 450W |   8563MiB / 24564MiB |     90%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1290      G   /usr/lib/xorg/Xorg                  9MiB |
|    0   N/A  N/A      1510      G   /usr/bin/gnome-shell               10MiB |
|    0   N/A  N/A     24135      C   python                           8538MiB |
+-----------------------------------------------------------------------------+

Well, now that we have the basics of fine-tuning down, let’s look at how to fine-tune with multiple graphics cards, and fine-tune the 65B LLaMA large model.

Fine-tune the LLaMA 65B large model

Fine-tuning the 65B model also takes four steps.

Prepare model files

If you still want to train the 7B model, but would like to speed things up with multiple cards, just reuse the model directory from the section above. If you want to train the largest 65B model, which is far bigger than the 7B version, we have two options:

  1. Download and convert the 65B model file format yourself.
  2. Directly download the converted model in the community, such as: decapoda-research/llama-65b-hf.

If you choose the first option, as in the 7B flow above, place the downloaded original model files in a suitable location:

# ls llama/*

llama/download.sh llama/tokenizer.model llama/tokenizer_checklist.chk

llama/65B:
checklist.chk consolidated.00.pth consolidated.01.pth consolidated.02.pth consolidated.03.pth consolidated.04.pth consolidated.05.pth consolidated.06.pth consolidated.07.pth params.json

If you choose to download the already-converted model from the HF community, do the following: first visit the git-lfs project and install the tool for your operating system, then use git to download the model files:

git clone https://huggingface.co/decapoda-research/llama-65b-hf

Prepare the container environment

If we want to use multiple graphics cards, we need to execute the following command to build a new container image:

docker build -t soulteary/llama:alpaca-lora-65b-finetune . -f docker/Dockerfile.lora-65b-finetune

Then, we run the following command to enter the container environment that enables multi-card fine-tuning:

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    --rm -it \
    -v /home/soulteary/project/llama-docker-playground/models:/app/alpaca-lora/original-weights \
    -v `pwd`/weights:/app/alpaca-lora/weights \
    soulteary/llama:alpaca-lora-65b-finetune bash

Convert model format

If you choose to download the model from the HF community, you can skip reading this section.

If you chose to convert the model yourself, then after entering the container, execute the following command to convert it:

python -m transformers.models.llama.convert_llama_weights_to_hf \
   --input_dir original-weights \
   --model_size 65B \
   --output_dir weights

Because the model is huge, the format conversion takes a long time; it took me nearly an hour to finish:

# python -m transformers.models.llama.convert_llama_weights_to_hf \
#> --input_dir original-weights \
#> --model_size 65B \
#> --output_dir weights

Fetching all parameters from the checkpoint at original-weights/65B.
Loading the checkpoint in a Llama model.
Loading checkpoint shards: 100%|██████████| 81/81 [00:57<00:00, 1.40it/s]
Saving in the Transformers format.
Fetching the tokenizer from original-weights/tokenizer.model.

Checking the directory, we can see that we now have 14 model shard files of roughly equal size:

du -hs weights/
123G weights/

du -hs weights/*
48K weights/config.json
48K weights/generation_config.json
11G weights/pytorch_model-00001-of-00014.bin
11G weights/pytorch_model-00002-of-00014.bin
11G weights/pytorch_model-00003-of-00014.bin
11G weights/pytorch_model-00004-of-00014.bin
11G weights/pytorch_model-00005-of-00014.bin
11G weights/pytorch_model-00006-of-00014.bin
11G weights/pytorch_model-00007-of-00014.bin
11G weights/pytorch_model-00008-of-00014.bin
11G weights/pytorch_model-00009-of-00014.bin
11G weights/pytorch_model-00010-of-00014.bin
11G weights/pytorch_model-00011-of-00014.bin
11G weights/pytorch_model-00012-of-00014.bin

Run the model fine-tuning program

As above, if you are an A100 user with at least two cards, you can directly run the following program to start your 65B model fine-tuning journey:

python finetune.py

With the default parameters, the run takes about 44 hours; if we double MICRO_BATCH_SIZE to 8, the time drops to about 33 hours. But parameter choices need to be verified by real tests against your own setup; see the sketch after the timings below.

# Adjusted parameters
[01:42<33:11:29, 102.22s/it]
# Default parameters
[04:33<44:16:54, 136.49s/it]
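For reference, the batch-size knobs live near the top of tloen’s finetune.py as module-level constants; the names and defaults below are what I saw at the time of writing, so double-check your copy of the script before editing:

# Batch-size knobs in finetune.py (names/defaults as of this writing).
MICRO_BATCH_SIZE = 8   # per-step batch on each GPU; the default is 4
BATCH_SIZE = 128       # effective batch size stays unchanged
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE  # 128 // 8 = 16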

Well, that covers how to fine-tune the 7B and 65B large models easily, happily, and at very low cost, using either a single graphics card or several.

Other

Next, I will talk about some details of this journey.

Nvidia base image selection

In this article, we did not use the latest CUDA & PyTorch image as in the previous article “Docker-based Deep Learning Environment: Getting Started”, but instead chose FROM nvcr.io/nvidia/pytorch:22.12-py3. The complete Dockerfile is as follows:

FROM nvcr.io/nvidia/pytorch:22.12-py3

RUN pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

WORKDIR /app

RUN git clone https://github.com/tloen/alpaca-lora.git

WORKDIR /app/alpaca-lora

RUN pip install datasets loralib sentencepiece git+https://github.com/huggingface/transformers.git bitsandbytes git+https://github.com/huggingface/peft.git

The reason for this choice is that the project’s core dependency, bitsandbytes, currently cannot correctly detect CUDA 12, which prevents the program from starting. A temporary patch can get it running, but I see the project has been updated recently and there is a lot of related feedback in the community, so it seems more reliable to use an environment that has been proven by long-term use.

In the PyTorch image release notes of the Nvidia community, we can find that the latest image version on which the software runs normally ships with CUDA 11.8.0, PyTorch 1.14.0a0+410ce96, and TensorRT 8.5.1.
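If you want to confirm what your container actually ships with, a quick check from the Python REPL inside it is enough; the expected values in the comments are taken from Nvidia’s release notes mentioned above:

# Verify the versions inside the nvcr.io/nvidia/pytorch:22.12-py3 container.
import torch

print(torch.__version__)         # expected: 1.14.0a0+410ce96
print(torch.version.cuda)        # expected: 11.8
print(torch.cuda.is_available()) # True if the GPU is correctly passed through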

Why the original Alpaca LoRA can’t run with multiple cards

In fact, this is better regarded as an issue in the transformers library. The first person to discover it was kooshi in the community. In addition to enabling parallelism by turning on the model’s parallel setting, he also found that the HF community’s transformers did not use all the graphics cards when running LoRA. He has already submitted a first targeted patch and is working on fully solving multi-card LoRA training.

If you have eight cards and only want to use two of them

There is an interesting problem here, again mainly because of transformers: if you have multiple cards and set device_map to something other than auto, but your available graphics cards do not start from index 0, you will run into all sorts of errors.

A relatively simple solution that does not require changing any code is to use Docker’s --gpus parameter to hide the graphics cards that should not be exposed to the application. For example, if I need to skip the first four cards:

docker run -it --rm --gpus '"device=4,5,6,7"' ubuntu nvidia-smi

Then, after executing the command, you will find that the previously occupied cards have “disappeared”, the program renumbers the cards you specified as cuda:0 through cuda:3, and the problem goes away.
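You can confirm the renumbering from inside such a container with a few lines of PyTorch; this is my own check, nothing project-specific:

# With --gpus '"device=4,5,6,7"', PyTorch sees only four cards, re-indexed from 0.
import torch

print(torch.cuda.device_count())  # 4
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))  # indices 0-3 map to physical cards 4-7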

Finally

That’s all there is to fine-tuning models with low-cost graphics card resources.

In the next related article, we will talk about several other ways to run the model, and find opportunities to cover other models as well. Of course, if the fine-tuning goes well, I will continue to update the development notes of “Jarvis”.

–EOF


This article is licensed under the “Attribution 4.0 International (CC BY 4.0)” agreement. You are welcome to reprint or reuse it, but please credit the source. Attribution 4.0 International (CC BY 4.0)

Author of this article: Su Yang

Created: March 25, 2023
Word count: 12,339 words
Reading time: about 25 minutes
Link to this article: https://soulteary.com/2023/03/25/model-finetuning-on-llama-65b-large-model-using-docker-and-alpaca-lora.html
