This experiment runs LoRA fine-tuning of GLM2 using the official MindFormers kit. To do this, first prepare the following resources and environment:
MindFormers official kit: This includes the code, models, and tools needed for LoRA fine-tuning of GLM2. It is available from the official MindFormers website.
Hardware resources: Running LoRA fine-tuning of GLM2 requires the following hardware resources:
- Atlas800-9000 training server
- Regional computing centers (Ascend)
- Servers with Ascend training cards
Software environment: In addition to hardware resources, prepare the following software environment:
- Full model download address: https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/XFormer_for_mindspore/glm2/glm2_6b.ckpt
- Tokenizer: https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/XFormer_for_mindspore/glm2/tokenizer.model
- Dataset: https://cloud.tsinghua.edu.cn/seafhttp/files/e5c9a5f1-54fb-4d7c-99ce-88f82ce25d33/AdvertiseGen.tar.gz
- BMS (AICC) image: docker pull swr.cn-central-221.ovaijisuan.com/wuh-aicc_dxy/baichuan_mindformers:mindformers1.0.0dev-mindspore2.0.0-cann6.3rc1-py_3.9-euler_2.8
Note: The underlying driver and firmware version must be C84 or later. (As of 2023-10-31, most computing centers still run C81 drivers.)
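The checkpoint, tokenizer, and dataset listed above can be fetched with a short script. The following is a minimal sketch using only the Python standard library. The URLs are copied from the list above, while the function names and the target directory are hypothetical choices made for illustration.

```python
import urllib.request
from pathlib import Path

# URLs copied from the resource list above; local file names are an assumption.
RESOURCES = {
    "glm2_6b.ckpt": "https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/XFormer_for_mindspore/glm2/glm2_6b.ckpt",
    "tokenizer.model": "https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/XFormer_for_mindspore/glm2/tokenizer.model",
    "AdvertiseGen.tar.gz": "https://cloud.tsinghua.edu.cn/seafhttp/files/e5c9a5f1-54fb-4d7c-99ce-88f82ce25d33/AdvertiseGen.tar.gz",
}


def plan_downloads(target_dir):
    """Return (url, destination) pairs for resources not yet present in target_dir."""
    target = Path(target_dir)
    return [
        (url, target / name)
        for name, url in RESOURCES.items()
        if not (target / name).exists()
    ]


def download_all(target_dir):
    """Download every missing resource into target_dir (needs network access)."""
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    for url, dest in plan_downloads(target_dir):
        print(f"Downloading {url} -> {dest}")
        urllib.request.urlretrieve(url, str(dest))


# Example call with a hypothetical directory:
# download_all("/work/glm2")
```

Files that already exist are skipped, so the script can be rerun after an interrupted download of one of the smaller files. A partially downloaded file would need to be deleted first.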
# --device controls which NPU cards (number and range) are visible to the container
# -v maps directories from outside the container
# --name sets a custom container name
docker run -itd --ipc=host \
    --network host \
    --entrypoint=/bin/bash \
    --device=/dev/davinci0 \
    --device=/dev/davinci1 \
    --device=/dev/davinci2 \
    --device=/dev/davinci3 \
    --device=/dev/davinci4 \
    --device=/dev/davinci5 \
    --device=/dev/davinci6 \
    --device=/dev/davinci7 \
    --device=/dev/davinci_manager \
    --device=/dev/devmm_svm \
    --device=/dev/hisi_hdc \
    -v /work:/work \
    -v /etc/localtime:/etc/localtime \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
    -v /usr/local/sbin/:/usr/local/sbin/ \
    -v /usr/bin/hccn_tool:/usr/bin/hccn_tool \
    -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
    -v /var/log/npu/conf/slog/slog.conf:/var/log/npu/conf/slog/slog.conf \
    -v /var/log/npu/slog/:/var/log/npu/slog \
    -v /var/log/npu/profiling/:/var/log/npu/profiling \
    -v /var/log/npu/dump/:/var/log/npu/dump \
    -v /var/log/npu/:/usr/slog \
    IMAGE_ID  # image id
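When only a subset of the cards should be exposed to the container, the --device flags can be generated rather than typed by hand. The sketch below is illustrative only: the helper name is hypothetical, and it assumes the device node names follow the /dev/davinciN pattern used in the command above.

```python
def davinci_device_flags(first_card, last_card):
    """Return docker --device flags for NPU cards first_card..last_card (inclusive)."""
    if first_card < 0 or last_card < first_card:
        raise ValueError("expected 0 <= first_card <= last_card")
    flags = [f"--device=/dev/davinci{i}" for i in range(first_card, last_card + 1)]
    # Management devices that the command above passes alongside the cards.
    flags += [
        "--device=/dev/davinci_manager",
        "--device=/dev/devmm_svm",
        "--device=/dev/hisi_hdc",
    ]
    return flags


# Print the flags for cards 0 to 3, one per line, ready to paste into docker run.
print(" \\\n".join(davinci_device_flags(0, 3)))
```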
Use docker exec -it -u 0 CONTAINER_ID /bin/bash to enter the container.
LoRA fine-tuning
Full-parameter fine-tuning can achieve good results on the fine-tuning dataset, but the model tends to forget knowledge learned during pre-training.
Therefore, a low-parameter fine-tuning algorithm is recommended: it freezes the original model weights and trains only a small number of parameters. This achieves good results on the fine-tuning dataset while mitigating forgetting.
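To give a sense of how few parameters are trained, consider the standard LoRA formulation: a frozen weight matrix of shape d_out x d_in is augmented by the product of two small trainable matrices of shapes d_out x r and r x d_in. The sketch below counts parameters for one such matrix. The dimensions and rank are illustrative values, not settings read from the MindFormers configuration.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # B has shape (d_out, rank) and A has shape (rank, d_in).
    return rank * (d_out + d_in)


# Illustrative sizes only (assumed, not taken from run_glm2_6b_lora.yaml).
d_out, d_in, rank = 4096, 4096, 8

frozen = d_out * d_in
trainable = lora_trainable_params(d_out, d_in, rank)
print(f"frozen: {frozen}, trainable: {trainable}, ratio: {trainable / frozen:.4%}")
```

With these numbers the trainable parameters amount to 1/256 of the frozen matrix, which is why the original weights can stay untouched while the adapter is trained.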
When using the LoRA algorithm for low-parameter fine-tuning, use the configs/glm2/run_glm2_6b_lora.yaml configuration file, which contains the configuration items required by the LoRA algorithm.
Modify the dataset and model weight paths in the configuration:
- Dataset: Set dataset_dir of train_dataset in mindformers/configs/glm2/run_glm2_6b_lora.yaml to the path of the dataset generated previously.
- Load pre-trained model weights: Set load_checkpoint in mindformers/configs/glm2/run_glm2_6b_lora.yaml to the path of the pre-trained model weights.
As shown in the picture, every path parameter, such as load_checkpoint, dataset_dir, and vocab_file, needs to be modified to match your environment.
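The path changes described above can also be applied programmatically once the YAML file has been loaded into a nested dictionary. The following is a sketch under stated assumptions: the helper is hypothetical, the nesting of dataset_dir under data_loader and of vocab_file under processor.tokenizer is assumed rather than confirmed here, and every path is a placeholder. Loading and saving the YAML itself is left to a YAML library and is not shown.

```python
def set_nested(config, dotted_key, value):
    """Set an existing key such as 'a.b.c' in a nested dict; fail on unknown keys."""
    keys = dotted_key.split(".")
    node = config
    for key in keys[:-1]:
        if not isinstance(node.get(key), dict):
            raise KeyError(f"missing section '{key}' in '{dotted_key}'")
        node = node[key]
    if keys[-1] not in node:
        raise KeyError(f"unknown key '{dotted_key}'")
    node[keys[-1]] = value


# Toy stand-in for the loaded configuration (structure is an assumption).
config = {
    "load_checkpoint": "",
    "train_dataset": {"data_loader": {"dataset_dir": ""}},
    "processor": {"tokenizer": {"vocab_file": ""}},
}

# Placeholder paths; replace them with the real locations.
updates = {
    "load_checkpoint": "/work/glm2/glm2_6b.ckpt",
    "train_dataset.data_loader.dataset_dir": "/work/glm2/AdvertiseGen/train.json",
    "processor.tokenizer.vocab_file": "/work/glm2/tokenizer.model",
}
for dotted_key, value in updates.items():
    set_nested(config, dotted_key, value)

print(config)
```

Raising on unknown keys is deliberate: a mistyped key would otherwise be added silently and the original path would stay unchanged.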
Single card fine-tuning
First, on BMS, export the environment variables RANK_ID and RANK_TABLE_FILE so that NPU resources can be used for fine-tuning or training:
export RANK_ID=0
export RANK_TABLE_FILE=xxx/hccl_8p.json
# hccl_8p.json is the file generated by: python ./mindformers/tools/hccl_tools.py --device_num "[0,8)"
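Before launching, it can help to confirm that the rank table actually lists the expected devices. The sketch below reads the JSON file and sets the two variables for the current Python process and any child processes it starts. The field names server_list, device, and device_id are an assumption about the rank table layout, and the helper names are hypothetical.

```python
import json
import os


def rank_table_device_ids(rank_table):
    """Collect device ids from a parsed rank table (field names are assumed)."""
    return [
        device["device_id"]
        for server in rank_table.get("server_list", [])
        for device in server.get("device", [])
    ]


def export_rank_env(rank_table_path, rank_id=0):
    """Validate rank_id against the rank table, then set RANK_ID and RANK_TABLE_FILE."""
    with open(rank_table_path, encoding="utf-8") as f:
        device_ids = rank_table_device_ids(json.load(f))
    if not 0 <= rank_id < len(device_ids):
        raise ValueError(f"rank_id {rank_id} not covered by {len(device_ids)} devices")
    # Only affects this process and its children, unlike `export` in a login shell.
    os.environ["RANK_ID"] = str(rank_id)
    os.environ["RANK_TABLE_FILE"] = os.path.abspath(rank_table_path)
    return device_ids


# Example call with a placeholder path:
# export_rank_env("xxx/hccl_8p.json", rank_id=0)
```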
Modify the configs/glm2/run_glm2_6b_lora.yaml configuration file and set use_parallel to False.
cd scripts
# Usage: bash run_standalone.sh [CONFIG_PATH] [DEVICE_ID] [RUN_STATUS]
bash run_standalone.sh ../configs/glm2/run_glm2_6b_lora.yaml 0 finetune
Training log path: mindformers/scripts/mf_standalone/
Checkpoint storage path: mindformers/scripts/mf_standalone/output/checkpoint
Or run the following command:
python run_mindformer.py --config configs/glm2/run_glm2_6b_lora.yaml --run_mode finetune --device_id 0
The following is the generated model file:
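To locate the generated checkpoint files without browsing the directory by hand, a small helper can list them newest first. This is a sketch using only the standard library; the helper name is hypothetical, and it assumes the checkpoints use the .ckpt extension, as the pre-trained weights above do.

```python
from pathlib import Path


def find_checkpoints(checkpoint_dir):
    """Return all .ckpt files under checkpoint_dir (recursively), newest first."""
    root = Path(checkpoint_dir)
    return sorted(
        root.rglob("*.ckpt"),
        key=lambda path: path.stat().st_mtime,
        reverse=True,
    )


# Example call using the checkpoint storage path given above:
# for ckpt in find_checkpoints("mindformers/scripts/mf_standalone/output/checkpoint"):
#     print(ckpt)
```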
Single-machine multi-card fine-tuning
First, modify the configs/glm2/run_glm2_6b_lora.yaml configuration file and set use_parallel to True. At the same time, modify the following configuration:
use_parallel: True
parallel:
  parallel_mode: 1  # 0-dataset, 1-semi, 2-auto, 3-hybrid
  gradients_mean: False
  loss_repeated_mean: True
  enable_alltoall: False
  full_batch: True
  search_mode: "sharding_propagation"
  enable_parallel_optimizer: True  # optimizer shard
  strategy_ckpt_config:
    save_file: "./ckpt_strategy.ckpt"
    only_trainable_params: False  # set to False to save all parameters in the strategy file
parallel_config:
  data_parallel: 8  # eight devices (NPU cards); adjust the related parallel settings below if they do not match
  model_parallel: 1
  pipeline_stage: 1
  expert_parallel: 1
  micro_batch_num: 1
  vocab_emb_dp: True
  gradient_aggregation_group: 4
micro_batch_interleave_num: 1
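A common source of launch failures is a parallel strategy that does not match the number of cards. The sketch below checks the usual constraint for this kind of configuration, namely that data_parallel x model_parallel x pipeline_stage equals the device count. The helper name is hypothetical, and the check is offered as a sanity test rather than a description of what MindFormers validates internally.

```python
def check_parallel_config(device_num, data_parallel, model_parallel, pipeline_stage):
    """Raise ValueError unless the parallel dimensions multiply to device_num."""
    product = data_parallel * model_parallel * pipeline_stage
    if product != device_num:
        raise ValueError(
            f"data_parallel * model_parallel * pipeline_stage = {product}, "
            f"but {device_num} devices are in use"
        )
    return product


# Values from the configuration above: 8 cards, pure data parallelism.
check_parallel_config(device_num=8, data_parallel=8, model_parallel=1, pipeline_stage=1)
```

For example, running on four cards with the settings above unchanged would fail this check, which signals that data_parallel needs to be lowered to 4.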
On the machine, go to the scripts directory under mindformers and run the following command to start multi-card fine-tuning:
bash run_distribute.sh /work/mindformers/mindformers/hccl_8p.json ../configs/glm2/run_glm2_6b_lora.yaml [0,8] finetune 8
Here, /work/mindformers/mindformers/hccl_8p.json is the RANK_TABLE_FILE, that is, the overall rank table file prepared in the previous step.
Model inference
Quick inference based on Pipeline
import mindspore
from mindformers import AutoConfig, AutoModel, AutoTokenizer

# Specify graph mode and the id of the card to use
mindspore.set_context(mode=0, device_id=0)

tokenizer = AutoTokenizer.from_pretrained('glm2_6b')

# There are two ways to instantiate the model; choose one of them.
# 1. Instantiate directly from the default configuration
model = AutoModel.from_pretrained('glm2_6b')
# 2. Instantiate after customizing the configuration
config = AutoConfig.from_pretrained('glm2_6b')
config.use_past = True  # Modify the default configuration here; incremental inference speeds up inference
# config.xxx = xxx  # Customize other model configuration items as needed
model = AutoModel.from_config(config)  # Instantiate the model from the custom configuration

inputs = tokenizer("Hello")["input_ids"]
# The first call to model.generate() includes graph compilation time, so its measured
# performance is not representative. Repeat the call several times for accurate numbers.
outputs = model.generate(inputs, max_new_tokens=20, do_sample=True, top_k=3)
response = tokenizer.decode(outputs)
print(response)
# ['Hello, as an artificial intelligence assistant, I welcome you to ask me questions at any time. ']
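Because the first generate call includes graph compilation, a fair latency measurement should discard it. The helper below is a generic sketch, not part of MindFormers: it runs a callable a few times as warm-up, then averages the wall-clock time of the remaining calls. The function name and default counts are arbitrary choices.

```python
import time


def average_latency(fn, warmup=1, repeats=3):
    """Call fn warmup times untimed, then return the mean seconds over repeats calls."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    for _ in range(warmup):
        fn()  # warm-up calls absorb one-off costs such as graph compilation
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


# Hypothetical usage with the model and inputs from the example above:
# seconds = average_latency(
#     lambda: model.generate(inputs, max_new_tokens=20, do_sample=True, top_k=3)
# )
# print(f"average generate() latency: {seconds:.3f} s")
```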