GLM2 LoRA fine-tuning based on MindFormers

This experiment runs LoRA fine-tuning of GLM2 by following the official MindFormers package. First, prepare the following resources and environment:
MindFormers official kit: This includes the code, models, and tools needed for LoRA fine-tuning of GLM2. It is available from the official MindFormers website.

Hardware resources: Running LoRA fine-tuning of GLM2 requires one of the following hardware resources:

Atlas800-9000 training server
Ascend (Shengteng) computing centers in various locations
Servers with Ascend training cards

Software environment: In addition to the hardware, prepare the following model weights, tokenizer, data set, and container image:

Full model download address: https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/XFormer_for_mindspore/glm2/glm2_6b.ckpt
tokenizer: https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/XFormer_for_mindspore/glm2/tokenizer.model
Data set: https://cloud.tsinghua.edu.cn/seafhttp/files/e5c9a5f1-54fb-4d7c-99ce-88f82ce25d33/AdvertiseGen.tar.gz
BMS (AICC) image: docker pull swr.cn-central-221.ovaijisuan.com/wuh-aicc_dxy/baichuan_mindformers:mindformers1.0.0dev-mindspore2.0.0-cann6.3rc1-py_3.9-euler_2.8
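
The file downloads above can be scripted. The following is a minimal sketch that uses only the Python standard library. The helper names and the target directory mentioned in the note after the block are hypothetical and chosen only for illustration. The container image is still fetched with docker pull as shown above.

```python
import os
import urllib.request
from urllib.parse import urlparse

# URLs copied from the list above.
RESOURCES = {
    "checkpoint": "https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/XFormer_for_mindspore/glm2/glm2_6b.ckpt",
    "tokenizer": "https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/XFormer_for_mindspore/glm2/tokenizer.model",
    "dataset": "https://cloud.tsinghua.edu.cn/seafhttp/files/e5c9a5f1-54fb-4d7c-99ce-88f82ce25d33/AdvertiseGen.tar.gz",
}


def local_name(url):
    """Return the file name component of a download URL."""
    return os.path.basename(urlparse(url).path)


def download_all(target_dir):
    """Download every resource into target_dir, skipping files that already exist."""
    os.makedirs(target_dir, exist_ok=True)
    for label, url in RESOURCES.items():
        dest = os.path.join(target_dir, local_name(url))
        if os.path.exists(dest):
            print(f"{label}: already present at {dest}")
            continue
        print(f"{label}: downloading {url}")
        urllib.request.urlretrieve(url, dest)
```

Calling download_all("/work/glm2") (a hypothetical directory) would place glm2_6b.ckpt, tokenizer.model, and AdvertiseGen.tar.gz side by side; the archive still has to be extracted afterwards.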

Note: The underlying driver and firmware version must be C84 or later. (As of 2023-10-31, most computing centers in various locations still run C81 drivers.)

# --device selects which NPU cards (by device number) are visible inside the container
# -v maps a host directory or file into the container
# --name sets a custom container name
docker run -itd --ipc=host \
--network host \
--entrypoint=/bin/bash \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /work:/work \
-v /etc/localtime:/etc/localtime \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
-v /usr/local/sbin/:/usr/local/sbin/ \
-v /usr/bin/hccn_tool:/usr/bin/hccn_tool \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /var/log/npu/conf/slog/slog.conf:/var/log/npu/conf/slog/slog.conf \
-v /var/log/npu/slog/:/var/log/npu/slog \
-v /var/log/npu/profiling/:/var/log/npu/profiling \
-v /var/log/npu/dump/:/var/log/npu/dump \
-v /var/log/npu/:/usr/slog \
IMAGE_ID #Image id

Use docker exec -it -u 0 <container ID> /bin/bash to enter the container as root.
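
When only some of the NPU cards should be visible in the container, the long list of --device flags can be generated rather than typed by hand. The sketch below is one way to do that in Python. The function name, the container name glm2_lora, and the shortened mount list are hypothetical; a real invocation would need the full set of -v mounts shown above.

```python
import shlex


def build_docker_run(image, device_ids, name=None, mounts=()):
    """Build a docker run argument list that exposes the given NPU cards."""
    cmd = ["docker", "run", "-itd", "--ipc=host", "--network", "host",
           "--entrypoint=/bin/bash"]
    if name:
        cmd += ["--name", name]
    # One --device flag per NPU card.
    cmd += [f"--device=/dev/davinci{i}" for i in device_ids]
    # Management devices listed in the command above.
    cmd += ["--device=/dev/davinci_manager",
            "--device=/dev/devmm_svm",
            "--device=/dev/hisi_hdc"]
    for host_path, container_path in mounts:
        cmd += ["-v", f"{host_path}:{container_path}"]
    cmd.append(image)
    return cmd


# Expose cards 0-3 only; the mount list is abbreviated for illustration.
cmd = build_docker_run(
    "IMAGE_ID",
    range(4),
    name="glm2_lora",
    mounts=[("/work", "/work"),
            ("/usr/local/Ascend/driver", "/usr/local/Ascend/driver")],
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.run directly.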

LoRA fine-tuning

Full-parameter fine-tuning can achieve good results on the fine-tuning data set, but the model tends to forget knowledge learned during pre-training.
It is therefore recommended to use a low-parameter (parameter-efficient) fine-tuning algorithm that freezes the original model weights and trains only a small number of parameters. This achieves good results on the fine-tuning data set while alleviating forgetting.
To use the LoRA algorithm for low-parameter fine-tuning, use the configs/glm2/run_glm2_6b_lora.yaml configuration file, which contains the configuration items required by LoRA.
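
To make the idea concrete, the sketch below illustrates the low-rank update behind LoRA in plain Python. A frozen weight matrix W of shape (d_out, d_in) is augmented by the product of two small trainable matrices, B (d_out x r) and A (r x d_in), scaled by alpha / r. This illustrates the general technique and is not the MindFormers implementation; the rank of 8 and the 4096 x 4096 layer size are assumed values used only for the arithmetic.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Number of trainable parameters LoRA adds to one d_out x d_in layer."""
    return rank * (d_in + d_out)


def merged_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A) for matrices given as nested lists."""
    rank = len(a)
    scale = alpha / rank
    return [
        [
            w[i][j] + scale * sum(b[i][k] * a[k][j] for k in range(rank))
            for j in range(len(w[0]))
        ]
        for i in range(len(w))
    ]


# Assumed sizes, for illustration only.
d_in = d_out = 4096
rank = 8
full = d_in * d_out
lora = lora_trainable_params(d_in, d_out, rank)
print(f"full layer: {full} params, LoRA adapter: {lora} params "
      f"({100 * lora / full:.2f}% of the layer)")

# B starts at zero, so the adapted layer is initially identical to the frozen one.
w = [[1.0, 2.0], [3.0, 4.0]]
a = [[0.5, -0.5]]   # r x d_in, with r = 1
b = [[0.0], [0.0]]  # d_out x r, all zeros
assert merged_weight(w, a, b, alpha=1.0) == w
```

Because only A and B receive gradients, the frozen weights keep their pre-trained values, which is what limits forgetting.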

Modify the data set and model weight paths in the configuration:

Dataset: In mindformers/configs/glm2/run_glm2_6b_lora.yaml, set dataset_dir under train_dataset to the path of the data set prepared earlier.
Load pre-trained model weights: In the same file, set load_checkpoint to the path of the pre-trained model weights.
All path parameters need to be updated, including load_checkpoint, dataset_dir, and vocab_file.
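
A mistyped path usually surfaces only after the job has started, so it can help to check the configured paths first. The sketch below walks a configuration dictionary and reports entries named load_checkpoint, dataset_dir, or vocab_file whose paths do not exist. The example dictionary is hypothetical and heavily abbreviated; its nesting and the /work/glm2 paths are assumptions, and the real YAML file may be organised differently. In practice the dictionary could be obtained with PyYAML's yaml.safe_load, assuming that package is installed.

```python
import os

PATH_KEYS = ("load_checkpoint", "dataset_dir", "vocab_file")


def collect_paths(node, keys=PATH_KEYS, prefix=""):
    """Recursively collect (dotted_key, value) pairs for the given key names."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            dotted = f"{prefix}.{key}" if prefix else str(key)
            if key in keys and isinstance(value, str):
                found.append((dotted, value))
            else:
                found.extend(collect_paths(value, keys, dotted))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.extend(collect_paths(item, keys, f"{prefix}[{index}]"))
    return found


def missing_paths(config):
    """Return the configured paths that do not exist on disk."""
    return [(key, value) for key, value in collect_paths(config)
            if not os.path.exists(value)]


# Hypothetical, abbreviated config; the real file has many more entries.
example = {
    "load_checkpoint": "/work/glm2/glm2_6b.ckpt",
    "train_dataset": {
        "data_loader": {"dataset_dir": "/work/glm2/AdvertiseGen/train.json"},
    },
    "processor": {"tokenizer": {"vocab_file": "/work/glm2/tokenizer.model"}},
}
for key, value in missing_paths(example):
    print(f"{key} points to a missing path: {value}")
```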

Single card fine-tuning

First, on BMS, export the environment variables RANK_ID and RANK_TABLE_FILE so that NPU resources can be used for fine-tuning/training:

export RANK_ID=0
export RANK_TABLE_FILE=xxx/hccl_8p.json # hccl_8p.json is the file generated by: python ./mindformers/tools/hccl_tools.py --device_num "[0,8)"
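
Before launching, the two variables can be checked for consistency. The sketch below reads the rank table and confirms that RANK_ID is a valid index into it. It assumes the usual rank table layout, a server_list whose entries each hold a device list with device_id fields; treat that layout, the helper names, and the placeholder IP addresses as assumptions and compare them with the file that hccl_tools.py actually produced.

```python
import json
import os


def rank_table_device_ids(rank_table):
    """Return the device ids listed in a rank table dict, in file order."""
    return [
        device["device_id"]
        for server in rank_table.get("server_list", [])
        for device in server.get("device", [])
    ]


def check_rank_env(environ=os.environ):
    """Load RANK_TABLE_FILE and confirm RANK_ID is a valid rank index."""
    rank_id = int(environ["RANK_ID"])
    with open(environ["RANK_TABLE_FILE"], encoding="utf-8") as handle:
        devices = rank_table_device_ids(json.load(handle))
    if not 0 <= rank_id < len(devices):
        raise ValueError(
            f"RANK_ID={rank_id} is outside the {len(devices)} ranks in the table")
    return devices


# Hypothetical two-card table with placeholder addresses, for illustration only.
sample = {
    "server_list": [
        {"device": [
            {"device_id": "0", "device_ip": "192.0.2.1", "rank_id": "0"},
            {"device_id": "1", "device_ip": "192.0.2.2", "rank_id": "1"},
        ]},
    ],
}
print(rank_table_device_ids(sample))
```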

Modify the configs/glm2/run_glm2_6b_lora.yaml configuration file and set use_parallel to False, then start fine-tuning:

cd scripts
# Usage: bash run_standalone.sh [CONFIG_PATH] [DEVICE_ID] [RUN_STATUS]
bash run_standalone.sh ../configs/glm2/run_glm2_6b_lora.yaml 0 finetune

Training log path: mindformers/scripts/mf_standalone/

Checkpoint storage path: mindformers/scripts/mf_standalone/output/checkpoint

Or run the following command:

python run_mindformer.py --config configs/glm2/run_glm2_6b_lora.yaml --run_mode finetune --device_id 0
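
Once training has written checkpoints, the newest one can be located programmatically. The sketch below searches the checkpoint directory recursively for .ckpt files and returns the most recently modified one. The recursive search is an assumption made so that any per-rank subdirectories are covered, and the helper name is hypothetical.

```python
from pathlib import Path


def latest_checkpoint(checkpoint_dir):
    """Return the most recently modified .ckpt file under checkpoint_dir, or None."""
    candidates = list(Path(checkpoint_dir).rglob("*.ckpt"))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


# Path taken from the checkpoint storage location given above.
print(latest_checkpoint("mindformers/scripts/mf_standalone/output/checkpoint"))
```

The function returns None when no checkpoint has been written yet, which makes it safe to call while a job is still starting up.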

Single-machine multi-card fine-tuning

First, modify the configs/glm2/run_glm2_6b_lora.yaml configuration file and set use_parallel to True. Also update the following settings:

use_parallel: True
parallel:
  parallel_mode: 1 # 0-dataset, 1-semi, 2-auto, 3-hybrid
  gradients_mean: False
  loss_repeated_mean: True
  enable_alltoall: False
  full_batch: True
  search_mode: "sharding_propagation"
  enable_parallel_optimizer: True # optimizer shard
  strategy_ckpt_config:
    save_file: "./ckpt_strategy.ckpt"
    only_trainable_params: False # Set to False to save all parameters in the strategy file
parallel_config:
  data_parallel: 8 # 8 devices (NPU cards); data_parallel * model_parallel * pipeline_stage should equal the device count
  model_parallel: 1
  pipeline_stage: 1
  expert_parallel: 1
  micro_batch_num: 1
  vocab_emb_dp: True
  gradient_aggregation_group: 4
micro_batch_interleave_num: 1
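
A common consistency rule for this kind of configuration is that data_parallel x model_parallel x pipeline_stage equals the number of devices in use. The sketch below checks that rule for a parsed parallel_config section. The helper name is hypothetical, and the dictionary simply mirrors the three relevant values from the YAML above.

```python
def check_parallel_config(parallel_config, device_num):
    """Raise ValueError unless dp * mp * pp equals the number of devices."""
    dp = parallel_config.get("data_parallel", 1)
    mp = parallel_config.get("model_parallel", 1)
    pp = parallel_config.get("pipeline_stage", 1)
    product = dp * mp * pp
    if product != device_num:
        raise ValueError(
            f"data_parallel * model_parallel * pipeline_stage = {product}, "
            f"but {device_num} devices are in use")
    return product


# Values copied from the parallel_config section above.
parallel_config = {"data_parallel": 8, "model_parallel": 1, "pipeline_stage": 1}
print(check_parallel_config(parallel_config, device_num=8))
```

Running the check before run_distribute.sh catches a mismatch such as data_parallel: 8 on a four-card job without waiting for the distributed launch to fail.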

On the machine, go to the mindformers directory, enter its scripts directory, and run the following command to start multi-card fine-tuning:

bash run_distribute.sh /work/mindformers/mindformers/hccl_8p.json ../configs/glm2/run_glm2_6b_lora.yaml [0,8] finetune 8

Here /work/mindformers/mindformers/hccl_8p.json is the RANK_TABLE_FILE, that is, the rank table file prepared in the earlier step.

Model inference

Quick inference based on Pipeline

import mindspore
from mindformers import AutoConfig, AutoModel, AutoTokenizer
# Use graph mode (mode=0) and select the id of the device (card) to run on
mindspore.set_context(mode=0, device_id=0)
tokenizer = AutoTokenizer.from_pretrained('glm2_6b')
# There are two ways to instantiate the model; choose one of them.
# 1. Instantiate directly from the default configuration
model = AutoModel.from_pretrained('glm2_6b')
# 2. Instantiate from a customized configuration
config = AutoConfig.from_pretrained('glm2_6b')
config.use_past = True # Override the default configuration here: enable incremental inference to speed up generation
# config.xxx = xxx # Modify other model configuration items as needed
model = AutoModel.from_config(config) # Instantiate the model from the customized configuration

inputs = tokenizer("Hello")["input_ids"]
# The first call to model.generate() includes graph compilation time, so its timing is not representative. Call it several times to measure inference performance accurately.
outputs = model.generate(inputs, max_new_tokens=20, do_sample=True, top_k=3)
response = tokenizer.decode(outputs)
print(response)
# ['Hello, as an artificial intelligence assistant, I welcome you to ask me questions at any time. ']
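
Since the first generate() call includes graph compilation, a simple way to measure steady-state latency is to discard one or more warm-up calls and average the rest. The sketch below shows that pattern with a stand-in workload so that it runs without an NPU. The helper name and the default counts are hypothetical choices.

```python
import time


def time_calls(fn, warmup=1, runs=3):
    """Call fn warmup times untimed, then return the mean seconds over runs timed calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


# Stand-in workload; with a loaded model, pass instead:
# lambda: model.generate(inputs, max_new_tokens=20, do_sample=True, top_k=3)
average = time_calls(lambda: sum(range(100_000)))
print(f"average latency: {average:.6f} s")
```

With sampling enabled, the number of generated tokens can differ between calls, so comparing per-token latency may be more informative than comparing whole-call times.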