[Hands-on fine-tuning of large models] Fine-tuning a large model with PEFT and your own dataset

Personal blog: Sekyoro’s blog cabin
Personal website: Proanimer’s personal website
This is a very popular topic at the moment. Large models have huge numbers of parameters and occupy a lot of memory, which makes them difficult to train from scratch; for specific tasks they generally require fine-tuning techniques instead.
My complete code: AnimeBot.ipynb – Colaboratory (google.com)

What is a large language model (LLM)

LLM, short for Large Language Model, is a recent innovation in artificial intelligence and machine learning. In December 2022, with the release of ChatGPT, this powerful new kind of AI went viral on the Internet. For anyone living outside the AI buzz and the tech news cycle: ChatGPT is a chat interface running on an LLM called GPT-3.

The most recent large models include Meta's LLaMA 2, OpenAI's GPT-4, and Google's PaLM 2; in China there is Tsinghua's ChatGLM, among others.

Large-model fine-tuning means changing a model's parameters, or some of its layers, on top of a pre-trained base so that it copes better with certain downstream tasks. When you want to adapt a pre-existing model to a specific task or domain, fine-tuning is crucial in machine learning. The decision to fine-tune your model depends on your goals, which are usually domain- or task-specific.


There are many fine-tuning techniques today. Each is designed to solve a particular class of tasks and generally requires task-specific data.

Three approaches are commonly involved: prompt engineering, embeddings, and fine-tuning proper.

Prompt Engineering

To put it simply, it means giving the model some known information up front when you talk to it.

ChatGPT then answers the query using the custom data (for example, revenue numbers) provided in the prompt.

This approach is simple, but because of prompt-size limits and the cost of passing large amounts of text to the LLM, it is not ideal when large document sets or web pages need to be used as input.
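
As a minimal sketch (the revenue numbers and prompt wording are made up), custom data can simply be packed into the prompt string before it is sent to the model:

# Custom data (made-up revenue numbers) is pasted straight into the prompt.
context = """Q1 revenue: $1.2M
Q2 revenue: $1.5M
Q3 revenue: $1.1M"""

question = "Which quarter had the highest revenue?"

prompt = f"""Answer the question using only the context below.

Context:
{context}

Question: {question}
Answer:"""

# `prompt` is then sent to the LLM (e.g., through a chat API).
print(prompt)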

Embeddings

Embedding is a way of representing information, whether text, images, or audio, in numeric (vector) form.


Embeddings work well when a large number of documents or web pages need to be made available to the LLM, for example when building a chatbot that answers users' questions from a set of policy documents.

To use this approach, the text (or other content) first has to be converted into embeddings, which requires an embedding model. When the user queries the LLM, the relevant embeddings are retrieved from the vector store and passed to the LLM, which uses them to generate a response grounded in the custom data.
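
A minimal sketch of this retrieval flow, assuming the sentence-transformers library and the all-MiniLM-L6-v2 embedding model (both just common choices), with a plain numpy array standing in for the vector store:

import numpy as np
from sentence_transformers import SentenceTransformer

# Embed the documents once; the array of vectors is a toy "vector store".
embedder = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["Refunds are processed within 14 days.",
        "Shipping is free for orders over $50."]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

# At query time, embed the question, retrieve the closest document,
# and place it into the LLM prompt as context.
query_vec = embedder.encode(["How long do refunds take?"], normalize_embeddings=True)
scores = doc_vecs @ query_vec.T        # cosine similarity (vectors are normalized)
best_doc = docs[int(np.argmax(scores))]
print(best_doc)                        # -> the refunds policy sentence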

Fine tuning


Fine-tuning is a way of teaching a model how to handle input queries and how to format its responses. For example, an LLM can be fine-tuned on customer reviews paired with their corresponding sentiments.

Fine-tuning is typically used to adapt the LLM to a specific task and obtain responses within that scope. The task could be email classification, sentiment analysis, entity extraction, generating product descriptions from specifications, and so on.
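
As a concrete illustration (field names and values are made up), such fine-tuning data is just input/response pairs:

# Hypothetical fine-tuning records: each pairs an input with the desired response.
train_examples = [
    {"review": "The battery lasts all day, great value.", "sentiment": "positive"},
    {"review": "Stopped working after a week.",           "sentiment": "negative"},
]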

Specific fine-tuning techniques include LoRA, QLoRA, PEFT, and others.

Fine-tuning techniques
Old-school approaches

In the old-school approach, there are various ways to fine-tune pre-trained language models, each tailored to specific needs and resource constraints.

  • Feature-based: uses the pre-trained LLM as a feature extractor that converts the input text into a fixed-size vector. A separate classifier network then predicts class probabilities for NLP tasks. During training, only the classifier's weights change, which makes this resource-friendly but potentially less performant (see the sketch after this list).
  • Fine-tuning I: enhances the pre-trained LLM by adding extra dense layers. During training, only the weights of the newly added layers are adjusted while the pre-trained LLM weights stay frozen. In experiments it shows slightly better performance than the feature-based approach.
  • Fine-tuning II: here the entire model, including the pre-trained LLM, is unfrozen for training, allowing all model weights to be updated. This can lead to catastrophic forgetting, where new features overwrite old knowledge. Fine-tuning II is resource-intensive but provides superior results when maximum performance is required.
  • ULMFiT (Universal Language Model Fine-tuning): a transfer-learning method applicable to NLP tasks. It uses a 3-layer AWD-LSTM architecture for its representations and fine-tunes a pre-trained language model for specific downstream tasks.
  • Gradient-based parameter importance ranking: these methods rank the importance of features or parameters in a model. In gradient-based ranking, a parameter's importance is measured by how much accuracy drops when the parameter is excluded. In random-forest-based ranking, the impurity reduction for each feature is averaged and features are ranked by this metric.
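
Here is a minimal sketch of the feature-based approach, assuming bert-base-cased as the frozen backbone; only the small classifier head receives gradients:

import torch
from transformers import AutoModel, AutoTokenizer

# Frozen pre-trained LLM used purely as a feature extractor.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
backbone = AutoModel.from_pretrained("bert-base-cased")
for p in backbone.parameters():
    p.requires_grad = False

# Separate classifier head: the only part whose weights are trained.
classifier = torch.nn.Linear(backbone.config.hidden_size, 2)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

texts = ["great movie", "terrible plot"]   # toy batch
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, return_tensors="pt")
with torch.no_grad():                       # no gradients through the backbone
    features = backbone(**batch).last_hidden_state[:, 0]  # [CLS] embeddings

logits = classifier(features)
loss = torch.nn.functional.cross_entropy(logits, labels)
loss.backward()                             # updates flow only into the classifier
optimizer.step()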


Leading-edge strategies for LLM fine-tuning
  • Low-Rank Adaptation (LoRA): LoRA is a technique for fine-tuning large language models. It uses low-rank approximation to reduce the computational and financial cost of adapting models with billions of parameters, such as GPT-3, to specific tasks or domains (a numeric sketch follows this list).
  • Quantized LoRA (QLoRA): QLoRA is an efficient fine-tuning method for large language models (LLMs) that significantly reduces memory usage while maintaining full 16-bit fine-tuning performance. It achieves this by backpropagating gradients through a frozen 4-bit quantized pretrained language model into a low-rank adapter.
  • Parameter-Efficient Fine-Tuning (PEFT): PEFT is an NLP technique that reduces compute and storage costs by fine-tuning only a small set of parameters, allowing pre-trained language models to adapt effectively to various applications. It helps avoid catastrophic forgetting, tunes the key parameters for specific tasks, and delivers performance comparable to full fine-tuning on tasks such as image classification and Stable Diffusion DreamBooth. It is a valuable approach for achieving high performance with minimal trainable parameters.
  • DeepSpeed: DeepSpeed is a deep-learning software library for accelerating the training of large language models. It includes ZeRO (Zero Redundancy Optimizer), a memory-efficient approach to distributed training. DeepSpeed can automatically optimize fine-tuning jobs that use Hugging Face's Trainer API and provides a drop-in script for running existing fine-tuning scripts.
  • ZeRO: ZeRO is a set of memory-optimization technologies that enable efficient training of very large models, such as GPT-2 and Turing-NLG 17B, scaling toward trillions of parameters. A major attraction of ZeRO is that no model-code modifications are required. It is a memory-efficient form of data parallelism that gives you access to the aggregate GPU memory of all available GPU devices, without the inefficiency caused by data copying in plain data parallelism.
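
The core idea behind LoRA from the list above can be sketched in a few lines of numpy (dimensions and scaling are illustrative): the frozen weight W is augmented with a trainable low-rank product BA scaled by alpha/r:

import numpy as np

d, k, r, alpha = 1024, 1024, 8, 16   # illustrative sizes; r is the LoRA rank
W = np.random.randn(d, k)            # frozen pre-trained weight (never updated)
A = np.random.randn(r, k) * 0.01     # trainable low-rank factor
B = np.zeros((d, r))                 # trainable; zero-init so the adapter starts as a no-op

x = np.random.randn(k)
# Forward pass: the original projection plus the scaled low-rank update.
y = W @ x + (alpha / r) * (B @ (A @ x))

# Only A and B are trained: d*r + r*k parameters instead of d*k.
print(f"trainable: {d * r + r * k:,} vs full: {d * k:,}")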

Nowadays, LoRA (and its derivatives) together with the PEFT library are what is generally used.

You can build the fine-tuning dataset yourself, or find one on Hugging Face, Google Dataset Search, GitHub, and the like.

As for the model, it is generally loaded directly through tool libraries such as Hugging Face or LangChain; there is no need to download it manually. After obtaining language (or other) data, preprocessing steps such as embedding are generally required, and the embedding model usually needs to be consistent with the model that handles the task, so that the two correspond.

Next, Hugging Face's transformers and related libraries are used to fine-tune large models. AutoModel, AutoTokenizer and AutoConfig are used frequently; calling from_pretrained loads the relevant weights and configuration. The following is the general training process.

Training process
# Transformers installation
pip install transformers datasets
# To install from source instead of the last release, comment the command above and uncomment the following one.
# pip install git+https://github.com/huggingface/transformers.git

from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer

dataset = load_dataset("yelp_review_full")
#dataset["train"][100]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))


model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

training_args = TrainingArguments(output_dir="test_trainer")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()

The compute_metrics above is used to evaluate the model; training_args holds the parameters set for training.

import numpy as np
import evaluate

metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

You can use trainer.push_to_hub() to push the model to your own repository on the Hub. This automatically adds the training hyperparameters, training results, and framework version to your model card.

PEFT adapters

Adapters trained with PEFT are typically an order of magnitude smaller than the full model, which makes them easier to share, store, and load. They are usually paired with a LoRA configuration.

from transformers import AutoModelForCausalLM, AutoTokenizer

peft_model_id = "ybelkada/opt-350m-lora"
model = AutoModelForCausalLM.from_pretrained(peft_model_id)

To load and use a PEFT adapter, make sure the Hub repository or local directory contains an adapter_config.json file and the adapter weights.

You can also load the base model first and then attach the adapter with load_adapter:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-350m"
peft_model_id = "ybelkada/opt-350m-lora"

model = AutoModelForCausalLM.from_pretrained(model_id)
model.load_adapter(peft_model_id)

The load_in_8bit and device_map arguments control where the model is placed and how much memory it occupies.
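
For example, a sketch of loading a base model in 8-bit precision with automatic device placement (requires the bitsandbytes package; facebook/opt-350m is just the model used throughout this post):

from transformers import AutoModelForCausalLM

# Load the base model in 8-bit precision and let Accelerate spread the
# layers across the available devices (requires bitsandbytes).
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    load_in_8bit=True,
    device_map="auto",
)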

Add adapter

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig

model_id = "facebook/opt-350m"
model = AutoModelForCausalLM.from_pretrained(model_id)

lora_config = LoraConfig(
    target_modules=["q_proj", "k_proj"],
    init_lora_weights=False
)

model.add_adapter(lora_config, adapter_name="adapter_1")

Train an adapter

from peft import LoraConfig

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)
model.add_adapter(peft_config)
trainer = Trainer(model=model, ...)
trainer.train()
model.save_pretrained(save_dir)
model = AutoModelForCausalLM.from_pretrained(save_dir)

Each PEFT method is defined by the PeftConfig class, which stores all the important parameters used to build the PeftModel.

from peft import LoraConfig, TaskType

peft_config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1)

# lora_r, lora_alpha, lora_dropout and lora_target_modules are hyperparameters
# you define yourself (concrete values appear in the hands-on section below)
peft_config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules=lora_target_modules,
    bias="none",
    task_type="CAUSAL_LM",
)

Use the get_peft_model function to wrap the base model together with peft_config to create a PeftModel, and use print_trainable_parameters to print the parameters that will be updated.

from transformers import AutoModelForSeq2SeqLM
from peft import get_peft_model

model_name_or_path = "bigscience/mt0-large"
tokenizer_name_or_path = "bigscience/mt0-large"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

Save and push the model to the Hub repository

model.save_pretrained("output_dir")
model.push_to_hub("my_awesome_peft_model")

This saves only the incrementally trained PEFT weights, meaning storage, transfer, and loading are very efficient. For example, the bigscience/T0_3B model trained with LoRA on the twitter_complaints subset of the RAFT dataset contains only two files: adapter_config.json and adapter_model.bin.

Downloading the model

The logic of the following snippet is: first obtain the PEFT configuration via PeftConfig to find where the base model lives, then load that base model and its tokenizer, and finally wrap the base model with PeftModel.

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel, PeftConfig

peft_model_id = "smangrul/twitter_complaints_bigscience_T0_3B_LORA_SEQ_2_SEQ_LM"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

device = "cuda"
model = model.to(device)
model.eval()
inputs = tokenizer("Tweet text : @HondaCustSvc Your customer service has been horrible during the recall process. I will never purchase a Honda again. Label :", return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(input_ids=inputs["input_ids"].to(device), max_new_tokens=10)
    print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0])
# 'complaint'

You can also simply use

from peft import AutoPeftModelForCausalLM
peft_model = AutoPeftModelForCausalLM.from_pretrained("ybelkada/opt-350m-lora")

from peft import AutoPeftModel
model = AutoPeftModel.from_pretrained(peft_model_id)

Hands-on practice

Install the required packages

Generally these are Hugging Face's transformers and datasets, plus xformers, accelerate, trl, bitsandbytes, peft, and similar libraries.

!pip install -Uqqq pip --progress-bar off
!pip install -qqq torch==2.0.1 --progress-bar off
!pip install -qqq transformers==4.32.1 --progress-bar off
!pip install -qqq datasets==2.14.4 --progress-bar off
!pip install -qqq peft==0.5.0 --progress-bar off
!pip install -qqq bitsandbytes==0.41.1 --progress-bar off
!pip install -qqq trl==0.7.1 --progress-bar off
Data processing

There are many ways to process the data and many possible implementations. Here we mainly use pandas and datasets to handle the CSV data.

import re
import random
import pandas as pd
from datasets import load_dataset, Dataset, DatasetDict
from sklearn.model_selection import train_test_split

animes_dataset = load_dataset("csv", data_files="/content/animes.csv")
reviews_dataset = load_dataset("csv", data_files="/content/reviews.csv")
animes_df = pd.DataFrame(animes_dataset["train"])
reviews_df = pd.DataFrame(reviews_dataset["train"])
merged_df = pd.merge(animes_df, reviews_df, left_on="uid", right_on="anime_uid")

# remove stray whitespace such as \n and \r
def clean_text(x):
  new_string = str(x).strip()
  # collapse runs of 3+ whitespace characters into a single space
  pattern = r"\s{3,}"
  new_string = re.sub(pattern, " ", new_string)
  # remove \n, \r and \t
  pattern = r"[\n\r\t]"
  new_string = re.sub(pattern, "", new_string)
  return new_string

merged_df["synopsis"] = merged_df["synopsis"].map(clean_text)
merged_df["text"] = merged_df["text"].map(clean_text)

# split merged_df into train and test
train_df, test_df = train_test_split(merged_df, test_size=0.1, random_state=42)

dataset_dict = DatasetDict({
    "train": Dataset.from_pandas(train_df),
    "validation": Dataset.from_pandas(test_df)
})
DEFAULT_SYSTEM_PROMPT = "Below is a name of an anime,write some intro about it" #@param {type:"string"}
DEFAULT_SYSTEM_PROMPT = DEFAULT_SYSTEM_PROMPT.strip()

def generate_training_prompt(data_point):
  # Remove square brackets and spaces from the string
  genres = data_point["genre"].strip("[]").replace(" ", "").replace("'","")
  synopsis_len = len(data_point["synopsis"])
  split_len = random.randint(1,synopsis_len)
  synopsis_input = data_point["synopsis"][1:split_len]

  input = data_point["title"] + genres + synopsis_input
  output = data_point["synopsis"] + data_point["text"]
  return {<!-- -->
      "text":f"""### Instruction: {<!-- -->DEFAULT_SYSTEM_PROMPT}
            ### Input:
            {<!-- -->input.strip()}

            ### Response:
            {<!-- -->output.strip()}
            """.strip()
  }
def process_dataset(data: Dataset):
    return (
        data.shuffle(seed=42)
        .map(generate_training_prompt)
        .remove_columns(
              [
                "uid_x",
                "aired",
                "members",
                "img_url",
                "uid_y",
                "profile",
                "anime_uid",
                "score_y",
                "link_y"
            ]
        )
    )
dataset_dict["train"] = process_dataset(dataset_dict["train"])
dataset_dict["validation"] = process_dataset(dataset_dict["validation"])

The processing logic here is not actually complicated: read the data with pandas, split it into a training set and a test set, and convert them to Datasets; along the way, strip the stray whitespace characters from the dataframe.

Training settings

Since we are using PEFT, we first define the LoRA configuration:

lora_r = 16
lora_alpha = 64
lora_dropout = 0.1
lora_target_modules = [
    "q_proj",
    "up_proj",
    "o_proj",
    "k_proj",
    "down_proj",
    "gate_proj",
    "v_proj",
]

peft_config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules=lora_target_modules,
    bias="none",
    task_type="CAUSAL_LM",
)

Set up the TrainingArguments and use trl's SFTTrainer for training.

OUTPUT_DIR = "experiments"
training_arguments = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    logging_steps=1,
    learning_rate=1e-4,
    fp16=True,
    max_grad_norm=0.3,
    num_train_epochs=2,
    evaluation_strategy="steps",
    eval_steps=0.2,
    warmup_ratio=0.05,
    save_strategy="epoch",
    group_by_length=True,
    output_dir=OUTPUT_DIR,
    report_to="tensorboard",
    save_safetensors=True,
    lr_scheduler_type="cosine",
    seed=42,
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=4096,
    tokenizer=tokenizer,
    args=training_arguments,
)
Training and follow-up evaluation tests
trainer.train()
from peft import AutoPeftModelForCausalLM
# Load Lora adapter
# model = PeftModel.from_pretrained(
# base_model,
# "/content/Finetuned_adapter",
# )
# merged_model = model.merge_and_unload()
trained_model = AutoPeftModelForCausalLM.from_pretrained(
    OUTPUT_DIR,
    low_cpu_mem_usage=True,
)
merged_model = trained_model.merge_and_unload()
merged_model.save_pretrained("merged_model", safe_serialization=True)
tokenizer.save_pretrained("merged_model")
# trainer.push_to_hub("anime_chatbot")
merged_model.push_to_hub("anime_chatbot")
print("Pushed to hub")
# @title test the fine-tuned model
from transformers import pipeline

DEFAULT_SYSTEM_PROMPT = "Below is a name of an anime, write some intro about it" #@param {type:"string"}
DEFAULT_SYSTEM_PROMPT = DEFAULT_SYSTEM_PROMPT.strip()
user_prompt = lambda input: f"""### Instruction: {DEFAULT_SYSTEM_PROMPT}
            ### Input:
            {input.strip()}
            ### Response:
            """.strip()
pipe = pipeline('text-generation', model=merged_model, tokenizer=tokenizer, max_length=150)

result = pipe(user_prompt("please introduce shingekinokyojin"))
print(result[0]['generated_text'])
Note
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_base = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

Here model_base is:

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 512, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
      (project_out): Linear(in_features=1024, out_features=512, bias=False)
      (project_in): Linear(in_features=512, out_features=1024, bias=False)
      (layers): ModuleList(
        (0-23): 24 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
          (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
  )
  (lm_head): Linear(in_features=512, out_features=50272, bias=False)
)
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    target_modules=["q_proj", "k_proj"],
    init_lora_weights=False
)
peft_model = get_peft_model(model_base, lora_config)
peft_model.print_trainable_parameters()

Use lora_config to get peft_model

PeftModel(
  (base_model): LoraModel(
    (model): OPTForCausalLM(
      (model): OPTModel(
        (decoder): OPTDecoder(
          (embed_tokens): Embedding(50272, 512, padding_idx=1)
          (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
          (project_out): Linear(in_features=1024, out_features=512, bias=False)
          (project_in): Linear(in_features=512, out_features=1024, bias=False)
          (layers): ModuleList(
            (0-23): 24 x OPTDecoderLayer(
              (self_attn): OPTAttention(
                (k_proj): Linear(
                  in_features=1024, out_features=1024, bias=True
                  (lora_dropout): ModuleDict(
                    (default): Identity()
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=1024, out_features=8, bias=False)
                  )
                  (lora_B): ModuleDict(
                    (default): Linear(in_features=8, out_features=1024, bias=False)
                  )
                  (lora_embedding_A): ParameterDict()
                  (lora_embedding_B): ParameterDict()
                )
                (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
                (q_proj): Linear(
                  in_features=1024, out_features=1024, bias=True
                  (lora_dropout): ModuleDict(
                    (default): Identity()
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=1024, out_features=8, bias=False)
                  )
                  (lora_B): ModuleDict(
                    (default): Linear(in_features=8, out_features=1024, bias=False)
                  )
                  (lora_embedding_A): ParameterDict()
                  (lora_embedding_B): ParameterDict()
                )
                (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
              )
              (activation_fn): ReLU()
              (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
              (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
            )
          )
        )
      )
      (lm_head): Linear(in_features=512, out_features=50272, bias=False)
    )
  )
)

Use peft_model.merge_and_unload() to get the merged model:

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 512, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
      (project_out): Linear(in_features=1024, out_features=512, bias=False)
      (project_in): Linear(in_features=512, out_features=1024, bias=False)
      (layers): ModuleList(
        (0-23): 24 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
          (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
  )
  (lm_head): Linear(in_features=512, out_features=50272, bias=False)
)
Some problems encountered
  1. Dataset processing: how to write the fine-tuning templates

Examples found

14.fine-tuning-llama-2-7b-on-custom-dataset.ipynb – Colaboratory (google.com)

Fine_tuned_Llama_PEFT_QLora.ipynb – Colaboratory (google.com)

Use a template during training

DEFAULT_SYSTEM_PROMPT = """
Below is a conversation between a human and an AI agent. Write a summary of the conversation.
""".strip()


def generate_training_prompt(
    conversation: str, summary: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT
) -> str:
    return f"""### Instruction: {<!-- -->system_prompt}

### Input:
{<!-- -->conversation.strip()}

### Response:
{<!-- -->summary}
""".strip()

During testing:

def generate_prompt(
    conversation: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT
) -> str:
    return f"""### Instruction: {<!-- -->system_prompt}

### Input:
{<!-- -->conversation.strip()}

### Response:
""".strip()
  2. What type of model is obtained after training: a PeftModel, or something else?

    One way is

repo_id = "meta-llama/Llama-2-7b-chat-hf"
use_ram_optimized_load=False

base_model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    device_map='auto',
    trust_remote_code=True,
)

base_model.config.use_cache = False

base_model is a LlamaForCausalLM and is used after training.

trainer.save_model("Finetuned_adapter") saves the adapter; then use PeftModel.from_pretrained to get the PeftModel:

model = PeftModel.from_pretrained(
    base_model,
    "/content/Finetuned_adapter",
    )
merged_model = model.merge_and_unload()

Then save the model

merged_model.save_pretrained("/content/Merged_model")
tokenizer.save_pretrained("/content/Merged_model")

The other is to use AutoPeftModelForCausalLM

from peft import AutoPeftModelForCausalLM

trained_model = AutoPeftModelForCausalLM.from_pretrained(
    OUTPUT_DIR,
    low_cpu_mem_usage=True,
)

merged_model = trained_model.merge_and_unload()
merged_model.save_pretrained("merged_model", safe_serialization=True)
tokenizer.save_pretrained("merged_model")

Reference materials

  1. Training Large Language Model (LLM) on your data | Mohit Soni | Walmart Global Tech Blog | Medium, Aug 2023
  2. A Practical Introduction to LLMs | Shawhin Talebi | Towards Data Science
  3. The Ultimate Guide to LLM Fine Tuning: Best Practices & Tools | Lakera
  4. Tutorial: https://learn.deeplearning.ai/finetuning-large-language-models

If you have any questions, you are welcome to communicate!
