Fine-tuning BGE for a vertical domain in RAG: lessons learned

Article directory

  • Foreword
  • Data format
  • Part of the code
  • Training parameters
  • Next attempts
  • Summary

Foreword

With the popularity of large models, many vertical industries have begun to use them to optimize their businesses. The most typical approach is RAG (Retrieval-Augmented Generation). Put simply, it uses retrieval to find the passages most relevant to the user's question and then lets the LLM answer the question based on those passages. I did something similar once when I was at CSDN (reference: CSDN Q&A robot), although that work fine-tuned an SBERT model, and it achieved good results. The base model used here is BAAI/bge-large-zh-v1.5.

Data format

{"query": str, "pos": List[str], "neg": List[str]}
{"query": "What are the methods for children to undergo gastroscopy", "pos": ["There are two common methods for children to undergo gastroscopy: 1. If there is pain during gastroscopy, no Do a gastroscopy under anesthesia; 2. Painless gastroscopy, do a gastroscopy under anesthesia. A common method in China, anesthesia is not recommended. Because anesthesia in children has many side effects, including the side effects of the anesthetic itself and recovery after anesthesia, All have side effects. Without anesthesia, that is, without using anesthetics, the current common method is to catch the child and do a gastroscopy. Although this situation recovers quickly, it also has disadvantages. Because the child is afraid of doing this Examination. It is more painful to do the examination without anesthesia and very clearly. Although the process is relatively short, it will cause a certain shadow and pressure on the future psychology. The comprehensive pros and cons are that each of these two has its own advantages and disadvantages. Place. For children with good cooperation, it is recommended not to use anesthesia. For children who are uncooperative or even extremely uncooperative, it is recommended to undergo gastroscopy under anesthesia for relevant examinations."], "neg": ["Gastroscopy is clinically important It is a very commonly used and very effective examination method. In fact, abroad, especially in Japan, some European and American countries, it is listed as a physical examination. It is not only when you have symptoms, but when you reach a certain age. Do this examination. Some early gastric lesions, such as early gastric cancer and atrophic gastritis, are clinically asymptomatic or have very mild symptoms and are difficult to detect early, so gastroscopy is very important. Gastroscopy is generally divided into two parts Methods: 1. Ordinary gastroscopy. 
Give the patient a tube of lidocaine jelly orally, which mainly acts to anesthetize the pharynx and reduce the pain of gastroscopy. Gastroscopy can be performed in a regular awake state. 2. None Painful gastroscopy. Painless gastroscopy is actually a kind of anesthesia gastroscopy. It uses an intravenous anesthesia method to make him fall asleep, so that there will be no painful reaction, and then do a gastroscopy. The effects of these two examinations are the same, but if The patient is more sensitive and has a strong pharyngeal reflex. It is recommended to do a painless gastroscopy, which will provide better inspection results. Regarding the cost, an ordinary gastroscopy plus some blood tests must be done before the gastroscopy, mainly to screen for infectious diseases. , the cost is about 500-800 yuan. If it is an anesthesia gastroscopy or a painless gastroscopy, the cost is relatively speaking, because it has the cost of anesthesia, including the use of disposable consumables, it may cost about 2,000 yuan.", " A gastroscopy is not something you can do on the same day as a doctor, nor can you do it at any time. You must make an appointment before a gastroscopy. When Chinese people see a doctor, in many cases, they have to see a doctor when they come in, or they have to have an examination when they enter the hospital. This habit is gradually becoming more and more common. To change, you must make an appointment in advance. Because you need to meet certain requirements for gastroscopy: First, your body must meet certain requirements. You must not have serious cardiopulmonary disease, serious mental illness, oral cavity, pharynx and other symptoms. These conditions cannot Do a gastroscopy. If the cervical spine is severely dislocated, you cannot do a gastroscopy. There are contraindications for doing a gastroscopy. Before doing a gastroscopy, the doctor needs to explain it clearly to the patient and ask the patient to sign an informed consent form. 
If you choose to have a painless gastroscopy, you need to make an appointment in advance. The doctor needs to make it clear that painless gastroscopy cannot be done under certain circumstances because the use of painless drugs can induce the onset of certain diseases. You must make an appointment before having a gastroscopy, and the doctor will tell you the precautions. The basic precautions are that if you have a gastroscopy the next morning, you need to fast. You should fast after dinner the night before, and there is no need to fast on the next morning. It is best to fast for 8 hours. Even in this case, some people have poor gastric motility and will develop gastric retention after gastroscopy is inserted. Gastroscopy is ineffective if gastric retention occurs. Firstly, after gastric retention, the diseased part cannot be seen because there is a lot of food in the stomach. Secondly, when a painless gastroscopy is performed during gastric obstruction, reflux will occur and cause suffocation, which is a very dangerous situation. Therefore, you need to fast and drink for a certain period of time before having a gastroscopy. When undergoing gastroscopy for certain special diseases, the doctor will inform you of precautions. For example, patients with diabetes and high blood pressure need to undergo a gastroscopy. Diabetic patients need to inject insulin or take oral hypoglycemic drugs in the morning. Gastroscopy requires fasting on the morning of the surgery. Therefore, for patients with diabetes and hypertension, the doctor will provide medication instructions on the morning of the surgery and tell them when to do so. Take your medicine. In short, there are many requirements before having a gastroscopy. You must go to the endoscopy room to contact the doctor to confirm the precautions before doing the gastroscopy. ", "Gastroscopy is the gold standard for detecting gastric cancer. 
It is divided into ordinary gastroscopy, sedation gastroscopy, capsule gastroscopy and general anesthesia gastroscopy. Gastric mucosal lesions can be found through examination, and gastric cancer can also be diagnosed by pathological biopsy. In addition, CT examination can detect the local invasion of gastric tumors, including metastasis to the liver, lungs or other organs, to determine the tumor stage and lymph node stage, and then determine whether the patient is suitable for surgical treatment. If not, then It is recommended to use chemotherapy or radiotherapy to choose the best treatment plan for the patient. Among them, patients need to fast early in the morning on the day of gastroscopy and take local anesthetics orally to reduce throat reactions during the examination; or they may be given sedatives or slightly heavier doses of anesthetics to perform gastroscopy under sedation or general anesthesia to reduce symptoms. Patient discomfort. ", "Patients who have had it may know this problem very well, but patients who have never had it will have doubts, especially many patients who are afraid of gastroscopy. I often meet patients who tell me that after hesitating for months and not daring to do it, they finally made up their minds and finally did it. It turned out that it turned out that gastroscopy was not that uncomfortable. Of course, he had a painless gastroscopy. In fact, gastroscopy is a very simple process from a layman's perspective. The patient lies on the left side of the hospital bed, and then the doctor passes a flexible tube from the mouth and throat into the esophagus, then into the stomach and into the duodenum. During the process of advancing and retreating, each part is carefully observed. The doctor will draw conclusions and diagnoses during this process, so it is still a relatively simple and quick examination. Experienced doctors usually do it within 5 minutes. 
Of course, if there are special circumstances, such as some lesions in the stomach that require biopsy or treatment, it may take a little longer. ", "You need to fast for 6-8 hours before the gastroscopy. Eat low-residue, easily digestible food for dinner the day before. Avoid eating spicy and hot foods on the day of the examination. Remove the removable dentures before the examination. Patients with high blood pressure and diabetes need to take medications selectively under the guidance of a doctor based on their actual situation. Nausea, vomiting, abdominal distension, abdominal pain, etc. may occur during gastroscopy. Most patients can successfully complete the examination by actively cooperating with the staff's guidance. Those who use local anesthesia before the examination should not drink or eat until 2 hours after the operation to avoid choking and coughing. Suspend aspirin, warfarin, Plavix, Taiga and other drugs for 1-2 days after surgery according to individual conditions; if the patient develops severe abdominal pain, vomiting, bleeding, etc. after the examination, timely medical treatment is required. ", "The precautions before having a gastroscopy are as follows: 1. According to the doctor's judgment, the patient's condition should meet the indications for examination; 2. Keep an empty stomach on the morning of the examination, and conduct a blood test before the examination; 3. Stop taking some drugs before the examination, such as Those who take aspirin, Panax notoginseng and other blood-activating and blood-stasis-removing and anticoagulant drugs should try to stop taking them for one week under the guidance of a doctor before the examination to ensure the safety of the examination; because suspicious lesions may be found during the examination, a biopsy is required. Anticoagulants may cause more bleeding. ", "When performing a painless gastroscopy, the patient must ensure that the food in the stomach has been completely emptied. 
Some patients have gastric retention or gastric motility disorders, and it is not recommended to undergo a gastroscopy. The doctor will inject some short-acting anesthetic drugs into the patient to put the patient in a sleeping state. The medical staff will use special monitoring equipment to monitor the patient's vital signs, such as heart rate, respiration, blood oxygen, etc. Finally, the doctor will slowly insert the gastroscope tube into the stomach. Observe within. The patient's feeling during the procedure is mild, and in most cases no special discomfort will occur. After the examination, the doctor will wake the patient up. The patient will have some symptoms of dizziness and discomfort in the early stage, which will generally disappear slowly on their own. After the gastroscopy, the doctor should instruct the patient to try not to eat indigestible food and irritating and spicy food within 3 days. ", "Painless gastroscopy uses the anesthesiologist to inject a short-acting anesthetic drug into the patient's veins. After the anesthetic drug is infused, the patient quickly falls asleep. Then, while the patient is asleep, the endoscopist performs a routine endoscopy. After the operation, the patient will wake up immediately. Therefore, there are certain hazards in performing painless gastroscopy during this process. The specific hazards are as follows: 1. Some patients will have allergic reactions or poisoning reactions to narcotic drugs, accelerated breathing or heartbeat, anesthesia accidents, and even coma; 2. Anesthetic drugs have a certain degree of respiratory depression. When there is respiratory obstruction or respiratory depression, the patient will have difficulty breathing, which is also more dangerous. 3. Some patients will have gastric contents reflux into the trachea and cause anesthesia accidents. Some patients have respiratory conditions, such as cough, asthma, and cardiac insufficiency. 
Painless gastroscopy is a contraindication for these patients, and they cannot undergo painless gastroscopy at all. ", "Gastroscopy is an invasive examination that can cause certain physical pain to the patient. Patients who are older or have a history of heart disease should first undergo an electrocardiogram to determine whether symptoms such as stomach pain are caused by heart disease. Gastroscopy in patients with myocardial infarction can directly damage the heart and should be avoided as much as possible. In addition, patients with infectious diseases such as hepatitis and HIV are not suitable for gastroscopy because repeated use of gastroscopy may cause iatrogenic cross-contamination. ", "Painless gastroscopy requires general anesthesia, and its hazards are as follows: 1. Intravenous anesthesia can cause respiratory depression, choking, nausea, and vomiting. Especially when the stomach is full, the possibility of nausea and vomiting in patients will be greatly increased. Therefore, it is necessary to fast for 6-8 hours before general anesthesia; 2. After the patient is anesthetized, the patient will lose consciousness and the throat reflex will disappear. Aspiration may occur. In severe cases, it may lead to death from suffocation on the spot. Therefore, if the general anesthesia surgery is not an emergency, sufficient fasting time must be ensured. "]}

For more examples, see the official repository: https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune
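Before training, it can help to sanity-check each line of the JSONL file against the format above. The following is a minimal sketch, not part of the official tooling: the function name `validate_record` and the specific checks (non-empty `neg`, no overlap between `pos` and `neg`) are assumptions made for illustration.

```python
import json
from typing import Any


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one training record (empty if OK)."""
    problems = []
    query = record.get("query")
    if not isinstance(query, str) or not query.strip():
        problems.append("query must be a non-empty string")
    for key in ("pos", "neg"):
        values = record.get(key)
        # Assumption: both lists are required to be non-empty in this check.
        if not isinstance(values, list) or not values:
            problems.append(f"{key} must be a non-empty list")
        elif not all(isinstance(v, str) and v.strip() for v in values):
            problems.append(f"{key} must contain only non-empty strings")
    # A passage that is both positive and negative gives a contradictory signal.
    if isinstance(record.get("pos"), list) and isinstance(record.get("neg"), list):
        overlap = set(map(str, record["pos"])) & set(map(str, record["neg"]))
        if overlap:
            problems.append("pos and neg overlap")
    return problems


line = '{"query": "q1", "pos": ["passage a"], "neg": ["passage b", "passage c"]}'
print(validate_record(json.loads(line)))
```

Running this over every line and printing only the records with problems is usually enough to catch malformed rows early.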

The official README explains the format clearly, so I will not repeat it. Here I will focus on how the dataset was constructed.

I made two attempts:

The first

 1. Use the base bge model to vectorize all the data
 2. Recall the top 10 for each query
 3. Compute reranking scores for the top 10
 4. Compute the bge cosine similarity, Jaccard coefficient, LCS, edit distance, etc. pairwise between query, answer, target_query, and target_answer
 5. Combine the reranking score, bge similarity, Jaccard, and the other similarity measures into a strategy for filtering the training data
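The lexical measures in step 4 can be sketched in plain Python. This is an illustration under assumptions: the original implementation is not shown, so the character-level Jaccard, the subsequence (rather than substring) form of LCS, and the normalisation of the LCS score by the longer string are all choices made here, and the helper names are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Character-level Jaccard coefficient: |A & B| / |A | B|."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def pair_features(a: str, b: str) -> dict[str, float]:
    """Lexical features for one text pair, e.g. (query, target_query)."""
    longest = max(len(a), len(b)) or 1
    return {
        "jaccard": jaccard(a, b),
        "lcs_score": lcs_length(a, b) / longest,  # assumed normalisation
        "edit_dist": edit_distance(a, b),
    }


print(pair_features("gastroscopy methods", "gastroscopy method for children"))
```

Calling `pair_features` on each of the four pairings (query/target_query, answer/target_query, query/target_answer, answer/target_answer) gives the lexical columns; the bge cosine similarity and the reranking score come from the models.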

The resulting file has roughly the following fields:

writer.writerow(["query", "answer", "target_query", "target_answer", "rerank_score",
      "query_tgt_query_bge_cos", "query_tgt_query_lcs_score", "query_tgt_query_edit_dist", "query_tgt_query_jaccard",
      "answer_tgt_query_bge_cos", "answer_tgt_query_lcs_score", "answer_tgt_query_edit_dist", "answer_tgt_query_jaccard",
      "query_tgt_answer_bge_cos", "query_tgt_answer_lcs_score", "query_tgt_answer_edit_dist", "query_tgt_answer_jaccard",
      "answer_tgt_answer_bge_cos", "answer_tgt_answer_lcs_score", "answer_tgt_answer_edit_dist", "answer_tgt_answer_jaccard"
       ])

To be honest, that is a lot of fields, and it is not easy to combine so many of them into a filtering rule.

The second

 1. Use the base bge model to vectorize all the data
 2. Recall the top 100 for each query
 3. Keep the data with 0.4 < distance <= 0.7 as negative samples

The second method is clearly much simpler: take the top 100 for each query and apply a distance threshold to pick the training data (the right thresholds differ between datasets, so set them according to your own situation).
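The distance-band filter in step 3 can be sketched independently of the retrieval code. This is a simplified illustration: `select_negatives` is a hypothetical helper, the input is assumed to be a list of `(distance, passage)` pairs for one query, and the fallback to passages beyond the upper threshold plus the cap of 10 negatives are assumptions of the sketch.

```python
import random


def select_negatives(candidates, low=0.4, high=0.7, max_negatives=10, seed=None):
    """Pick difficult negatives from (distance, passage) pairs.

    Passages with low < distance <= high are preferred. If none fall in the
    band, fall back to the passages beyond `high`.
    """
    in_band = [text for dist, text in candidates if low < dist <= high]
    beyond = [text for dist, text in candidates if dist > high]
    chosen = in_band or beyond
    if len(chosen) > max_negatives:
        # Randomly down-sample so every query gets a bounded number of negatives.
        chosen = random.Random(seed).sample(chosen, max_negatives)
    return chosen


candidates = [(0.1, "too close"), (0.5, "hard 1"), (0.6, "hard 2"), (0.9, "easy")]
print(select_negatives(candidates))
```

Passages at or below the lower threshold are dropped entirely, on the reasoning that they are likely near-duplicates of the positive passage rather than true negatives.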

Explanation:
My scenario is query-passage: matching the passage most relevant to a query. The positive examples are generated with baichuan2-13b-chat, which is asked to produce question-answer pairs from a given paragraph. The generated question is the query and the given paragraph is the passage, so each query-passage pair forms a positive example. The data generated by the LLM also needs to be filtered, of course. In the end, fewer than 10,000 items were generated. The negative examples are constructed with the two methods above.

The reranking model mentioned above is bge-reranker-large.

My prompt (for reference only):

prompt = f"""
'''
{paragraph}
'''

Please extract question-and-answer pairs from the literature paragraph above. Strictly follow these requirements: each extracted question and answer must contain a subject, predicate, and object. Vague references such as this study, this experiment, etc. must not appear. Make sure the answer to every extracted question can be found in the original text. Return the results strictly in the format shown between ''' below.
'''
[
    {{
        "Question": "Question 1 content",
        "answer": "Answer to question 1"
    }},
    {{
        "Question": "Question 2 content",
        "answer": "Answer to question 2"
    }},
    {{
        "Question": "Question 3 content",
        "answer": "Answer to question 3"
    }}
]
'''
"""

In theory, the data constructed by the first method should be better, but in practice it is very time-consuming. Even with multi-GPU parallelization, multi-process optimization, and so on, nearly 200,000 records still take about a day (depending on machine performance). I built one batch of data this way, and the trained model was dozens of points worse than the base model. The main reasons were that the difficult (hard) negative samples were not constructed well and that there were not enough of them. If you have plenty of time, try recalling the top 100 instead. In theory the results should be better!
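Going back to the prompt above: the LLM is asked to reply with a JSON array of objects with the keys "Question" and "answer", so the reply has to be parsed before the pairs can be used. The following is a minimal sketch of that step. The function name `parse_qa_pairs` is hypothetical, and the only filtering shown is dropping malformed or empty entries; the quality filtering the author actually applied is not described.

```python
import json
import re


def parse_qa_pairs(llm_output: str) -> list[dict[str, str]]:
    """Parse the outermost [...] span of the reply and keep well-formed pairs."""
    match = re.search(r"\[.*\]", llm_output, flags=re.DOTALL)
    if match is None:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    pairs = []
    for item in items:
        if not isinstance(item, dict):
            continue
        # Key names follow the prompt: "Question" (capitalised) and "answer".
        question = str(item.get("Question", "")).strip()
        answer = str(item.get("answer", "")).strip()
        if question and answer:
            pairs.append({"query": question, "answer": answer})
    return pairs


reply = 'Here you go:\n[{"Question": "What is X?", "answer": "X is Y."}]'
print(parse_qa_pairs(reply))
```

Each surviving `query` is then paired with the paragraph it was generated from to form a positive example.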

In the end I used the second method. Compared with the base model, top-100 recall increased by 5.7%, and there is still room for optimization.
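How the recall figure was computed is not shown. One common definition, assumed here, is the fraction of queries whose gold passage appears among the top k retrieved passages, with one gold passage per query. The function name `recall_at_k` is illustrative.

```python
def recall_at_k(retrieved: list[list[str]], relevant: list[str], k: int) -> float:
    """Fraction of queries whose gold passage appears in the top-k results.

    retrieved[i] is the ranked list of passage ids for query i;
    relevant[i] is the single gold passage id for query i (an assumption).
    """
    if not relevant:
        return 0.0
    hits = sum(1 for ranked, gold in zip(retrieved, relevant) if gold in ranked[:k])
    return hits / len(relevant)


retrieved = [["p1", "p2", "p3"], ["p9", "p4", "p5"], ["p7", "p8", "p6"]]
relevant = ["p1", "p4", "p0"]
print(recall_at_k(retrieved, relevant, k=2))
```

Evaluating the base model and the fine-tuned model with the same function on the same held-out queries keeps the comparison fair.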

Part of the code

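import logging
import os
import random

import faiss
import joblib
import jsonlines
import pandas as pd
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

# Assumption: the original logger setup is not shown; the standard library logger is used here.
logger = logging.getLogger(__name__)


class BuildTrainData:
    def __init__(self, config, options):
        model_path = "bge-large-zh-v1.5"
        data_path = "src_data.csv"
        logger.info("Loading raw data...")
        self.data = pd.read_csv(data_path)
        logger.info(f"Load vectorized model from {model_path}...")
        self.model = SentenceTransformer(model_path)
        self.model.eval()
        self.batch_size = 32
        self.faiss_measure = faiss.METRIC_L2
        self.index_type = "HNSW64"

        file_name = data_path.split('/')[-1].split('.')[0]

        save_dir = "./data/models/bge_ft"
        os.makedirs(save_dir, exist_ok=True)
        self.embedding_path = f"{save_dir}/embedding_{file_name}.pkl"
        self.faiss_index_path = f"{save_dir}/faiss_{file_name}.index"
        # The original referenced an undefined embedding_name; file_name is assumed to be intended.
        self.bge_train_data_path = f"./data/datasets/bge/train/{file_name}_train.jsonl"
        os.makedirs(os.path.dirname(self.bge_train_data_path), exist_ok=True)
        # Assumption: has_processed_list (undefined in the excerpt) holds queries written by a previous run.
        self.has_processed_list = set()
        if os.path.exists(self.bge_train_data_path):
            with jsonlines.open(self.bge_train_data_path) as reader:
                self.has_processed_list = {row["query"] for row in reader}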

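    def embedding(self, text_list):
        logger.info("Vectorization...")
        # Pass batch_size by keyword: its positional slot differs between sentence-transformers versions
        embeddings = self.model.encode(list(text_list), batch_size=self.batch_size, show_progress_bar=True)
        return embeddings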

    def embedding_mul_gpu(self, text_list):
        logger.info("Multi-GPU parallel vectorization...")
        # Specify the GPU through target_devices, such as target_devices=['cuda:0', 'cuda:1']
        pool = self.model.start_multi_process_pool()
        embeddings = self.model.encode_multi_process(text_list, pool, batch_size=self.batch_size)
        self.model.stop_multi_process_pool(pool)
        return embeddings
    
    def build_faiss_index(self):
        if os.path.exists(self.faiss_index_path):
            logger.info(f"{self.faiss_index_path} already exists...")
            faiss_index = faiss.read_index(self.faiss_index_path)
            embeddings = joblib.load(self.embedding_path)
            return faiss_index, embeddings

        logger.info("Loading vectorized data from local...")
        embeddings = joblib.load(self.embedding_path)
        dim = embeddings.shape[1]
        faiss_index = faiss.index_factory(dim, self.index_type, self.faiss_measure)
        logger.info("Build index...")
        faiss_index.add(embeddings)
        faiss.write_index(faiss_index, self.faiss_index_path)
        return faiss_index, embeddings


    def compute_retrival(self, mul_gpus=None, retrival_topk=100):
        logger.info("Mining difficult samples...")
        query_list = self.data["query"]

        # query = "Generate a representation for this sentence for retrieving related articles:" + row["query"]
        if not os.path.exists(self.embedding_path):
            logger.info("embedding file does not exist, re-embedding...")
            if not mul_gpus:
                logger.info("Only using one GPU...")
                query_embedding = self.embedding(self.data["text"])
            else:
                logger.info("Multi-GPU acceleration...")
                query_embedding = self.embedding_mul_gpu(self.data["text"])
            joblib.dump(query_embedding, self.embedding_path)
        faiss_index, query_embedding = self.build_faiss_index()

        logger.info("Start processing data...")
        distances, indexes = faiss_index.search(query_embedding, retrival_topk)

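        # Queries written by a previous run are skipped. has_processed_list is not
        # defined in the original excerpt, so default to an empty collection here.
        processed = set(getattr(self, "has_processed_list", []))
        for idx, query in enumerate(tqdm(query_list, desc="Mining difficult samples")):
            answer = self.data["text"][idx]
            if query in processed:
                continue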
            target_answers = []

            # The smaller the dist, the more similar it is
            neg_samples_tune = []
            for dist, df_idx in zip(distances[idx], indexes[idx]):
                if df_idx == -1:
                    # faiss pads the result with -1 when fewer than top-k neighbours are found
                    continue

                target_query = self.data["query"][df_idx]
                if target_query == query:
                    continue
                target_answer = self.data["text"][df_idx]
                if target_answer == answer:
                    continue
                
                if dist > 0.4 and dist <= 0.7:
                    target_answers.append(target_answer)
                elif dist > 0.7:
                    neg_samples_tune.append(target_answer)

            
            if len(target_answers) == 0:
                # logger.info(f"query: {query} no negative samples")
                target_answers = neg_samples_tune
                if len(target_answers) == 0:
                    # logger.info(f"query: {query} no negative samples")
                    continue
            elif len(target_answers) > 10:
                target_answers = random.sample(target_answers, 10)
            
            meta = {
                "query": query,
                "pos": [answer],
                "neg": target_answers
            }

            with jsonlines.open(self.bge_train_data_path, 'a') as f:
                f.write(meta)

src_data.csv contains two fields, query and text: query is the question, and text is the paragraph that contains the answer to the question.

The overall process is quite simple. The most important thing is to construct difficult negative examples sensibly, which means trying different thresholds and checking whether the constructed data is accurate.

If you have read the official documentation carefully, you will have noticed that the official repository also provides difficult-sample mining.

After reading the source code, I found that the official difficult-sample mining is only a simple example. It also retrieves the top N, but it then samples negatives at random directly from the range_for_sampling range, which is certainly not as good as our sampling within the 0.3-0.7 distance range. Ideally you would sort by distance and then take N items, from nearest to farthest, among the samples whose distance is greater than 0.3.
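The "sort by distance, then take N" idea can be sketched as follows. This is an illustration only: `nearest_negatives` is a hypothetical helper, the input is assumed to be `(distance, passage)` pairs for one query, and the default values simply echo the 0.3 threshold mentioned in the text.

```python
def nearest_negatives(candidates, min_distance=0.3, n=10):
    """Keep candidates farther than min_distance, nearest first, and take n."""
    eligible = [(dist, text) for dist, text in candidates if dist > min_distance]
    # Smaller distance means more similar, so the most difficult negatives come first.
    eligible.sort(key=lambda pair: pair[0])
    return [text for _, text in eligible[:n]]


candidates = [(0.9, "far"), (0.2, "near-duplicate"), (0.35, "hard"), (0.5, "medium")]
print(nearest_negatives(candidates, n=2))
```

Compared with random sampling inside a band, this always prefers the most similar remaining passages, so the lower threshold matters more: set it too low and near-duplicates of the positive passage slip in as negatives.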

Note: The smaller the distance is, the more similar it is.

Training parameters

torchrun --nproc_per_node 8 \
-m FlagEmbedding.baai_general_embedding.finetune.run \
--output_dir bge-large-zh-medical-v2.1 \
--model_name_or_path ./BAAI/bge-large-zh-v1.5 \
--train_data train_src_v2_train.jsonl \
--learning_rate 1e-5 \
--fp16 \
--num_train_epochs 5 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 32 \
--dataloader_drop_last True \
--normlized True \
--temperature 0.02 \
--query_max_len 256 \
--passage_max_len 512 \
--train_group_size 6 \
--logging_steps 10 \
--logging_strategy steps \
--query_instruction_for_retrieval "" \
--report_to tensorboard \
--save_steps 100 \
--save_strategy steps \
--save_total_limit 10

I did not add query_instruction_for_retrieval during training here. The official recommendation is to add query_instruction_for_retrieval for retrieval tasks because it gives better results. In my scenario, however, adding query_instruction_for_retrieval made the results about 2% worse.

Today, I added query_instruction_for_retrieval and trained another version. Without adding query_instruction_for_retrieval at inference time, top-50 recall increased by about 3%, while top-100 recall increased by only 0.3%. The overall effect is better, so I will use this setup to train on higher-quality datasets in the future.
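The instruction only ever applies to queries, never to passages, so switching it on or off at inference time is a one-line change. A minimal sketch follows. The helper name `add_instruction` is hypothetical, and the instruction string is the one I understand to be published for the Chinese bge models (the commented-out line in the code above shows it in translation), so treat it as an assumption and check the model card.

```python
# Assumed instruction for the Chinese bge models; verify against the model card.
QUERY_INSTRUCTION = "为这个句子生成表示以用于检索相关文章："


def add_instruction(queries, instruction=""):
    """Prefix each query with the retrieval instruction; passages stay untouched."""
    return [instruction + query for query in queries]


queries = ["儿童做胃镜有哪些方法"]
print(add_instruction(queries, QUERY_INSTRUCTION))  # instruction on
print(add_instruction(queries))                     # instruction off
```

With the default empty instruction the queries are encoded unchanged, which corresponds to the inference setting described above.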

For an explanation of the training parameters, please read the official instructions. To be honest, I think the official tutorials are very detailed.

Key point: batch_size must be large. If GPU memory is not enough, combine it with gradient_accumulation_steps. With a small batch_size the training loss jitters heavily and the results are very poor. For example:

(Figure: training loss with a small batch_size)

The loss within the same epoch jitters heavily, and the result is worse than the base model.

Now look at a large batch_size:

(Figure: training loss with a large batch_size)

Very stable.
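For the training command above, the batch-size arithmetic works out as follows. This is a small sketch with a hypothetical helper name. It relies on the FlagEmbedding convention that train_group_size counts one positive plus the negatives for each query.

```python
def batch_summary(nproc_per_node, per_device_train_batch_size,
                  gradient_accumulation_steps, train_group_size):
    """Derive the effective batch sizes implied by the training flags."""
    per_forward = nproc_per_node * per_device_train_batch_size
    return {
        "queries_per_forward_pass": per_forward,
        "queries_per_optimizer_step": per_forward * gradient_accumulation_steps,
        # One positive per query, so the rest of the group are negatives.
        "negatives_per_query": train_group_size - 1,
    }


print(batch_summary(nproc_per_node=8, per_device_train_batch_size=4,
                    gradient_accumulation_steps=32, train_group_size=6))
```

So each optimizer step sees 8 x 4 x 32 = 1024 queries, each with 5 negatives. One caveat, as far as I understand it: gradient accumulation increases the number of queries per optimizer step, but it does not enlarge the pool of in-batch negatives seen in a single forward pass.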

Next attempts

1. Would it be better to use the fine-tuned embedding model to mine difficult examples and then retrain from the base model? (Like nesting dolls?)
2. Switch negative sampling from threshold-interval sampling to taking the top K relatively more similar samples.

Summary

1. Management is asking for an improvement of more than 10%. Without labeled data, that still feels very difficult.
2. If you have any ideas, please leave a comment so we can discuss them together.