CCKS2019 Yidu Cloud 4k Electronic Medical Record Dataset Named Entity Recognition Based on Python

CCKS2019 Yidu Cloud 4k Electronic Medical Record Dataset Named Entity Recognition
Table of contents
CCKS2019 Yidu Cloud 4k Electronic Medical Record Dataset Named Entity Recognition
Dataset
Project Structure
Requirements
Step
Model
upstream
downstream
Config
Train
Strategy
log
Evaluate
Strategy
Evaluate a single model
Performance
Test set performance
Validation set best F1
Performance on the 379 officially provided test samples
Per-category F1 on the 379 officially provided test samples
Predict
model size
- Small version: two 3090 GPUs (24 GB each). Trained first with unsupervised MLM for 1 million steps (maxlen 512), then with supervised multi-task learning for 750,000 steps (maxlen from 64 to 512, depending on the task); batch_size is 512 and the optimizer is LAMB.
- Base version: four 3090 GPUs (24 GB each). Trained first with unsupervised MLM for 1 million steps (maxlen 512), then with supervised multi-task learning for 750,000 steps (maxlen from 64 to 512, depending on the task); batch_size is 512 and the optimizer is LAMB.
- Large version: two A100 GPUs (80 GB each). Trained first with unsupervised MLM for 1 million steps (maxlen 512), then with supervised multi-task learning for 500,000 steps (maxlen from 64 to 512, depending on the task); batch_size is 512 and the optimizer is LAMB.
Config
- maxlen: maximum sentence length of each batch during training; shorter sentences are padded and longer ones are truncated
- epochs: maximum number of training epochs
- batch_size: batch size
- bert_layers: number of BERT layers to use; small ≤ 4, base ≤ 12
- crf_lr_multiplier: learning-rate multiplier for the CRF layer; increase it if necessary
- model_type: model type, 'roformer_v2'
- dropout_rate: dropout rate
- max_lr: maximum learning rate; the larger bert_layers is, the smaller it should be. Recommended: 5e-5 to 1e-4 for small, 1e-5 to 5e-5 for base
- lstm_hidden_units: number of hidden units in the LSTM layer
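For orientation, a config.py built from the options above might look like the sketch below. Only the option names and the 'roformer_v2' value come from the list; every other value is an illustrative placeholder, not the project's actual setting.

```python
# config.py (sketch). Names follow the option list above.
# All values marked "assumed" are placeholders chosen for illustration only.
maxlen = 256                # assumed; per-batch padding/truncation limit
epochs = 50                 # assumed; upper bound on training epochs
batch_size = 32             # assumed
bert_layers = 4             # small model: at most 4 layers
crf_lr_multiplier = 100     # assumed; scales the CRF layer's learning rate
model_type = 'roformer_v2'  # value given in the option list
dropout_rate = 0.1          # assumed
max_lr = 5e-5               # inside the range recommended for the small model
lstm_hidden_units = 128     # assumed
```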
ATTENTION: Sentences do not all have to be padded to the same length; only the samples within a single batch must share a length. If the longest sentence in a batch is no longer than maxlen, the batch is padded to that longest sentence length. If the longest sentence exceeds maxlen, the batch is padded or truncated to the maxlen set in config.py.
Train
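The per-batch padding rule can be sketched in a few lines of plain Python. The helper name `pad_batch` and the padding value 0 are assumptions made for illustration; the project's own `data_generator` may implement this differently.

```python
def pad_batch(sequences, maxlen, pad_value=0):
    """Pad or truncate every sequence to min(longest in batch, maxlen)."""
    target_len = min(max(len(seq) for seq in sequences), maxlen)
    batch = []
    for seq in sequences:
        trimmed = list(seq[:target_len])                   # truncate if too long
        padding = [pad_value] * (target_len - len(trimmed))  # pad if too short
        batch.append(trimmed + padding)
    return batch


# Longest sentence (3 tokens) is below maxlen, so the batch width is 3.
short_batch = pad_batch([[1, 2, 3], [4, 5]], maxlen=8)
# Longest sentence (12 tokens) exceeds maxlen, so the batch width is 8.
long_batch = pad_batch([[1] * 12, [2] * 5], maxlen=8)

print(short_batch)         # [[1, 2, 3], [4, 5, 0]]
print(len(long_batch[0]))  # 8
```

Padding to the batch maximum rather than to a global maxlen keeps batches of short sentences small, which saves computation.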
Strategy
Partition strategy
Shuffle the 1000 training samples and split them into a training set and a validation set at a ratio of 8:2.
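A minimal sketch of such a split is shown below. The function name `split_train_val`, the fixed seed, and the integer stand-in samples are assumptions for illustration; the repository's own preprocessing may differ.

```python
import random


def split_train_val(samples, train_ratio=0.8, seed=42):
    """Shuffle a copy of the samples, then cut it into train and validation parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


samples = list(range(1000))  # stand-in for the 1000 annotated records
train_part, val_part = split_train_val(samples)
print(len(train_part), len(val_part))  # 800 200
```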
Optimization Strategy
- EMA (exponential moving average) is applied on top of the Adam optimizer. A moving average estimates the local mean of a variable, so each update depends on the variable's values over a recent window. Evaluating with the averaged parameters makes the model more robust on test data. EMA maintains a shadow variable for every trainable variable; the shadow variable starts at the variable's initial value.
- Because BERT already has pre-trained weights, fine-tuning it needs only a small learning rate, while the LSTM and Dense layers, which are initialized with he_normal, need a larger one. The model therefore uses layer-wise learning rates.
- Adversarial training injects perturbations into the Embedding layer to make the model more robust.
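The shadow-variable idea behind EMA can be illustrated with a small standalone sketch. The class name `EMA`, the dict-of-floats parameters, and the decay of 0.9 are assumptions for illustration; in the project the averaging is handled by the optimizer itself.

```python
class EMA:
    """Exponential moving average over a dict of scalar parameters."""

    def __init__(self, params, decay=0.999):
        self.decay = decay
        # Each shadow value starts at the parameter's initial value.
        self.shadow = dict(params)

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current value
        for name, value in params.items():
            self.shadow[name] = (
                self.decay * self.shadow[name] + (1 - self.decay) * value
            )


params = {"w": 0.0}
ema = EMA(params, decay=0.9)
for step in range(1, 4):
    params["w"] = float(step)  # stand-in for one optimizer update
    ema.update(params)

print(round(ema.shadow["w"], 3))  # 0.561
```

The raw weight reaches 3.0 while the shadow value lags at about 0.561, which is the smoothing effect that evaluation with EMA weights relies on.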
Stop strategy
A callback computes the entity-level F1 on the validation set after every epoch and monitors it. Training stops if F1 does not improve for 5 consecutive epochs.

# -*- coding:utf-8 -*-
import os
import pickle
from config import batch_size, maxlen, epochs
from evaluate import get_score
from path import BASE_CONFIG_NAME, BASE_CKPT_NAME, BASE_MODEL_DIR, train_file_path, test_file_path, val_file_path, \
    weights_path, event_type, MODEL_TYPE, label_dict_path
from plot import train_plot, f1_plot

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

from model import BERT
from preprocess import load_data, data_generator, NamedEntityRecognizer
from utils.backend import keras, K
from utils.adversarial import adversarial_training
from utils.tokenizers import Tokenizer

# bert configuration
config_path = BASE_CONFIG_NAME
checkpoint_path = BASE_CKPT_NAME
dict_path = '{}/vocab.txt'.format(BASE_MODEL_DIR)

# label data
categories = set()
train_data = load_data(train_file_path, categories)
val_data = load_data(val_file_path, categories)

categories = list(sorted(categories))

with open(label_dict_path, 'wb') as f:
    pickle.dump(categories, f)

# build tokenizer
tokenizer = Tokenizer(dict_path, do_lower_case = True)

bert = BERT(config_path,
            checkpoint_path,
            categories)

model = bert.get_model()
optimizer = bert.get_optimizer()
CRF = bert.get_CRF()
NER = NamedEntityRecognizer(tokenizer, model, categories, trans = K.eval(CRF.trans), starts = [0], ends = [0])

adversarial_training(model, 'Embedding-Token', 0.5)

f1_list = []
recall_list = []
precision_list = []
count_model_did_not_improve = 0


class Evaluator(keras.callbacks.Callback):
    """Evaluate and save
    """
    
    def __init__(self, patience = 5):
        super().__init__()
        self.best_val_f1 = 0
        self.patience = patience
    
    def on_epoch_end(self, epoch, logs = None):
        global count_model_did_not_improve
        save_file_path = "{}/{}_{}_base.h5".format(weights_path, event_type, MODEL_TYPE)
        trans_file_path = "{}/{}_{}_crf_trans.pkl".format(weights_path, event_type, MODEL_TYPE)
        # refresh the decoder with the current CRF transition matrix
        trans = K.eval(CRF.trans)
        NER.trans = trans
        # switch to the EMA weights for evaluation and saving
        optimizer.apply_ema_weights()
        f1, precision, recall = get_score(val_data, NER)
        f1_list.append(f1)
        recall_list.append(recall)
        precision_list.append(precision)
        # save the best
        if f1 >= self.best_val_f1:
            self.best_val_f1 = f1
            model.save_weights(save_file_path)
            with open(trans_file_path, 'wb') as f:
                pickle.dump(K.eval(CRF.trans), f)
            count_model_did_not_improve = 0
        else:
            count_model_did_not_improve += 1
            print("Early stop count " + str(count_model_did_not_improve) + "/" + str(self.patience))
            if count_model_did_not_improve >= self.patience:
                self.model.stop_training = True
                print("Epoch %d: early stopping" % epoch)
        # switch back to the raw training weights before the next epoch
        optimizer.reset_old_weights()
        print(
            'valid: f1: %.5f, precision: %.5f, recall: %.5f, best f1: %.5f\n' %
            (f1, precision, recall, self.best_val_f1)
        )


train_generator = data_generator(train_data, batch_size, tokenizer, categories, maxlen)
valid_generator = data_generator(val_data, batch_size, tokenizer, categories, maxlen)
# test_generator = data_generator(test_data, batch_size, tokenizer, categories, maxlen)

for i, item in enumerate(train_generator):
    print("\nbatch_token_ids shape:", item[0][0].shape)
    print("batch_segment_ids shape:", item[0][1].shape)
    print("batch_labels shape:", item[1].shape)
    if i == 4:
        break
# batch_token_ids: (32, maxlen) or (32, n), n <= maxlen
# batch_segment_ids: (32, maxlen) or (32, n), n <= maxlen
# batch_labels: (32, maxlen) or (32, n), n <= maxlen

evaluator = Evaluator(patience = 5)

print('\n\t\tTrain start!\t\t\n')

history = model.fit(
    train_generator.forfit(),
    steps_per_epoch = len(train_generator),
    epochs = epochs,
    verbose = 1,
    callbacks = [evaluator]
)

print('\n\tTrain end!\t\n')

train_plot(history.history, history.epoch)
data = {
    'epoch': range(1, len(f1_list) + 1),
    'f1': f1_list,
    'recall': recall_list,
    'precision': precision_list
}

f1_plot(data)