Probabilistic Graphical Models 5: Conditional Random Field (CRF) for Named Entity Recognition and Part-of-Speech Tagging
- 1. Data download
- 2. Dataset loading
- 3. Corpus Introduction
- 4. Text data feature processing
- 5. Text data feature extraction
- 6. Conditional random field CRF modeling
- 7. Training
- 8. Prediction
1. Data download
```python
import nltk              # Natural Language Toolkit  (pip install nltk)
import sklearn_crfsuite  # conditional random field library  (pip install sklearn_crfsuite)
from sklearn_crfsuite import metrics
import ssl

# Network setting: skip SSL certificate verification so the download succeeds
ssl._create_default_https_context = ssl._create_unverified_context

# Either download and unzip the corpus manually (e.g. from Baidu Netdisk) into
# C:\Users\likai\nltk_data, or let NLTK fetch it
# (download path: C:\Users\likai\AppData\Roaming\nltk_data):
# nltk.download('conll2002')

nltk.corpus.conll2002.fileids()  # fileids: list the corpus files
```

```
['esp.testa', 'esp.testb', 'esp.train', 'ned.testa', 'ned.testb', 'ned.train']
```
2. Dataset loading
```python
%%time
# iob_sents: each sentence is a list of (token, POS tag, IOB entity tag) tuples
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))
print('Training data length:', len(train_sents))
print('Test data length:', len(test_sents))
```

```
Training data length: 8323
Test data length: 1517
Wall time: 9.04 s
```
3. Corpus Introduction
(1) General Corpus: CoNLL2002
(2) Language: Spanish
(3) Training set: 8323 sentences
(4) Test set: 1517 sentences
(5) Tag set: the annotation labels are ['O', 'B-MISC', 'I-MISC', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC']. Entities fall into four categories: PER (person), LOC (location), ORG (organization), and MISC (miscellaneous); B marks the beginning of an entity, I marks a token inside an entity, and O marks a token outside any entity
(6) Corpus format: each row has three columns, giving the token, its part-of-speech tag, and its entity tag; the corpus uses the BIO annotation scheme from the Bakeoff-3 evaluation
- B-PER, I-PER: the first token of a person name, and subsequent tokens of a person name
- B-LOC, I-LOC: the first token of a place name, and subsequent tokens of a place name
- B-ORG, I-ORG: the first token of an organization name, and subsequent tokens of an organization name
- O: the token is not part of any named entity. Each token is annotated as a triple, e.g. ('Australia', 'NP', 'B-LOC')
(7) Parts of speech: articles, pronouns, verbs, nouns, etc. For example, NP denotes a proper noun and PP a past participle
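To make the BIO convention above concrete, here is a minimal, standalone sketch (the helper name `bio_to_spans` is ours, not part of the corpus tooling) that groups a BIO-tagged token sequence into entity spans:

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):           # B-: a new entity begins
            if current_tokens:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith('I-') and current_type == tag[2:]:
            current_tokens.append(token)   # I-: continue the current entity
        else:                              # O (or an inconsistent I- tag)
            if current_tokens:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:                     # flush a trailing entity
        spans.append((current_type, ' '.join(current_tokens)))
    return spans

# The first training sentence from the corpus:
tokens = ['Melbourne', '(', 'Australia', ')', ',', '25', 'may', '(', 'EFE', ')', '.']
tags   = ['B-LOC', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'O']
print(bio_to_spans(tokens, tags))
# [('LOC', 'Melbourne'), ('LOC', 'Australia'), ('ORG', 'EFE')]
```

This is also why B- tags matter: without them, two adjacent entities of the same type could not be told apart from one longer entity.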
```python
display(train_sents[0])
display(test_sents[0])  # (token, part of speech, entity tag)
```

```
[('Melbourne', 'NP', 'B-LOC'), ('(', 'Fpa', 'O'), ('Australia', 'NP', 'B-LOC'), (')', 'Fpt', 'O'), (',', 'Fc', 'O'), ('25', 'Z', 'O'), ('may', 'NC', 'O'), ('(', 'Fpa', 'O'), ('EFE', 'NC', 'B-ORG'), (')', 'Fpt', 'O'), ('.', 'Fp', 'O')]
[('La', 'DA', 'B-LOC'), ('Coruña', 'NC', 'I-LOC'), (',', 'Fc', 'O'), ('23', 'Z', 'O'), ('may', 'NC', 'O'), ('(', 'Fpa', 'O'), ('EFECOM', 'NP', 'B-ORG'), (')', 'Fpt', 'O'), ('.', 'Fp', 'O')]
```
4. Text data feature processing
The following features are extracted for each word:
- (1) The lowercase form of the current word
- (2) The suffixes of the current word (last three and last two characters)
- (3) Whether the current word is all uppercase (isupper)
- (4) Whether the current word is title-cased, i.e. first letter uppercase and the rest lowercase (istitle)
- (5) Whether the current word is a number (isdigit)
- (6) The part-of-speech tag of the current word
- (7) The prefix of the current word's part-of-speech tag
```python
def word2features(sent, i):
    word = sent[i][0]    # token
    postag = sent[i][1]  # part-of-speech tag
    features = {  # features of the current word
        'word.lower()': word.lower(),      # lowercase form
        'word[-3:]': word[-3:],            # suffix: last three characters
        'word[-2:]': word[-2:],            # suffix: last two characters
        'word.isupper()': word.isupper(),  # all uppercase?
        'word.istitle()': word.istitle(),  # title-cased?
        'word.isdigit()': word.isdigit(),  # a number?
        'postag': postag,                  # part-of-speech tag
        'postag[:2]': postag[:2],          # part-of-speech tag prefix
    }
    if i > 0:  # features of the previous word
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True  # beginning of sentence
    if i < len(sent) - 1:  # features of the next word
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True  # end of sentence
    return features  # feature dictionary for one word

def sent2features(sent):  # feature extraction for a whole sentence
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):    # entity labels for a whole sentence
    # each token is a (token, postag, label) triple
    return [label for token, postag, label in sent]
```
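As a quick sanity check on the shape of the feature dictionaries the CRF will consume, here is a trimmed-down, standalone version of the extractor (only a subset of the features, with a toy two-token sentence hard-coded; `simple_features` is our illustrative name, not part of the tutorial's code):

```python
def simple_features(sent, i):
    # Reduced word2features: current-word features plus sentence-boundary flags
    word, postag = sent[i][0], sent[i][1]
    feats = {
        'word.lower()': word.lower(),
        'word.istitle()': word.istitle(),
        'postag': postag,
    }
    if i == 0:
        feats['BOS'] = True              # beginning of sentence
    if i == len(sent) - 1:
        feats['EOS'] = True              # end of sentence
    return feats

sent = [('Melbourne', 'NP', 'B-LOC'), ('(', 'Fpa', 'O')]
print([simple_features(sent, i) for i in range(len(sent))])
# [{'word.lower()': 'melbourne', 'word.istitle()': True, 'postag': 'NP', 'BOS': True},
#  {'word.lower()': '(', 'word.istitle()': False, 'postag': 'Fpa', 'EOS': True}]
```

Note that the dictionary keys are arbitrary strings; sklearn_crfsuite treats each key-value pair as one feature, so the names only need to be consistent across tokens.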
5. Text data feature extraction
```python
%%time
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]
X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]
display(X_train[0], y_train[0])
```

```
[…,
 {'word.lower()': '.', 'word[-3:]': '.', 'word[-2:]': '.',
  'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False,
  'postag': 'Fp', 'postag[:2]': 'Fp',
  '-1:word.lower()': ')', '-1:word.istitle()': False, '-1:word.isupper()': False,
  '-1:postag': 'Fpt', '-1:postag[:2]': 'Fp',
  'EOS': True}]
['B-LOC', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'O']
Wall time: 52.3 s
```
6. Conditional random field CRF modeling
```python
%%time
crf = sklearn_crfsuite.CRF(
    c1=0.1,              # L1 regularization coefficient
    c2=0.1,              # L2 regularization coefficient
    max_iterations=100,  # maximum number of optimization iterations
    all_possible_transitions=True,
)
# all_possible_transitions: whether the CRF also generates transition features that
# never occur in the training data (i.e. negative transition features). If True,
# a transition feature is created for every possible pair of labels.
```
7. Training
```python
crf.fit(X_train, y_train)
```
8. Prediction
```python
y_pred = crf.predict(X_test)
print('CRF named-entity tagging accuracy:', crf.score(X_test, y_test))
display(y_pred[:2], y_test[:2])
```

```
CRF named-entity tagging accuracy: 0.971455184056818
[['B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'O'], ['O']]
[['B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'O'], ['O']]
```
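The score reported by `crf.score` is flat per-token accuracy: the fraction of individual tags predicted correctly across all sentences. A minimal sketch of that computation in pure Python (the helper name `flat_accuracy` is ours), using tiny hand-made label lists:

```python
def flat_accuracy(y_true, y_pred):
    # Flatten the per-sentence label lists and count matching tags
    correct = total = 0
    for true_sent, pred_sent in zip(y_true, y_pred):
        for t, p in zip(true_sent, pred_sent):
            correct += (t == p)
            total += 1
    return correct / total

y_true = [['B-LOC', 'I-LOC', 'O'], ['O', 'B-ORG']]
y_pred = [['B-LOC', 'O',     'O'], ['O', 'B-ORG']]
print(flat_accuracy(y_true, y_pred))  # 0.8 (4 of 5 tags correct)
```

Because 'O' tokens dominate the corpus, per-token accuracy can look flattering; the imported `sklearn_crfsuite.metrics` module (e.g. `metrics.flat_classification_report`) gives per-label precision and recall for a more informative evaluation.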