Probabilistic Graphical Models 5: Conditional Random Fields (CRF) for Named Entity Recognition and Part-of-Speech Classification

  • 1. Data download
  • 2. Dataset loading
  • 3. Corpus Introduction
  • 4. Text data feature processing
  • 5. Text data feature extraction
  • 6. Conditional random field CRF modeling
  • 7. Training
  • 8. Prediction

1. Data download

import nltk # Natural Language Toolkit (pip install nltk)
import sklearn_crfsuite # conditional random field library for Python (pip install sklearn_crfsuite)
from sklearn_crfsuite import metrics
import ssl # network settings: skip SSL certificate verification for the download
ssl._create_default_https_context = ssl._create_unverified_context
# Either download the archive from Baidu Netdisk and unzip it into the nltk data
# directory, e.g. C:\Users\likai\nltk_data, or uncomment the next line:
# nltk.download('conll2002') # downloads to e.g. C:\Users\likai\AppData\Roaming\nltk_data
nltk.corpus.conll2002.fileids() # fileids: list the available corpus files
['esp.testa', 'esp.testb', 'esp.train', 'ned.testa', 'ned.testb', 'ned.train']
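
If the corpus is unzipped somewhere other than nltk's default search path, nltk can be pointed at the directory explicitly. A minimal sketch; the directory name below is illustrative, not from the original:

import nltk
# Add a custom data directory to nltk's search path (the path is illustrative)
nltk.data.path.append(r'D:\nltk_data')
# ... or download straight into it:
# nltk.download('conll2002', download_dir=r'D:\nltk_data')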

2. Dataset loading

%%time
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train')) # iob_sents: sentences as lists of (token, POS tag, IOB label) tuples
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb')) # conll2002: the CoNLL-2002 corpus
print('Training data length:', len(train_sents))
print('Test data length:', len(test_sents))
Training data length: 8323
Test data length: 1517
Wall time: 9.04 s

3. Corpus Introduction

(1) Corpus: CoNLL-2002
(2) Language: Spanish
(3) Training set: 8323 sentences
(4) Test set: 1517 sentences
(5) Tag set: the full annotation list is ['O', 'B-MISC', 'I-MISC', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC']. Entities fall into four categories: PER (person names), LOC (locations), ORG (organizations) and MISC (miscellaneous); B marks the first token of an entity, I marks a token inside an entity, and O marks a token outside any entity
(6) Corpus format: the three columns are token, part of speech, and entity type, using the BIO annotation scheme from the Bakeoff-3 evaluation

  • B-PER, I-PER: the first token of a person name, and a non-first token of a person name
  • B-LOC, I-LOC: the first token of a place name, and a non-first token of a place name
  • B-ORG, I-ORG: the first token of an organization name, and a non-first token of an organization name
  • O: the token is not part of any named entity, e.g. ('25', 'Z', 'O')

(7) Part of speech: articles, pronouns, verbs, nouns, etc. For example, NP means proper noun and PP means past participle.

display(train_sents[0])
[('Melbourne', 'NP', 'B-LOC'),
 ('(', 'Fpa', 'O'),
 ('Australia', 'NP', 'B-LOC'),
 (')', 'Fpt', 'O'),
 (',', 'Fc', 'O'),
 ('25', 'Z', 'O'),
 ('may', 'NC', 'O'),
 ('(', 'Fpa', 'O'),
 ('EFE', 'NC', 'B-ORG'),
 (')', 'Fpt', 'O'),
 ('.', 'Fp', 'O')]
 
display(test_sents[0]) # (token, part of speech, entity label)
[('La', 'DA', 'B-LOC'),
 ('Coruña', 'NC', 'I-LOC'),
 (',', 'Fc', 'O'),
 ('23', 'Z', 'O'),
 ('may', 'NC', 'O'),
 ('(', 'Fpa', 'O'),
 ('EFECOM', 'NP', 'B-ORG'),
 (')', 'Fpt', 'O'),
 ('.', 'Fp', 'O')]
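
To see how these tags are distributed, here is a minimal counting sketch (not part of the original), assuming train_sents is loaded as in section 2:

from collections import Counter
# Count the entity labels over all training sentences
label_counts = Counter(label for sent in train_sents for token, postag, label in sent)
print(label_counts.most_common()) # 'O' dominates; B-*/I-* tags mark entity spans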

4. Text data feature processing

The following features are selected for each word (a quick demonstration follows the code below):

  • (1) Lowercase form of the current word
  • (2) Suffixes (last 3 and last 2 characters) of the current word
  • (3) Whether the current word is all uppercase (isupper)
  • (4) Whether the current word is title-cased, i.e. first letter uppercase and the rest lowercase (istitle)
  • (5) Whether the current word is a digit (isdigit)
  • (6) The part of speech of the current word
  • (7) The part-of-speech prefix (first 2 characters) of the current word

The same lowercase, istitle, isupper and part-of-speech features are also extracted for the previous and next words, with BOS/EOS flags marking sentence boundaries.

def word2features(sent, i):
    word = sent[i][0]   # token
    postag = sent[i][1] # part-of-speech tag
    features = { # features of the current word
        'word.lower()': word.lower(),     # lowercase form
        'word[-3:]': word[-3:],           # last 3 characters (suffix)
        'word[-2:]': word[-2:],           # last 2 characters (suffix)
        'word.isupper()': word.isupper(), # all uppercase?
        'word.istitle()': word.istitle(), # title-cased: first letter uppercase, rest lowercase
        'word.isdigit()': word.isdigit(), # a digit?
        'postag': postag,                 # part of speech
        'postag[:2]': postag[:2]}         # part-of-speech prefix
    if i > 0: # features of the previous word
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2]})
    else:
        features['BOS'] = True # beginning of sentence
    if i < len(sent)-1: # features of the next word
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2]})
    else:
        features['EOS'] = True # end of sentence
    return features # feature dict for the word at position i
def sent2features(sent): # extract features for every word in a sentence
    return [word2features(sent, i) for i in range(len(sent))]
def sent2labels(sent): # extract the entity label of every word in a sentence
    return [label for token, postag, label in sent] # token, postag, label: token, part of speech, entity label
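
As a quick sanity check, the sketch below (assuming the functions above have been run) prints a few features for the first token of the first training sentence:

features = word2features(train_sents[0], 0) # features of 'Melbourne'
print(features['word.lower()'])   # 'melbourne'
print(features['word.istitle()']) # True: title-cased
print(features['BOS'])            # True: first token of the sentence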

5. Text data feature extraction

%%time
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]
X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]
display(X_train[0],y_train[0])
[…
 {'word.lower()': '.',
  'word[-3:]': '.',
  'word[-2:]': '.',
  'word.isupper()': False,
  'word.istitle()': False,
  'word.isdigit()': False,
  'postag': 'Fp',
  'postag[:2]': 'Fp',
  '-1:word.lower()': ')',
  '-1:word.istitle()': False,
  '-1:word.isupper()': False,
  '-1:postag': 'Fpt',
  '-1:postag[:2]': 'Fp',
  'EOS': True}]
['B-LOC', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'O']
Wall time: 52.3 s

6. Conditional random field CRF modeling

%%time
crf = sklearn_crfsuite.CRF(
    c1=0.1, # L1 regularization coefficient
    c2=0.1, # L2 regularization coefficient
    max_iterations=100, # maximum number of optimizer iterations (the default training algorithm is L-BFGS)
    all_possible_transitions=True)
# all_possible_transitions: whether the CRF also generates transition features that never
# appear in the training data (i.e. negative transition features). If True, the CRF creates
# transition features for all possible label pairs

7. Training

crf.fit(X_train, y_train)
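
After fitting, the learned weights can be inspected. A short sketch (not in the original) using crf.transition_features_, a dict that maps (label_from, label_to) pairs to their learned weights:

from collections import Counter
# Show the five strongest label-to-label transitions learned by the model
for (label_from, label_to), weight in Counter(crf.transition_features_).most_common(5):
    print('%s -> %s: %.3f' % (label_from, label_to, weight))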

8. Prediction

y_pred = crf.predict(X_test)
print('CRF named-entity recognition accuracy:', crf.score(X_test, y_test))
display(y_pred[:2], y_test[:2])
CRF named-entity recognition accuracy: 0.971455184056818
[['B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'O'], ['O']]
[['B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'O'], ['O']]
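
The metrics module imported in section 1 gives a per-label breakdown rather than a single accuracy number. A sketch, assuming y_pred and y_test from above:

# Per-label precision/recall/F1; 'O' is excluded so the report focuses on entity classes
labels = [l for l in crf.classes_ if l != 'O']
print(metrics.flat_classification_report(y_test, y_pred, labels=labels, digits=3))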