2-5 (Extra) Word Vectors: Word2Vec Code in Practice

Table of Contents

1 Statement:

2 Code link:

3 Reference links:

4 Practical steps:

4.1 Data selection:

4.2 Data preprocessing and word segmentation:

4.2.1 Data preprocessing:

4.2.2 Why is word segmentation necessary?

4.2.3 Word segmentation:

4.3 Model training:

4.4 Visualization:

4.4.1 PCA dimensionality reduction:

4.4.2 Draw a starry sky map:

4.5 Analogical relationship experiment:


1 Statement:

The Tsinghua University course this series follows does not have a separate video on Word2Vec, so I briefly practiced the Word2Vec code by following a tutorial shared by an uploader (UP主) on Bilibili.

2 Code link:

File link: https://pan.baidu.com/s/1swwl8FD6ITO0mIEwSyTGCw Extraction code: nnnn

3 Reference link:

Word vector | word2vec | Theoretical explanation + code | Text analysis [python-gensim]

4 Practical steps:

4.1 Data Selection:

Here we select the first volume of “Romance of the Three Kingdoms” for Word2Vec training.

4.2 Data preprocessing and word segmentation:
4.2.1 Data Preprocessing:

In real scenarios, punctuation marks, abnormal characters, and stop words need to be removed from the text to obtain relatively clean data. Here we simply strip punctuation marks; readers who need it can add a stop-word list themselves and clean the data at a finer granularity.
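As one possible sketch of that finer-grained cleaning (the stop-word file name below is a placeholder, not part of the original code), stop words can be filtered out of a token list like this:

# Hypothetical example: filter stop words from a list of tokens
# "stopwords.txt" is a placeholder; any Chinese stop-word list with one word per line will do
with open("stopwords.txt", encoding="utf-8") as f:
    stopwords = set(line.strip() for line in f if line.strip())

def remove_stopwords(tokens):
    # Drop empty tokens and tokens that appear in the stop-word list
    return [t for t in tokens if t and t not in stopwords]

Such a filter can be applied to the word lists produced in the segmentation step below.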

4.2.2 Why is word segmentation necessary?
  • Alleviating ambiguity: for example, “哈哈” (haha) and “哈士奇” (husky) share the character “哈” but have completely different meanings. If the model works at the level of individual characters and has never seen “哈士奇” during training, it may mistake it for a cheerful interjection rather than an animal entity.
  • Providing higher-level text features to the model: a word sequence carries entities, intents, and other information, so its information content is usually richer than that of the raw character sequence. Good word segmentation therefore supplies the NLP model with better-quality features, upgrading its understanding of the text from a “character sequence” to a “word sequence”.
4.2.3 Word segmentation:

The jieba library is an excellent third-party Chinese word-segmentation library for Python. It supports three segmentation modes: precise mode, full mode, and search-engine mode. This article uses jieba to segment the Chinese text.

import jieba

# Precise mode
jieba.lcut(text, cut_all=False)

# Full mode
jieba.lcut(text, cut_all=True)

# Search engine mode
jieba.lcut_for_search(text)

The differences between the three modes (segmentation methods) are as shown in the figure below:
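For a concrete comparison, the short snippet below (a minimal sketch; the example sentence is an arbitrary choice, not from the original post) prints the output of all three modes on the same sentence:

import jieba

sentence = "我来到北京清华大学"  # arbitrary example sentence

print("Precise mode:      ", jieba.lcut(sentence, cut_all=False))
print("Full mode:         ", jieba.lcut(sentence, cut_all=True))
print("Search engine mode:", jieba.lcut_for_search(sentence))

Roughly speaking, precise mode returns a non-overlapping segmentation, full mode lists every word it can find (with overlaps), and search-engine mode further splits long words on top of the precise result.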

# Word segmentation
import re
import jieba

f = open("sanguo.txt", 'r', encoding='utf-8')  # Read the text
lines = []
for line in f:  # Segment each paragraph separately
    temp = jieba.lcut(line)  # jieba precise-mode segmentation
    words = []
    for i in temp:
        # Filter out punctuation marks
        i = re.sub(r"[\s+\.\!\/_,$%^*(+\"'“”《》]+|[+——！，。？、~@#￥%……&*（）：；‘]+", "", i)
        if len(i) > 0:
            words.append(i)
    if len(words) > 0:
        lines.append(words)
f.close()
print(lines[0:5])  # Preview the segmentation result of the first 5 lines

The word segmentation results of the first five lines are as shown below:

4.3 Model training:

The complete Word2Vec model is encapsulated in the gensim library, so we can call it directly. The meanings of its parameters are listed in the following table:

Parameter | Description
sentences | The training corpus; can be a list of tokenized sentences. For a large corpus, it is recommended to build it with BrownCorpus, Text8Corpus, or LineSentence.
vector_size | Dimensionality of the word vectors, default 100. Larger values require more training data but give better results; values in the tens to hundreds are recommended.
alpha | Initial learning rate.
window | Maximum distance between the current word and the predicted word within a sentence.
min_count | Truncates the vocabulary; words with frequency lower than min_count are discarded. Default 5.
max_vocab_size | RAM limit while building the vocabulary; if the number of unique words exceeds it, the least frequent ones are pruned. Roughly 1 GB of RAM is needed per 10 million words. Set to None for no limit.
sample | Threshold for randomly downsampling high-frequency words; default 1e-3, useful range (0, 1e-5).
seed | Seed for the random number generator, used when initializing the word vectors.
workers | Number of worker threads used for parallel training.
sg | Training algorithm: 0 (default) selects CBOW, 1 selects skip-gram.
hs | If 1, hierarchical softmax is used; if 0 (default), negative sampling is used.
negative | If greater than 0, negative sampling is used and the value sets the number of noise words.
cbow_mean | If 0, the sum of the context word vectors is used; if 1 (default), their mean is used. Only applies to CBOW.
hashfxn | Hash function used to initialize the weights; Python's built-in hash function by default.
epochs | Number of training iterations, default 5.
trim_rule | Vocabulary trimming rule specifying which words to keep and which to discard. Can be None (min_count is used) or a callable that accepts (word, count, min_count) and returns utils.RULE_DISCARD, utils.RULE_KEEP, or utils.RULE_DEFAULT.
sorted_vocab | If 1 (default), words are sorted by descending frequency before word indices are assigned.
batch_words | Number of words passed to worker threads per batch, default 10000.
min_alpha | The learning rate decreases linearly to min_alpha as training proceeds.

This table is reproduced from the CSDN blog post “gensim.models.word2vec() parameter details”.

# Call Word2Vec to train the model
# Parameters: vector_size: dimensionality of the word vectors; window: context window width; min_count: minimum word frequency for a word to be kept
from gensim.models import Word2Vec

model = Word2Vec(lines, vector_size=20, window=2, min_count=3, epochs=7, negative=10)
# Query words must be the Chinese tokens that actually occur in the corpus, e.g. '孔明' (Kong Ming)
print("Kong Ming's word vector:\n", model.wv.get_vector('孔明'))
print("\nThe 20 words most related to Kong Ming:")
similar_words = model.wv.most_similar('孔明', topn=20)  # the 20 words most similar to 孔明
for word, similarity in similar_words:
    print(word, similarity)

Kong Ming’s word vector:

The top 20 words most related to Kong Ming:
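For comparison, the call above uses the default CBOW algorithm (sg=0). Switching to skip-gram only requires changing the sg flag; a minimal sketch (the hyperparameter values here are illustrative, not tuned):

from gensim.models import Word2Vec

# Skip-gram (sg=1) with negative sampling; hyperparameters are illustrative only
sg_model = Word2Vec(
    lines,          # the tokenized corpus built in the segmentation step
    vector_size=20,
    window=2,
    min_count=3,
    sg=1,           # 1 = skip-gram, 0 (default) = CBOW
    hs=0,           # 0 = negative sampling instead of hierarchical softmax
    negative=10,    # number of noise words for negative sampling
    epochs=7,
)
print(sg_model.wv.most_similar('孔明', topn=5))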

4.4 Visualization:
4.4.1 PCA dimensionality reduction:

Since the word-vector dimension we set earlier is 20, the vectors cannot be visualized directly and need to be reduced in dimensionality. This article uses PCA (Principal Component Analysis) for dimensionality reduction:

# Project the word vectors into two-dimensional space
import numpy as np
from sklearn.decomposition import PCA

rawWordVec = []
word2ind = {}
for i, w in enumerate(model.wv.index_to_key):  # index_to_key maps index -> word
    rawWordVec.append(model.wv[w])  # word vector
    word2ind[w] = i  # {word: index}
rawWordVec = np.array(rawWordVec)
X_reduced = PCA(n_components=2).fit_transform(rawWordVec)

# 20 dimensions before dimensionality reduction
print("Word vector before dimensionality reduction:")
print(rawWordVec)

# 2 dimensions after dimensionality reduction
print("\nWord vector after dimensionality reduction:")
print(X_reduced)

Word vector before dimensionality reduction (20 dimensions):

Word vector after dimensionality reduction (2 dimensions):
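As an optional sanity check (not in the original post), sklearn can report how much of the original 20-dimensional variance survives in the 2-dimensional projection:

from sklearn.decomposition import PCA

# Fit PCA explicitly so the explained variance ratio can be inspected
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(rawWordVec)
print(pca.explained_variance_ratio_)  # fraction of variance kept by each of the two components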

4.4.2 Draw a starry sky map:
# Draw a "starry sky" map
# Plot the two-dimensional projection of all word vectors
import matplotlib
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(15, 10))
ax = fig.gca()
ax.set_facecolor('white')
ax.plot(X_reduced[:, 0], X_reduced[:, 1], '.', markersize=1, alpha=0.3, color='black')

# Highlight the vectors of a few special words
# (Sun Quan, Liu Bei, Cao Cao, Zhou Yu, Zhuge Liang, Sima Yi, Emperor Xian of Han)
words = ['孙权', '刘备', '曹操', '周瑜', '诸葛亮', '司马懿', '汉献帝']

# Set a Chinese font, otherwise the labels show up as garbled characters
zhfont1 = matplotlib.font_manager.FontProperties(fname='./仿宋.ttf', size=16)  # any local Chinese .ttf works
for w in words:
    if w in word2ind:
        ind = word2ind[w]
        xy = X_reduced[ind]
        plt.plot(xy[0], xy[1], '.', alpha=1, color='orange', markersize=10)
        plt.text(xy[0], xy[1], w, fontproperties=zhfont1, alpha=1, color='red')
plt.show()

The black scatter points represent all the words. From the highlighted word vectors we can see that Liu Bei and Zhuge Liang are relatively close together, and the other related pairs behave similarly, which suggests that the model was trained reasonably well.
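The visual impression can also be cross-checked numerically with gensim's cosine similarity; a small sketch (the word pairs are chosen here purely as examples):

# Cosine similarity between selected pairs (higher means closer in the embedding space)
print(model.wv.similarity('刘备', '诸葛亮'))  # Liu Bei vs. Zhuge Liang
print(model.wv.similarity('刘备', '司马懿'))  # Liu Bei vs. Sima Yi, for contrast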

4.5 Analogical relationship experiment:
# Analogy experiments

# Xuande (玄德) - Kong Ming (孔明) = ? - Cao Cao (曹操)
words = model.wv.most_similar(positive=['玄德', '曹操'], negative=['孔明'])
print(words)

# Cao Cao (曹操) - Wei (魏) = ? - Shu (蜀)
words = model.wv.most_similar(positive=['曹操', '蜀'], negative=['魏'])
print(words)

The final results are shown below (ha, the first analogy may not be all that convincing).
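Under the hood, most_similar with positive and negative lists is essentially vector arithmetic on the word embeddings (gensim normalizes the vectors and filters out the query words, so results can differ slightly). The first analogy can be approximated by hand like this:

# ? = vec(玄德) - vec(孔明) + vec(曹操)
vec = model.wv['玄德'] - model.wv['孔明'] + model.wv['曹操']
print(model.wv.similar_by_vector(vec, topn=5))  # note: the query words themselves may appear in the results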