word2vec improvement: hierarchical softmax

Table of Contents

Summary

Overall process

1. Construct Huffman tree

2. Combine CBOW and Huffman trees

3. Build word2vec model (CBOW)

Summary

Complete code


Summary

This blog post describes in detail the specific process and code implementation of word2vec using hierarchical softmax. For the basic implementation of word2vec, you can read my previous post:

[Study Notes] Word2vec of Handwritten Neural Network_Good-for-Nothingle’s Blog-CSDN Blog

Overall process

The overall process is basically the same as for basic word2vec, except that the part that uses softmax to compute the loss is improved. You may wonder why an improvement is needed. The reason is simple: softmax is computationally expensive. Once the embedding dimension gets a bit larger and the window grows, training time lengthens dramatically, because softmax over the whole vocabulary requires a large number of exponential operations and is very slow. The essence of the hierarchical softmax used here is to construct a Huffman tree instead of using softmax to compute the multi-class loss. As shown below:

All we need to do is replace that softmax with a Huffman tree. The Huffman tree is shown below:

The main problem with replacing softmax by a Huffman tree is how to fit the Huffman tree into word2vec in place of the multi-class cross-entropy loss. I think this step is the most important one, followed by how to construct the Huffman tree. (I have read many articles, and most of them do not explain in an easy-to-understand way how to integrate the Huffman tree into word2vec; instead they spend a lot of space on how to build Huffman trees and on the hierarchical softmax and gradient formulas in word2vec, which I think puts the cart before the horse, and it took me a long time to understand.) The key is that at each non-leaf node, a sigmoid of the node's parameter vector and the word vector decides whether to go left or right in the Huffman tree, and these values are combined with the true path to compute the loss. In the figure below I roughly sketch the overall process by hand (taking CBOW as an example):

Compared with the earlier neural network language model, all the internal nodes of our Huffman tree play a role similar to the hidden-layer neurons of that network; the word vector at the root node corresponds to our projected word vector, and all the leaf nodes play the role of the neurons in the softmax output layer, the number of leaf nodes being the vocabulary size. In the Huffman tree, the softmax mapping from the hidden layer to the output layer is not completed in one step but step by step along the Huffman tree, which is why it is named “hierarchical softmax”.

How is it “completed step by step along the Huffman tree”? In word2vec we use binary logistic regression: walking into the left subtree is the negative class (Huffman code 1), and walking into the right subtree is the positive class (Huffman code 0). Positive and negative classes are distinguished with the sigmoid function, that is:

P(+)=\sigma (x_{w}\theta^{T})=\frac{1}{1 + e^{-x_{w}\theta^{T}}}

where x_{w} is the word vector of the current internal node, and θ is the model parameter of the logistic regression that we need to find from the training samples.

What are the benefits of using a Huffman tree? First, since it is a binary tree, the amount of computation per target word drops from V to roughly log_{2}V. Second, because the Huffman tree places high-frequency words close to the root, frequent words are reached with even shorter paths, which matches our greedy optimization idea.
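
For a rough sense of the savings, a back-of-the-envelope sketch (the vocabulary size here is made up):

import math

V = 10000                                    # hypothetical vocabulary size
flat_softmax_terms = V                       # a flat softmax evaluates one exponential per vocabulary word
huffman_decisions = math.ceil(math.log2(V))  # roughly 14 binary sigmoid decisions along a balanced tree
print(flat_softmax_terms, huffman_decisions)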

It is easy to see that the probability of being sent to the left subtree, i.e. the negative class, is P(-)=1-P(+). At a given internal node, the criterion for deciding whether to follow the left or the right subtree is which of P(-) and P(+) is larger. The value of P(-) or P(+) is controlled by two factors: the word vector x_{w} at the current node, and the model parameter θ of the current node.
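
A minimal sketch of this per-node decision, using hypothetical 20-dimensional vectors (the same embedding size the code later uses):

import numpy as np

def sigmoid(z):
    # the same function the code below calls "softmax"
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical vectors, only for illustration
x_w = np.random.randn(20)      # projected (averaged) context vector
theta = np.random.randn(20)    # parameter vector of one internal node

p_pos = sigmoid(np.dot(x_w, theta))   # P(+): step into the right subtree (Huffman code 0)
p_neg = 1.0 - p_pos                   # P(-): step into the left subtree (Huffman code 1)
print(p_pos, p_neg)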

For the example in the figure above, if it is the output of a training sample, then we expect P(+) to be large at the internal nodes n(w,1), n(w,3) and n(w,5) along its path; that is, the final target path is 011.

Coming back to word2vec itself based on hierarchical softmax, our goal is to find suitable word vectors for all words and suitable θ for all internal nodes, so that the likelihood of the training samples is maximized.

Detailed process:

1. Construct Huffman tree

The essence of constructing a Huffman tree is to use a binary tree to obtain optimal paths. Here we build the binary tree directly from word frequencies: the higher a word's frequency, the closer it sits to the root node, so the total number of nodes that need to be traversed over the whole vocabulary is minimized. The code and analysis are as follows:

import heapq
# The heapq library is one of the Python standard libraries. It provides methods for building a small top heap and some basic operations on the small top heap (such as entering the heap, removing the heap, etc.), and can be used to implement the heap sorting algorithm.
from collections import defaultdict
#defaultdict is mainly a subclass of dict in collections


class HuffmanNode:
# This class is used to build Huffman tree nodes
    def __init__(self, word, freq):
        self.word = word
        self.freq = freq
        self.left = None
        self.right = None
        self.parent = None

    def __lt__(self, other):
    # This magic method is actually triggered automatically when two HuffmanNode instance objects are compared.
        return self.freq < other.freq


# This class is used to build Huffman trees
def build_huffman_tree(vocabulary):
    # The essence of defaultdict is to construct a dict and assign the default value to 0. After that, you can directly assign a value to the empty dict and the key will be automatically generated.
    word_freq = defaultdict(int)
    # Calculate the word frequency of each word
    for word in vocabulary:
        word_freq[word] += 1

    # Take out all the values in the word frequency dictionary to build a Huffman node and put it into min_heap. min_heap means small root heap.
    # But the list here is not sorted
    min_heap = [HuffmanNode(word, freq) for word, freq in word_freq.items()]
    # heapq.heapify builds the heap in place (a min-heap by default).
    # heapq compares elements directly, and min_heap holds HuffmanNode objects,
    # so the __lt__ magic method defined above is required; otherwise the comparison raises an error.
    heapq.heapify(min_heap)

    # The following loop statement actually starts to build the Huffman tree
    while len(min_heap) > 1:
        # heappop removes and returns the smallest element (the heap root); heapq re-sifts the heap after each operation
        left = heapq.heappop(min_heap)
        right = heapq.heappop(min_heap)
        # This new_node is the constructed node. This node is constructed from bottom to top in the Huffman tree (because a small root heap is used here)
        # In the Huffman tree, the node does not have a word attribute, and the word frequency attribute is the sum of its left and right subtrees.
        new_node = HuffmanNode(None, left.freq + right.freq)
        new_node.left = left
        new_node.right = right
        # parent is None for now; it will be overwritten when this node is later merged under another new node
        new_node.parent = None
        left.parent = new_node #Assign the left child node to its parent node
        right.parent = new_node # Assign the right child node to its parent node
        # heappush adds new_node to min_heap and re-sifts the heap
        heapq.heappush(min_heap, new_node)
    
    # The only node left in min_heap is the root of the Huffman tree
    huffman_tree = min_heap[0]

    return huffman_tree


# This method is to construct the path of each word in the Huffman tree
def calculate_probabilities(huffman_tree):
    parents = []

    def traverse(node, path, probs):
        # Store the path of leaf nodes
        if node.word is not None:
            probs[node.word] = path
        # Take out all non-leaf nodes and then construct the θ matrix required
        if node.parent is not None:
            parents.append(node.parent)
        if node.left is not None:
            traverse(node.left, path + "1", probs)
        if node.right is not None:
            traverse(node.right, path + "0", probs)

    probs = {}
    traverse(huffman_tree, "", probs)
    return probs, parents


# Display the constructed Huffman tree, None represents non-leaf nodes
def display_huffman_tree(node, prefix='', is_left=True):
    if node is not None:
        print(f"{'L' if is_left else 'R'}{prefix}{node.word}:{node.freq}")
        display_huffman_tree(node.left, prefix + '|-- ', True)
        display_huffman_tree(node.right, prefix + '|-- ', False)
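
A quick usage sketch of the helpers above on a made-up token list (in the real code the tokens come from jieba segmentation of the corpus, and the exact codes depend on how ties are broken):

tokens = ['a', 'b', 'a', 'c', 'a', 'b']     # hypothetical tokens with frequencies a:3, b:2, c:1
tree = build_huffman_tree(tokens)
paths, parents = calculate_probabilities(tree)
print(paths)                                # e.g. {'a': '1', 'c': '01', 'b': '00'}; exact codes depend on tie-breaking
display_huffman_tree(tree)                  # internal nodes are printed with word None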

2. Combine CBOW and Huffman trees

This part is the key part mentioned above. The loss is the binary cross entropy between the sigmoid output at each internal node of the Huffman tree along the path and the true path code, and its gradient is computed for backpropagation. The code is as follows:

import numpy as np

class CBOWWithHuffman:
    # Instantiate each value Dictionary size Dimension of word mapping Huffman tree root node Learning rate
    def __init__(self, vocab_size, embedding_dim, huffman_tree, learning_rate=0.01):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.huffman_tree = huffman_tree
        self.learning_rate = learning_rate
    
    # Each incoming value here represents 'the weighted summed average of all words in the window' and 'all words in the window except the center word'
    # 'Dictionary of theta values corresponding to non-leaf nodes of the Huffman tree', 'Target path (direction of the Huffman tree)', 'Depth of the current word'
    def train(self, x_w, embedding_bag, parents_theta, direction, l_w):
        # Because the average loss needs to be calculated, the loss is reset to zero for each training session
        self.loss = 0
        # count is used to monitor l_w
        self.count = 0
        # e is used to update the embedding value of each word in the window, and needs to be reset to zero for each training
        self.e = 0

        # The following recursive calculation is the embodiment of the calculation formula. The specific calculation formula will be discussed below.
        def travel(node, d, x_w):
            d = int(d)
            if node is None:
                return
            if node.parent is not None:
                theta = parents_theta[node.parent]
                f = self.softmax(np.dot(x_w, theta.T))
                # l_f = np.dot(np.power(f, 1 - d), np.power(1 - f, d))
                g = (1 - d - f) * self.learning_rate
                self.e += np.dot(theta, g)
                parents_theta[node.parent] += np.dot(x_w, g)
                l = -(1 - d) * np.log(f) - d * np.log(1 - f)
                self.loss += l

            # Huffman code "1" goes to the left subtree and "0" to the right,
            # matching the encoding used in calculate_probabilities
            self.count += 1
            if self.count < l_w:
                next_node = node.left if d == 1 else node.right
                travel(next_node, direction[self.count], x_w)


        travel(self.huffman_tree, direction[self.count], x_w)
        # Update parameters
        embedding_bag = [vec + self.e for vec in embedding_bag]  # apply the accumulated update e to each context word vector
        return self.loss, embedding_bag

    def softmax(self, x):
        # despite the name, this computes the sigmoid function used by hierarchical softmax
        return 1 / (1 + np.exp(-x))
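
A hedged usage sketch of this class with stand-in inputs, assuming the helpers from step 1 are available; the real wiring of the tree, the paths and the θ dictionary appears in the complete code at the end:

tokens = ['a', 'b', 'a', 'c', 'a', 'b', 'd', 'd']               # hypothetical corpus
tree = build_huffman_tree(tokens)
paths, parents = calculate_probabilities(tree)

cbow = CBOWWithHuffman(vocab_size=len(set(tokens)), embedding_dim=20, huffman_tree=tree)

target_path = list(paths['c'])                                  # 'c' is rare, so its path passes an internal node below the root
parents_theta = {p: np.random.randn(20) for p in set(parents)}  # one theta vector per internal node
x_w = np.random.randn(20)                                       # averaged context vector (stand-in)
embedding_bag = [np.random.randn(20) for _ in range(4)]         # 4 context vectors, i.e. window size 2

loss, embedding_bag = cbow.train(x_w, embedding_bag, parents_theta, target_path, len(target_path))
print(loss)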

Now let me explain the formulas used above.

We use maximum likelihood to find the word vectors of all words and the θ of all internal nodes. Taking the example in the figure above, we want to maximize the following likelihood function:

\prod_{i=1}^{3} P(n(w_{i}), i)=\frac{1}{1 + e^{-x_{w}\theta_{1}^{T}}}\left(1-\frac{1}{1 + e^{-x_{w}\theta_{2}^{T}}}\right)\left(1-\frac{1}{1 + e^{-x_{w}\theta_{3}^{T}}}\right)

For all training samples, we hope to maximize the product of the likelihood functions of all samples.

To make the general description easier, we define the following: the input word is w; x_{w} is the word vector obtained by summing and averaging the input-layer (context) word vectors, i.e. the vector presented at the root of the Huffman tree; l_{w} is the number of nodes on the path from the root to the leaf where w sits; the i-th node that w passes through starting from the root is p_{i}^{w}, with corresponding Huffman code d_{i}^{w} ∈ {0,1}, defined for i=2,3,…,l_{w} (the root has no code); and the model parameter of node p_{i}^{w} is \theta_{i}^{w}, defined for i=1,2,…,l_{w}-1 (there is no i=l_{w} term because model parameters exist only for the internal nodes of the Huffman tree). For example, a word at depth 3 has l_{w}=4, codes d_{2}^{w}, d_{3}^{w}, d_{4}^{w}, and parameters \theta_{1}^{w}, \theta_{2}^{w}, \theta_{3}^{w} along its path.

Define the logistic regression probability at the j-th node that w passes through in the Huffman tree as P(d_{j}^{w}|x_{w},\theta_{j-1}^{w}); its expression is:

P(d_{j}^{w}|x_{w},\theta_{j-1}^{w})=\left\{\begin{matrix} \sigma (x_{w}(\theta_{j-1}^{w})^{T}) & d_{j}^{w}=0\\ 1-\sigma (x_{w}(\theta_{j-1}^{w})^{T}) & d_{j}^{w}=1 \end{matrix}\right.

Then for a given target output word w, its likelihood is:

\prod_{j=2}^{l_{w}}P(d_{j}^{w}|x_{w},\theta_{j-1}^{w})=\prod_{j=2}^{l_{w}}[\sigma(x_{w}(\theta_{j-1}^{w})^{T})]^{1-d_{j}^{w}}[1-\sigma(x_{w}(\theta_{j-1}^{w})^{T})]^{d_{j}^{w}}

To simplify the computation, we usually take the log of the likelihood, which also matches the definition of cross entropy, and get:

L=\log\prod_{j=2}^{l_{w}} P(d_{j}^{w}|x_{w}, \theta_{j-1}^{w})=\sum_{j=2}^{l_{w}}\left\{(1-d_{j}^{w})\log[\sigma(x_{w}(\theta_{j-1}^{w})^{T})] + d_{j}^{w}\log[1-\sigma(x_{w}(\theta_{j-1}^{w})^{T})]\right\}
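
As a small sanity check, here is a sketch that evaluates this log-likelihood directly for one word, with made-up path codes and vectors:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x_w = np.random.randn(20)                            # averaged context vector (hypothetical)
path_codes = [0, 1, 1]                               # d_j^w for j = 2 .. l_w (hypothetical)
thetas = [np.random.randn(20) for _ in path_codes]   # theta_{j-1}^w of the internal nodes on the path

log_likelihood = 0.0
for d, theta in zip(path_codes, thetas):
    p = sigmoid(np.dot(x_w, theta))
    log_likelihood += (1 - d) * np.log(p) + d * np.log(1 - p)

loss = -log_likelihood    # the training code accumulates exactly this negative log-likelihood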

After getting the loss function, we can perform gradient calculation to facilitate backpropagation:

\frac{\partial L}{\partial \theta_{j-1}^{w}}=\frac{\partial L}{\partial f(X)}\frac{\partial f(X)}{\partial X}\frac{\partial X}{\partial \theta_{j-1}^{w}}

f(X)=\frac{1}{1 + e^{-X}}

X=x_{w}(\theta_{j-1}^{w})^T

Substituting into the formula and simplifying, we get:

\frac{\partial L}{\partial \theta_{j-1}^{w}}=(1-d_{j}^{w}-\sigma (x_{w}(\theta_{j-1}^{w})^{T}))x_{w}

\frac{\partial L}{\partial x_{w}}=(1-d_{j}^{w}-\sigma (x_{w}(\theta_{j-1}^{w})^{T}))\theta_{j-1}^{w}

With these formulas in hand, we can plug them into the code shown above: g is the common factor (1-d_{j}^{w}-\sigma(x_{w}(\theta_{j-1}^{w})^{T})) multiplied by the learning rate, e accumulates the update for x_{w}, and parents_theta[node.parent] is updated in place with the gradient with respect to \theta_{j-1}^{w}.
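
A sketch of one gradient step at a single internal node, mirroring the variables g and e in the train method above (the vectors, code bit and learning rate here are made up):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

learning_rate = 0.01
x_w = np.random.randn(20)         # averaged context vector (hypothetical)
theta = np.random.randn(20)       # theta_{j-1}^w of one internal node (hypothetical)
d = 1                             # Huffman code d_j^w at this step (hypothetical)

f = sigmoid(np.dot(x_w, theta))
g = (1 - d - f) * learning_rate   # shared gradient factor, already scaled by the learning rate
e = g * theta                     # contribution to the update of x_w
theta = theta + g * x_w           # gradient ascent step on theta_{j-1}^w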

3. Build word2vec model (CBOW)

Finally, we build the CBOW (continuous bag-of-words) training loop. The code is as follows:

def word2vec(vocabulary, windows_size):
    length = len(vocabulary)
    n = windows_size * 2 + 1
    # Bag of words one_hot encoding
    bag = np.eye(n)
    # Calculate total loss
    sum_loss = 0
    # Store all word vectors
    word_dict = {}
    epochs = 10000
    for i, word in enumerate(vocabulary):
        target_path = list(tree_path[word])
        theta = np.random.randn(len(parents), 20)
        parents_theta = {}
        # use a separate loop variable so the outer word index i is not overwritten
        for idx, parent in enumerate(parents):
            parents_theta[parent] = theta[idx]

        #Initialize x_w weight
        w_linear = [np.random.randn(n, 20) if (j + i) >= windows_size and (j + i - 1) <= length else np.zeros((n, 20))
                    for j in range(windows_size * 2)]
        #x_i
        temp_bag = [bag[x] for x in range(n) if x != windows_size]
        # context_embedding
        embedding_bag = [np.dot(temp_bag[j], w_linear[j]) for j in range(windows_size * 2)]
        #x_w
        x_w = sum(embedding_bag) / (windows_size*2)
        temp_loss = 0
        for epoch in range(epochs):
            loss, embedding_bag = cbow.train(x_w, embedding_bag, parents_theta, target_path, len(target_path))
            temp_loss += loss

        sum_loss += temp_loss
        print("The loss of current word: {} is: {}".format(word, temp_loss/epochs))
        vector = sum(embedding_bag) / (windows_size * 2)
        #Put vector into dictionary
        if word in word_dict:
            word_dict[word] = (word_dict[word] + vector) / 2
        else:
            word_dict[word] = vector

    return sum_loss / length / epochs, word_dict
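
Note that word2vec as written relies on the module-level names tree_path, parents and cbow. They are prepared before the call, mirroring the __main__ block in the complete code at the end; a minimal sketch (the sentence here is a placeholder):

sentence = 'any raw training text you like'      # placeholder corpus
vocabulary = jieba.lcut(sentence)                # segment the text into tokens
huffman_tree = build_huffman_tree(vocabulary)
cbow = CBOWWithHuffman(len(vocabulary), 20, huffman_tree)
tree_path, temp = calculate_probabilities(huffman_tree)
parents = np.unique(temp)                        # deduplicate the internal-node (parent) list
sum_loss, word_dict = word2vec(vocabulary, 2)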

I won’t explain too much about this step. If you don’t understand, you can read my last blog.

Summary

I wrote this post in the process of learning NLP, and the code was all written by hand by me. If there are any errors, please point them out.

References for this blog:

Principle of word2vec (2) Model based on Hierarchical Softmax – Liu Jianping Pinard – Blog Park (cnblogs.com)

Complete code

import heapq
import jieba
import numpy as np
from collections import defaultdict


class HuffmanNode:
    def __init__(self, word, freq):
        self.word = word
        self.freq = freq
        self.left = None
        self.right = None
        self.parent = None

    def __lt__(self, other):
        return self.freq < other.freq


def build_huffman_tree(vocabulary):
    word_freq = defaultdict(int)
    for word in vocabulary:
        word_freq[word] += 1

    min_heap = [HuffmanNode(word, freq) for word, freq in word_freq.items()]
    heapq.heapify(min_heap)

    while len(min_heap) > 1:
        left = heapq.heappop(min_heap)
        right = heapq.heappop(min_heap)
        new_node = HuffmanNode(None, left.freq + right.freq)
        new_node.left = left
        new_node.right = right
        new_node.parent = None
        left.parent = new_node #Assign the left child node to its parent node
        right.parent = new_node # Assign the right child node to its parent node
        heapq.heappush(min_heap, new_node)

    huffman_tree = min_heap[0]

    return huffman_tree


def calculate_probabilities(huffman_tree):
    parents = []

    def traverse(node, path, probs):
        if node.word is not None:
            probs[node.word] = path
        if node.parent is not None:
            parents.append(node.parent)
        if node.left is not None:
            traverse(node.left, path + "1", probs)
        if node.right is not None:
            traverse(node.right, path + "0", probs)

    probs = {}
    traverse(huffman_tree, "", probs)
    return probs, parents


def display_huffman_tree(node, prefix='', is_left=True):
    if node is not None:
        print(f"{'L' if is_left else 'R'}{prefix}{node.word}:{node.freq}")
        display_huffman_tree(node.left, prefix + '|-- ', True)
        display_huffman_tree(node.right, prefix + '|-- ', False)


class CBOWWithHuffman:
    def __init__(self, vocab_size, embedding_dim, huffman_tree, learning_rate=0.01):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.huffman_tree = huffman_tree
        self.learning_rate = learning_rate

    def train(self, x_w, embedding_bag, parents_theta, direction, l_w):
        self.loss = 0
        self.count = 0
        self.e = 0

        def travel(node, d, x_w):
            d = int(d)
            if node is None:
                return
            if node.parent is not None:
                theta = parents_theta[node.parent]
                f = self.softmax(np.dot(x_w, theta.T))
                # l_f = np.dot(np.power(f, 1 - d), np.power(1 - f, d))
                g = (1 - d - f) * self.learning_rate
                self.e += np.dot(theta, g)
                parents_theta[node.parent] += np.dot(x_w, g)
                l = -(1 - d) * np.log(f) - d * np.log(1 - f)
                self.loss += l

            # Huffman code "1" goes to the left subtree and "0" to the right,
            # matching the encoding used in calculate_probabilities
            self.count += 1
            if self.count < l_w:
                next_node = node.left if d == 1 else node.right
                travel(next_node, direction[self.count], x_w)


        travel(self.huffman_tree, direction[self.count], x_w)
        # Update parameters
        embedding_bag = [vec + self.e for vec in embedding_bag]  # apply the accumulated update e to each context word vector
        return self.loss, embedding_bag

    def softmax(self, x):
        # despite the name, this computes the sigmoid function used by hierarchical softmax
        return 1 / (1 + np.exp(-x))


def word2vec(vocabulary, windows_size):
    length = len(vocabulary)
    n = windows_size * 2 + 1
    # Bag of words one_hot encoding
    bag = np.eye(n)
    sum_loss = 0
    # Store all word vectors
    word_dict = {}
    epochs = 10000
    for i, word in enumerate(vocabulary):
        target_path = list(tree_path[word])
        theta = np.random.randn(len(parents), 20)
        parents_theta = {}
        # use a separate loop variable so the outer word index i is not overwritten
        for idx, parent in enumerate(parents):
            parents_theta[parent] = theta[idx]

        #Initialize x_w weight
        w_linear = [np.random.randn(n, 20) if (j + i) >= windows_size and (j + i - 1) <= length else np.zeros((n, 20))
                    for j in range(windows_size * 2)]
        # x_i
        temp_bag = [bag[x] for x in range(n) if x != windows_size]
        # context_embedding
        embedding_bag = [np.dot(temp_bag[j], w_linear[j]) for j in range(windows_size * 2)]
        #x_w
        x_w = sum(embedding_bag) / (windows_size*2)
        temp_loss = 0
        for epoch in range(epochs):
            loss, embedding_bag = cbow.train(x_w, embedding_bag, parents_theta, target_path, len(target_path))
            temp_loss += loss

        sum_loss += temp_loss
        print("The loss of current word: {} is: {}".format(word, temp_loss/epochs))
        vector = sum(embedding_bag) / (windows_size * 2)
        #Put vector into dictionary
        if word in word_dict:
            word_dict[word] = (word_dict[word] + vector) / 2
        else:
            word_dict[word] = vector

    return sum_loss / length / epochs, word_dict


if __name__ == '__main__':
    sentence = 'This article explores whether AI can replace the role of engineering managers. A friend of the author came up with the idea of using an AI agent called EMAI to manage software engineering teams,' \
               'Thought this would improve efficiency and objectivity. The authors refute this idea, noting that human managers have empathy and intuition that AI cannot match and can shape organizational culture,' \
               'The ability to motivate people, build friendships, and adapt to change. The author finally takes a stand, choosing to trust coffee (or tea)-powered engineering managers over electricity-powered AI. '\
               'The authors hope to be able to automate tasks that get in the way of meaningful work, allowing for work that empathizes, cares and truly serves people. '

    vocabulary = jieba.lcut(sentence)
    huffman_tree = build_huffman_tree(vocabulary)

    cbow = CBOWWithHuffman(len(vocabulary), 20, huffman_tree)

    tree_path, temp = calculate_probabilities(huffman_tree)
    parents = np.unique(temp)
    sum_loss, word_dict = word2vec(vocabulary, 2)
    print()
    print('loss = ', sum_loss)
    print('dict: ')
    for i in set(vocabulary):
        print("word: {} vector: {}".format(i, word_dict[i]))