LSTM: Long Short-Term Memory Network

LSTM basic structure

To address the vanishing gradient problem that RNNs suffer from over long time spans, the machine learning community developed the long short-term memory (LSTM) unit, which realizes temporal memory through gating and mitigates vanishing gradients.

LSTM is a special kind of RNN (recurrent neural network); let us first look at its structure compared with the plain RNN:

Network structure of LSTM

Compared with the plain RNN, which passes along only one state ht, the LSTM carries two states: ct (the cell state) and ht (the hidden state). Typically ct changes slowly: the new ct is the value passed on from the previous step plus some additions, whereas ht can differ markedly from one node to the next.

The figure below shows the calculation process of LSTM. There are four inputs: the candidate input z, the input gate signal zi, the forget gate signal zf, and the output gate signal zo, together with one output a.

The three gates each perform their own role. Each gate usually uses the sigmoid function as its activation, so the activated value lies between 0 and 1, which makes it convenient to control how far the "gate" is open. The input gate decides how much of z is let in, the forget gate decides whether the value in the memory cell is kept or reset, and the output gate decides whether the result can finally be output.

Popular understanding of the three gates of LSTM

A gate is a way to optionally let information pass through, and LSTM has three gates that are used to protect and control the state of the cell.

There are three main stages inside LSTM:
1. Forgetting stage
This stage selectively forgets the information passed in from the previous node; put simply, "forget the unimportant, remember the important". Concretely, the computed zf acts as the forget gate and controls which parts of the previous state ct-1 are kept and which are forgotten.
2. Select memory stage
This stage selectively "remembers" the input of the current step; the main purpose is to select and memorize the input xt, recording what is important and down-weighting what is not. The current input content is represented by the z computed above, and the selection is controlled by the gating signal zi. Adding the results of these two steps gives the state passed on to the next step: ct = zf × ct-1 + zi × z.
3. Output stage
This stage determines what will be treated as the output of the current state, controlled mainly by zo. The ct obtained in the previous stage is also scaled (passed through a tanh activation). As in a plain RNN, the output yt is usually obtained by a further transformation of ht. (The combined formulas are summarized below.)
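
Putting the three stages together, a common way to write them (an illustrative summary, assuming the four signals are computed from the concatenation of the current input xt and the previous hidden state ht-1) is:

z^f = \sigma(W^f [x_t, h_{t-1}]), \quad z^i = \sigma(W^i [x_t, h_{t-1}]), \quad z^o = \sigma(W^o [x_t, h_{t-1}]), \quad z = \tanh(W [x_t, h_{t-1}])

c_t = z^f \odot c_{t-1} + z^i \odot z

h_t = z^o \odot \tanh(c_t), \quad y_t = \sigma(W' h_t)

Here \odot denotes element-wise multiplication, and W' is the (assumed) output transformation applied to ht.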

LSTM training process

Calculation process

In the first step, we decide what information to discard from the cell state. This decision is made by a sigmoid layer called the "forget gate". It looks at ht-1 (the previous output) and xt (the current input) and outputs, for each element of the previous cell state ct-1, a number between 0 and 1, where 1 means "keep completely" and 0 means "discard completely".

In the second step, we decide what new information to store in the cell state. First, a sigmoid layer called the "input gate layer" determines which values we will update. Next, a tanh layer creates the candidate vector c̃t that could be added to the cell state. In the following step these two are combined to create the update.

The third step updates the cell state to ct. We multiply the previous state ct-1 by ft, dropping the part we decided to forget, and then add it × c̃t, the new candidate values scaled by how much we decided to update each state value.

Finally, we decide what to output. The output is based on the cell state, but in a filtered form. First we run a sigmoid layer that determines which parts of the cell state to output; then we pass the cell state through tanh (squashing the values to between -1 and 1) and multiply it by the output of the sigmoid gate, so that we output only the parts we decided on.
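
The following is a minimal NumPy sketch of one LSTM cell step that follows the four steps above. The weight names (W_f, W_i, W_c, W_o and the biases) and the convention of concatenating ht-1 with xt are illustrative assumptions, not taken from any particular library.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x_t, h_prev, c_prev, params):
    """One forward step of an LSTM cell (illustrative sketch).

    x_t: input at time t, shape (input_size,)
    h_prev, c_prev: previous hidden and cell states, shape (hidden_size,)
    params: dict with matrices W_* of shape (hidden_size, hidden_size + input_size)
            and bias vectors b_* of shape (hidden_size,)
    """
    z = np.concatenate([h_prev, x_t])                    # joint input [h_{t-1}, x_t]
    f_t = sigmoid(params['W_f'] @ z + params['b_f'])     # step 1: forget gate
    i_t = sigmoid(params['W_i'] @ z + params['b_i'])     # step 2: input gate
    c_hat = np.tanh(params['W_c'] @ z + params['b_c'])   # step 2: candidate values
    c_t = f_t * c_prev + i_t * c_hat                     # step 3: update the cell state
    o_t = sigmoid(params['W_o'] @ z + params['b_o'])     # step 4: output gate
    h_t = o_t * np.tanh(c_t)                             # step 4: filtered output
    return h_t, c_t

# tiny usage example with random illustrative parameters
rng = np.random.default_rng(0)
input_size, hidden_size = 1, 16
params = {k: rng.standard_normal((hidden_size, hidden_size + input_size)) * 0.1
          for k in ('W_f', 'W_i', 'W_c', 'W_o')}
params.update({k: np.zeros(hidden_size) for k in ('b_f', 'b_i', 'b_c', 'b_o')})
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
h, c = lstm_cell_step(np.array([0.5]), h, c, params)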

LSTM training algorithm framework

The training algorithm of LSTM is still backpropagation, an algorithm we are already very familiar with. There are three main steps (a brief PyTorch sketch follows the list):

  1. Forward-compute the output value of each neuron; for LSTM this means the five vectors ft, it, ct, ot, ht described in the previous section.
  2. Backward-compute the error term for each neuron. As with the plain recurrent neural network, backpropagation of the LSTM error term proceeds in two directions: backward along time, starting from the current time t and computing the error term at every earlier moment, and upward to the previous layer.
  3. Compute the gradient of each weight based on the corresponding error term.
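
In practice, an automatic-differentiation framework carries out steps 2 and 3 for us. The snippet below is a rough sketch (not the manual derivation of the error terms) showing how the three steps map onto PyTorch: the forward pass computes the gate and state values, loss.backward() propagates the error terms both backward through time and down the layers, and the resulting gradient of every weight is then available in its .grad field.

import torch
from torch import nn

lstm = nn.LSTM(input_size=1, hidden_size=16, num_layers=1)
x = torch.randn(10, 5, 1)              # (seq_len, batch, input_size)
target = torch.randn(10, 5, 16)        # dummy target matching the hidden size

out, (h_n, c_n) = lstm(x)              # step 1: forward pass over all time steps
loss = nn.functional.mse_loss(out, target)
loss.backward()                        # step 2: backpropagate the error terms

for name, p in lstm.named_parameters():
    print(name, p.grad.shape)          # step 3: gradient of each weight, held in p.grad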

LSTM advantages and disadvantages

LSTM advantages

  1. CNNs are not well suited to learning time series and require various kinds of auxiliary processing, with no guarantee of good results. For problems and tasks that are sensitive to temporal order, an RNN (such as LSTM) is usually more appropriate: RNNs are designed for sequence data and have a built-in memory effect;
  2. An RNN can be viewed as a deep feed-forward neural network in which all layers share the same weights, which makes it difficult to learn and store information over long periods. To solve this problem, the idea of augmenting the network with explicit storage arose. The LSTM, with its special hidden units, is designed to store inputs for a long time. Its special unit, the memory cell, acts like an accumulator or a gated neuron: it has a self-connection to the next time step, so it copies its own state and accumulates the external signal, but this self-connection is multiplicatively gated by another unit that learns when to clear the contents of the memory;
  3. It alleviates the gradient vanishing and gradient explosion problems that occur when training RNNs on long sequences.

LSTM Disadvantages

  1. It parallelizes poorly; compared with some more recent architectures its performance is only average;
  2. LSTM and its variants mitigate the gradient problems of RNNs to some extent, but not completely: they can handle sequences on the order of 100 steps, but sequences on the order of 1000 steps or longer remain very difficult;
  3. It is computationally expensive. Each LSTM cell contains four fully connected layers (MLPs), so if the time span is large and the network is deep, the amount of computation becomes heavy and time-consuming.

Pytorch-based LSTM code implementation

Below we use a simple small example to illustrate how to use Pytorch to build an LSTM model.

We use the sine and cosine functions to construct the time series. Since the cosine is the derivative of the sine, we can build a model that learns the mapping from the sine function to the cosine function, predicting the value of the cosine function from the input value of the sine function.

The corresponding relationship between the sine function and the cosine function is shown in the figure below:

[Figure: correspondence between sin(t) and cos(t)]

On each curve, every value of the sine function corresponds to a value of the cosine function. However, if you look only at the value of the sine function itself, without considering the time at which it occurs, then sine values and cosine values are not in one-to-one correspondence. For example, at t = 2.5 and t = 6.8 we have sin(t) = 0.5, but cos(t) takes different values at these two moments. In other words, without the time information, the same sine value may correspond to several different cosine values. A traditional neural network, which predicts the output from the current input alone, is therefore not applicable when the same input can correspond to multiple outputs.

We take the value of the sine function as the input to the LSTM and predict the value of the cosine function. The LSTM model is built with PyTorch using 1 input neuron, 1 output neuron, and 16 hidden neurons, with the mean squared error (MSE) as the loss and the Adam optimizer for training. The complete code, based on Anaconda and Python 3.6, is as follows:

# -*- coding:UTF-8 -*-
import numpy as np
import torch
from torch import nn
import matplotlib.pyplot as plt

# Define LSTM Neural Networks
class LstmRNN(nn.Module):
    """
        Parameters:
        - input_size: feature size
        - hidden_size: number of hidden units
        - output_size: number of output
        - num_layers: layers of LSTM to stack
    """
    def __init__(self, input_size, hidden_size=1, output_size=1, num_layers=1):
        super().__init__()
 
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers) # utilize the LSTM model in torch.nn
        self.forwardCalculation = nn.Linear(hidden_size, output_size)
 
    def forward(self, _x):
        x, _ = self.lstm(_x) # _x is input, size (seq_len, batch, input_size)
        s, b, h = x.shape # x is output, size (seq_len, batch, hidden_size)
        x = x.view(s*b, h)
        x = self.forwardCalculation(x)
        x = x.view(s, b, -1)
        return x

if __name__ == '__main__':
    # create database
    data_len = 200
    t = np.linspace(0, 12*np.pi, data_len)
    sin_t = np.sin(t)
    cos_t = np.cos(t)

    dataset = np.zeros((data_len, 2))
    dataset[:,0] = sin_t
    dataset[:,1] = cos_t
    dataset = dataset.astype('float32')

    # plot part of the original dataset
    plt.figure()
    plt.plot(t[0:60], dataset[0:60,0], label='sin(t)')
    plt.plot(t[0:60], dataset[0:60,1], label = 'cos(t)')
    plt.plot([2.5, 2.5], [-1.3, 0.55], 'r--', label='t = 2.5') # t = 2.5
    plt.plot([6.8, 6.8], [-1.3, 0.85], 'm--', label='t = 6.8') # t = 6.8
    plt.xlabel('t')
    plt.ylim(-1.2, 1.2)
    plt.ylabel('sin(t) and cos(t)')
    plt.legend(loc='upper right')

    # choose dataset for training and testing
    train_data_ratio = 0.5 # Choose 50% of the data for training
    train_data_len = int(data_len*train_data_ratio)
    train_x = dataset[:train_data_len, 0]
    train_y = dataset[:train_data_len, 1]
    INPUT_FEATURES_NUM = 1
    OUTPUT_FEATURES_NUM = 1
    t_for_training = t[:train_data_len]

    # test_x = train_x
    # test_y = train_y
    test_x = dataset[train_data_len:, 0]
    test_y = dataset[train_data_len:, 1]
    t_for_testing = t[train_data_len:]

    # ----------------- train -------------------
    train_x_tensor = train_x.reshape(-1, 5, INPUT_FEATURES_NUM) # set batch size to 5
    train_y_tensor = train_y.reshape(-1, 5, OUTPUT_FEATURES_NUM) # set batch size to 5
 
    # transfer data to pytorch tensor
    train_x_tensor = torch.from_numpy(train_x_tensor)
    train_y_tensor = torch.from_numpy(train_y_tensor)
    # test_x_tensor = torch.from_numpy(test_x)
 
    lstm_model = LstmRNN(INPUT_FEATURES_NUM, 16, output_size=OUTPUT_FEATURES_NUM, num_layers=1) # 16 hidden units
    print('LSTM model:', lstm_model)
    print('model.parameters:', lstm_model.parameters)
 
    loss_function = nn.MSELoss()
    optimizer = torch.optim.Adam(lstm_model.parameters(), lr=1e-2)
 
    max_epochs = 10000
    for epoch in range(max_epochs):
        output = lstm_model(train_x_tensor)
        loss = loss_function(output, train_y_tensor)
 
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
 
        if loss.item() < 1e-4:
            print('Epoch [{}/{}], Loss: {:.5f}'.format(epoch + 1, max_epochs, loss.item()))
            print("The loss value is reached")
            break
        elif (epoch + 1) % 100 == 0:
            print('Epoch: [{}/{}], Loss:{:.5f}'.format(epoch + 1, max_epochs, loss.item()))
 
    # prediction on training dataset
    predictive_y_for_training = lstm_model(train_x_tensor)
    predictive_y_for_training = predictive_y_for_training.view(-1, OUTPUT_FEATURES_NUM).data.numpy()

    # torch.save(lstm_model.state_dict(), 'model_params.pkl') # save model parameters to files
 
    # ----------------- test -------------------
    # lstm_model.load_state_dict(torch.load('model_params.pkl')) # load model parameters from files
    lstm_model = lstm_model.eval() # switch to testing model

    # prediction on test dataset
    test_x_tensor = test_x.reshape(-1, 5, INPUT_FEATURES_NUM) # set batch size to 5, the same value with the training set
    test_x_tensor = torch.from_numpy(test_x_tensor)
 
    predictive_y_for_testing = lstm_model(test_x_tensor)
    predictive_y_for_testing = predictive_y_for_testing.view(-1, OUTPUT_FEATURES_NUM).data.numpy()
 
    # ----------------- plot -------------------
    plt.figure()
    plt.plot(t_for_training, train_x, 'g', label='sin_trn')
    plt.plot(t_for_training, train_y, 'b', label='ref_cos_trn')
    plt.plot(t_for_training, predictive_y_for_training, 'y--', label='pre_cos_trn')

    plt.plot(t_for_testing, test_x, 'c', label='sin_tst')
    plt.plot(t_for_testing, test_y, 'k', label='ref_cos_tst')
    plt.plot(t_for_testing, predictive_y_for_testing, 'm--', label='pre_cos_tst')

    plt.plot([t[train_data_len], t[train_data_len]], [-1.2, 4.0], 'r--', label='separation line') # separation line

    plt.xlabel('t')
    plt.ylabel('sin(t) and cos(t)')
    plt.xlim(t[0], t[-1])
    plt.ylim(-1.2, 4)
    plt.legend(loc='upper right')
    plt.text(14, 2, "train", size=15, alpha=1.0)
    plt.text(20, 2, "test", size=15, alpha=1.0)

    plt.show()

The training process is as follows:

[Figure: training process log]

The results of the model on the training set and test set are as follows:

[Figure: model results on the training and test sets]

In the figure, the left side of the red dashed line shows the model's performance on the training set, and the right side shows its performance on the test set. It can be seen that with an LSTM model we can accurately predict the value of the cosine function at time t using only the value of the sine function at time t as input, without adding extra inputs such as the current time or the rate of change.

LSTM variants

GRU (Gated Recurrent Unit Network)

LSTM has the drawbacks of long training time, many parameters, and complex internal computation. In 2014, Cho et al. built on the original LSTM by merging its forget gate and input gate into a single update gate, removing the separate cell state, and using the hidden state to carry information, yielding a variant of the LSTM network: the GRU (Gated Recurrent Unit) network. The way GRU computes the new information at the current moment also differs from LSTM.

To summarize, GRU improves on LSTM in two key ways:

  1. The forget gate and the input gate are combined into one gate: the update gate, and another gate is called the reset gate.
  2. No additional internal state c is introduced; instead a linear dependency is established directly between the current state ht and the previous state ht-1.


The difference between the two

  1. GRU has two gates (reset gate and update gate), while LSTM has three gates (input gate, forget gate and output gate).
  2. GRU does not control and maintain internal memory, and does not have output gates as in LSTM.
  3. The input and forget gates in LSTM correspond to the update gate of GRU, and the reset gate acts directly on the previous hidden state.
  4. GRUs do not apply second-order nonlinearities when computing the output.

The update gate and reset gate of the GRU model

  • Update gate zt: controls how much the state information ht-1 from the previous moment influences the current state. The larger the value of the update gate, the more of the previous state information ht-1 is carried in.
  • Reset gate rt: controls the degree to which the state information ht-1 from the previous moment is ignored. The smaller the value of the reset gate, the more it is ignored.

Intuitively, the reset gate determines how the new input is combined with the previous memory, and the update gate defines how much of the previous memory is carried over to the current time step. If we set the reset gate to 1 and the update gate to 0, we recover the standard RNN model.

GRU model process analysis

Update Gate
At time step t, we first compute the update gate zt using the formula below. Here xt is the input vector at time step t, i.e. the t-th component of the input sequence X; it undergoes a linear transformation (multiplication by the weight matrix W(z)). ht-1 holds the information of the previous time step t-1 and also undergoes a linear transformation (by U(z)). The update gate adds these two pieces of information and feeds the sum into a sigmoid activation function, which squashes the result to a value between 0 and 1.

z_t = \sigma(W^{(z)} x_t + U^{(z)} h_{t-1})

Reset Gate
ht-1 and xt each first undergo a linear transformation; the results are added and passed through a sigmoid activation function to produce the gate value:

r_t = \sigma(W^{(r)} x_t + U^{(r)} h_{t-1})
The form of the expression is the same as that of the update gate, but the weights of the linear transformations and the role of the gate are different.

Current memory content
With the reset gate in place, the new memory content uses the reset gate to store the relevant past information. Its calculation expression is given below: the input xt and the previous time step's information ht-1 each first undergo a linear transformation, i.e. multiplication by the matrices W and U respectively:

h_t' = \tanh(W x_t + r_t \odot U h_{t-1})
Compute the Hadamard product of the reset gate rt and Uht-1, i.e. their element-wise product. Because the reset gate computed earlier is a vector of values between 0 and 1, it measures how far the gate is open: if the gate value for some element is 0, the information in that element is completely forgotten. The Hadamard product therefore determines which of the previous information to keep and which to forget.
Add the results of these two parts and pass the sum through the hyperbolic tangent activation function.

The final memory of the current time step
In the last step, the network computes ht, the vector that keeps the information of the current unit and is passed on to the next unit. This step uses the update gate, which determines how much to take from the current memory content ht' and how much from the previous time step's ht-1:

h_t = z_t \odot h_{t-1} + (1 - z_t) \odot h_t'
zt is the activation of the update gate and again controls the inflow of information through gating. The Hadamard product of zt and ht-1 is the information carried from the previous time step into the final memory; adding to it the information retained from the current memory content gives the final output of the gated recurrent unit.
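
As a quick check of the earlier claim that setting the reset gate to 1 and the update gate to 0 recovers the standard RNN: substituting rt = 1 and zt = 0 into the formulas above gives

h_t = 0 \odot h_{t-1} + (1 - 0) \odot h_t' = \tanh(W x_t + U h_{t-1})

which is exactly the update rule of a vanilla recurrent unit.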

From this process it can be seen that the gated recurrent unit does not clear out previous information over time; it keeps the relevant information and passes it to the next unit, so it uses all the available information and alleviates the vanishing gradient problem.
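
Below is a minimal NumPy sketch of one GRU step that implements the four formulas above. The parameter names (W_z, U_z, W_r, U_r, W, U) follow the notation used here and are illustrative assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell_step(x_t, h_prev, p):
    """One forward step of a GRU cell (illustrative sketch)."""
    z_t = sigmoid(p['W_z'] @ x_t + p['U_z'] @ h_prev)        # update gate
    r_t = sigmoid(p['W_r'] @ x_t + p['U_r'] @ h_prev)        # reset gate
    h_hat = np.tanh(p['W'] @ x_t + r_t * (p['U'] @ h_prev))  # current memory content h_t'
    h_t = z_t * h_prev + (1.0 - z_t) * h_hat                 # final memory of the time step
    return h_t

# usage with random illustrative parameters
rng = np.random.default_rng(0)
input_size, hidden_size = 1, 16
p = {k: rng.standard_normal((hidden_size, input_size)) * 0.1 for k in ('W_z', 'W_r', 'W')}
p.update({k: rng.standard_normal((hidden_size, hidden_size)) * 0.1 for k in ('U_z', 'U_r', 'U')})
h = np.zeros(hidden_size)
h = gru_cell_step(np.array([0.5]), h, p)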

Peephole LSTM

That is, when computing the input gate, forget gate and output gate, we consider not only h and x but also the cell state C.
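
For reference, the peephole gates are commonly written as below (following Gers & Schmidhuber; the exact placement of the peephole connections varies between papers, so treat this as one representative form):

f_t = \sigma(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f)

i_t = \sigma(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i)

o_t = \sigma(W_o \cdot [C_t, h_{t-1}, x_t] + b_o)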

Coupled LSTM

The input gate and forget gate are merged into one: instead of deciding separately what to forget and what new information to add, the two decisions are made together (see the formula below).
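
Written out, the coupled cell-state update is commonly given as follows (the symbol C̃t for the candidate state is carried over from the standard LSTM description and is an assumption of this sketch):

C_t = f_t \odot C_{t-1} + (1 - f_t) \odot \tilde{C}_t

so new information is added exactly where old information is forgotten.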

ConvLSTM

It can be seen that ConvLSTM also adopts the peephole structure: the cell state is likewise used in the computation of the forget gate and the input gate.

So we have the following calculation process:


Here * denotes the convolution operation and ∘ denotes the Hadamard product.
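
For reference, the ConvLSTM equations as usually written (following Shi et al., 2015), with X_t the input, H_t the hidden state, and C_t the cell state, all 3-D tensors, are:

i_t = \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \odot C_{t-1} + b_i)

f_t = \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \odot C_{t-1} + b_f)

C_t = f_t \odot C_{t-1} + i_t \odot \tanh(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c)

o_t = \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \odot C_t + b_o)

H_t = o_t \odot \tanh(C_t)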

Another way to understand ConvLSTM: an ordinary LSTM can be regarded as a ConvLSTM whose last two (spatial) dimensions are 1, with a convolution kernel of size 1×1.