Analysis of the LSTM Principle in NLP

Article directory

  • Background
    • Limitations of SimpleRNN
  • LSTM
    • Write a sigmoid example by hand
    • Neural network supporting long memory
    • Interpretation of the three gates

Background

SimpleRNN has certain limitations. Before discussing them, let's review how SimpleRNN works, based on the figure below.

  1. Text in the figure:

    • The figure's caption says: “SimpleRNN is a basic model. It is used to solve sequential problems, in which the output of each step will affect the result of the next step. The formula and structure diagram in the figure both show this relationship.”
    • Four lines of pseudocode below it describe how SimpleRNN is computed. Simplified, they read:
      1. out1 and ht1 are obtained by applying an activation function to the input x1, the state of the previous moment h(t-1), the weights w1 and u1, and the bias term.
      2. out2 and ht2 are obtained from the input x2, the previous state ht1, the weights w2 and u2, and the bias term.
      3. out3 and ht3 are obtained from the input x3, the previous state ht2, the weights w3 and u3, and the bias term.
      4. The calculation of out4 and ht4 is not shown in full, but it presumably follows the same pattern.
  2. What the figure shows:

    • The figure shows a sequential network structure in which each time step has an input and an output.
    • Reading from left to right shows how the data flows: the input of each time step is labeled “input”, the output is labeled “output”, and a “state” sits between consecutive time steps.
    • The figure also shows how these states are passed from one time step to the next, which is the “memory” characteristic of an RNN.
  3. Interpretation of the working mechanism of SimpleRNN:

    • SimpleRNN takes an input at each time step and produces an output. But unlike general neural networks, SimpleRNN also maintains a “state” that is passed from one time step to the next.
    • This state can be thought of as the “memory” of the network, which carries information from the past and is used to influence calculations at the current time step.
    • Pseudocode shows how the output and state are calculated at each time step, both of which depend on the current input, the state of the previous time step, weights, and biases.
  4. Locations in the figure:

    • In the figure, “I, love, and motherland” are marked above each time step. They are connected to the “compute” module for each time step.
    • The “memory” is marked at the center of each time step and passed between time steps.
  5. Explain their role in SimpleRNN:

    • “I, love, motherland”: These words are the inputs at the successive time steps. In this example we can think of it as processing the text sequence “I love my motherland”; at each time step the “compute” module receives one of these words as input.
    • “memory”: This represents the internal state or “hidden state” of SimpleRNN. It is passed between time steps and saves information from previous time steps. At each time step, the “memory” is updated and used for the calculation of the next time step.

So “I”, “love”, and “motherland” are the inputs, and “memory” represents the internal state of SimpleRNN.

In short, the figure shows how SimpleRNN accepts an input at each time step and produces an output based on the “memory” of the previous time step.
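To make the pseudocode above concrete, here is a minimal NumPy sketch of the per-step computation. The dimensions, the tanh activation, and the weight names w and u are illustrative assumptions; note that in a standard SimpleRNN the same weights are shared by every time step, even though the figure labels them w1, u1, w2, u2, and so on.

import numpy as np

input_dim, hidden_dim = 3, 4                   # toy sizes, chosen arbitrarily
rng = np.random.default_rng(0)
w = rng.normal(size=(hidden_dim, input_dim))   # input-to-hidden weights
u = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
bias = np.zeros(hidden_dim)

def simple_rnn_step(x_t, h_prev):
    # In SimpleRNN the output and the new state are the same value:
    # activation(input term + recurrent term + bias).
    h_t = np.tanh(w @ x_t + u @ h_prev + bias)
    return h_t, h_t                            # (output, new state)

# "I love my motherland" represented by three dummy input vectors x1, x2, x3.
x1, x2, x3 = rng.normal(size=(3, input_dim))
h0 = np.zeros(hidden_dim)                      # initial "memory"

out1, ht1 = simple_rnn_step(x1, h0)            # step 1
out2, ht2 = simple_rnn_step(x2, ht1)           # step 2 uses the previous state ht1
out3, ht3 = simple_rnn_step(x3, ht2)           # step 3 uses the previous state ht2
print(out3)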

Limitations of SimpleRNN

  1. What are neural networks and SimpleRNN?

    • Neural networks are computational models used for data processing and pattern recognition. They are commonly used for tasks such as image recognition and natural language processing.
    • SimpleRNN (simple recurrent neural network) is a special type of neural network for processing sequence data, such as text or time series.
  2. The main limitations of SimpleRNN, briefly explained

    • Vanishing and exploding gradients: when processing long sequences, SimpleRNN has difficulty learning the importance of early information, mainly because the gradient (the signal used to update the model weights) either shrinks toward zero (vanishes) or grows without bound (explodes) as it is propagated back through many time steps.

    • Short-term memory: SimpleRNN can usually only remember recent information, which means it is not good at tasks with long-term dependencies.

    • Computational efficiency: despite its relatively simple structure, SimpleRNN can become computationally intensive and inefficient when processing very long sequences.

    • Overfitting: like other neural networks, SimpleRNN is prone to overfitting, that is, it performs well on training data but poorly on unseen data.

These are the main limitations of the simple recurrent neural network (SimpleRNN).
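The vanishing/exploding gradient limitation can be made concrete with a rough numeric sketch: backpropagating through T time steps multiplies the gradient by roughly the recurrent weight (times the activation derivative) once per step. The weight values 0.5 and 1.5 and the 50-step horizon below are arbitrary illustrations, not numbers from the article.

T = 50
for w_rec in (0.5, 1.5):
    grad = 1.0
    for _ in range(T):
        grad *= w_rec        # one multiplication per time step of backpropagation
    print(f"recurrent weight {w_rec}: gradient factor after {T} steps = {grad:.3e}")

# 0.5 -> about 8.9e-16 (the gradient vanishes); 1.5 -> about 6.4e+08 (it explodes)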

LSTM

Write a sigmoid example by hand

import numpy as np
import matplotlib.pyplot as plt

# Sample the range [-5, 5) with a step of 0.1.
x = np.arange(-5.0, 5.0, 0.1)
print(x)
# Sigmoid: 1 / (1 + e^(-x)), which squashes any real number into (0, 1).
y = 1 / (1 + np.exp(-x))
print(y)
plt.plot(x, y)
plt.show()
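The plot makes the key property visible: the sigmoid squashes any real input into the open interval (0, 1). That is exactly why LSTM uses it for its gates later in this article, where a value near 0 blocks information and a value near 1 lets it pass.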

Neural network supporting long memory



Let's interpret the network structure shown in the figure and walk through its process.

  1. Identify key parts of the diagram:

    • A: The core computing unit of the network.
    • X_{t-1}, X_t, X_{t+1}: the inputs at the individual time steps of the sequence.
    • h_{t-1}, h_t, h_{t+1}: the outputs or hidden states corresponding to those time steps.
    • “tanh” activation function, addition and multiplication operations.
  2. Provide a description for each section:

    • A: the core part of the network, responsible for all calculations. It receives the current input and the hidden state of the previous time step, and outputs the hidden state of the current time step.
    • X_{t-1}, X_t, X_{t+1}: the data fed into the network sequentially, corresponding to successive time steps.
    • h_{t-1}, h_t, h_{t+1}: the outputs or hidden states of the network at the various time steps. They contain information from previous time steps and are passed along to successive time steps.
    • “tanh” is an activation function used for nonlinear transformations.
  3. Describe the entire process:

    • Starting at time step t-1, the input X_{t-1} and the hidden state h_{t-2} are provided to unit A.

    • In unit A, calculations of multiplication, addition and the “tanh” activation function are performed.
    • The result is the hidden state h_{t-1}; this state is also the output of this time step and will be passed to the next time step.

    • For time step t the process repeats: the input X_t and the hidden state h_{t-1} are provided to unit A, and the output is h_t.

    • The same process continues: for time step t+1, the input is X_{t+1} and the hidden state is h_t, and the output is h_{t+1}.

Overall, this is a simplified representation of a Recurrent Neural Network (RNN) for processing sequence data. Each time step receives an input and the hidden state of the previous time step, produces an output, and passes that output to the next time step.
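A minimal loop that mirrors the figure: the same unit A (the same weights) is applied at every step; it takes X_t and h_{t-1} and emits h_t, which is both that step's output and the state carried to the next step. The shapes and weight names below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
input_dim, hidden_dim, steps = 3, 4, 5

W_x = rng.normal(size=(hidden_dim, input_dim))   # applied to the input X_t
W_h = rng.normal(size=(hidden_dim, hidden_dim))  # applied to the carried state h_{t-1}
b = np.zeros(hidden_dim)

X = rng.normal(size=(steps, input_dim))          # X_{t-1}, X_t, X_{t+1}, ... as rows
h = np.zeros(hidden_dim)                         # initial hidden state

outputs = []
for x_t in X:
    h = np.tanh(W_x @ x_t + W_h @ h + b)         # unit A: multiply, add, tanh
    outputs.append(h)                            # h_t is both the output and the carried state

print(np.stack(outputs).shape)                   # (steps, hidden_dim)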

Interpretation of the three gates


In the figure above, i = input, o = output, f = forget.

This figure shows the calculation process of a single unit in a Long Short-Term Memory (LSTM) network.

1. Describe the main parts of the image

  • Figure caption: “Three-gate mechanism”.
  • Several formulas are given in the figure, describing the calculation of the input gate (i), forget gate (f) and output gate (o) in LSTM, as well as the update method of memory cells.
  • The bottom of the picture shows the direction of data flow in the LSTM unit.

2. Explain how LSTM works

  • LSTM is designed to solve the problems of vanishing and exploding gradients, which are a challenge in traditional RNNs.
  • LSTM achieves long-term memory through three gates (an input gate, a forget gate, and an output gate) and a memory cell.

3. Provide additional supplements and interpretations based on the picture content

  • Input gate (i): controls how much new input information is let in. Formula: i = sigmoid(wt * xt + ut * ht-1 + b).
  • Forget gate (f): decides which information is discarded (forgotten) from the memory cell. Formula: f = sigmoid(wt * xt + ut * ht-1 + b).
  • Output gate (o): controls how much information flows from the memory cell into the hidden state. Formula: o = sigmoid(wt * xt + ut * ht-1 + b).
  • C̃: the candidate value computed from the current input. Formula: C̃ = tanh(wt * xt + ut * ht-1 + b).
  • Ct: the updated memory cell. Formula: Ct = f * Ct-1 + i * C̃, i.e. the old memory kept by the forget gate plus the new information selected by the input gate.
  • ht: the current hidden state. Formula: ht = o * tanh(Ct).

(Each gate and the candidate value have their own weight matrices and bias; they are written with the same symbols wt, ut, b here only for simplicity.)

The role of these gates enables LSTM to learn and remember long-term dependencies, leading to success in various sequence prediction tasks.
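To see numerically what “gating” means, here is a tiny made-up illustration (the vectors below are arbitrary): the sigmoid outputs lie between 0 and 1 and act as element-wise switches on the memory cell.

import numpy as np

old_memory = np.array([0.9, -0.4, 0.7])        # Ct-1
f = np.array([0.99, 0.01, 0.5])                # pretend forget-gate outputs (sigmoid values)
i = np.array([0.10, 0.90, 0.2])                # pretend input-gate outputs
candidate = np.array([0.3, 0.8, -0.6])         # pretend candidate values C̃ (tanh values)

new_memory = f * old_memory + i * candidate    # Ct = f * Ct-1 + i * C̃
print(new_memory)  # first component mostly kept, second mostly replaced by new information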

Let us first explain the calculation process of LSTM step by step, and then compare it with traditional RNN.

1. Calculation process of LSTM

a. Input:

  • xt: the input of the current time step.
  • ht-1: the hidden state of the previous time step.
  • Ct-1: the memory cell of the previous time step.

b. Forget gate (f):
Decides which previous memories are retained and which are forgotten.

f = sigmoid(wt * xt + ut * ht-1 + b)

c. Input gate (i) and memory candidate value (C̃):
Decide which new information to write into memory.

i = sigmoid(wt * xt + ut * ht-1 + b)

C̃ = tanh(wt * xt + ut * ht-1 + b)

d. Update memory cells (Ct):
Combine the output of the forget gate and the output of the input gate to update the memory cell.

Ct = f * Ct-1 + i * C̃

e. Output gate (o):
Calculate what the next hidden state should be.

o = sigmoid(wt * xt + ut * ht-1 + b)

f. Calculate hidden state (ht):

ht = o * tanh(Ct)
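Putting steps b through f together, here is a minimal NumPy sketch of a single LSTM step. Each gate and the candidate value get their own weight matrices here (w_f, u_f, and so on); those names, the sizes, and the random initialization are illustrative assumptions, not something prescribed by the article.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

input_dim, hidden_dim = 3, 4
rng = np.random.default_rng(2)

# One (w, u, b) triple per gate plus one for the candidate value C̃.
params = {name: (rng.normal(size=(hidden_dim, input_dim)),
                 rng.normal(size=(hidden_dim, hidden_dim)),
                 np.zeros(hidden_dim))
          for name in ("f", "i", "c", "o")}

def lstm_step(x_t, h_prev, c_prev):
    w_f, u_f, b_f = params["f"]
    w_i, u_i, b_i = params["i"]
    w_c, u_c, b_c = params["c"]
    w_o, u_o, b_o = params["o"]

    f = sigmoid(w_f @ x_t + u_f @ h_prev + b_f)          # b. forget gate
    i = sigmoid(w_i @ x_t + u_i @ h_prev + b_i)          # c. input gate
    c_tilde = np.tanh(w_c @ x_t + u_c @ h_prev + b_c)    # c. candidate value C̃
    c_t = f * c_prev + i * c_tilde                       # d. update the memory cell
    o = sigmoid(w_o @ x_t + u_o @ h_prev + b_o)          # e. output gate
    h_t = o * np.tanh(c_t)                               # f. hidden state
    return h_t, c_t

x_t = rng.normal(size=input_dim)
h_prev = np.zeros(hidden_dim)
c_prev = np.zeros(hidden_dim)
h_t, c_t = lstm_step(x_t, h_prev, c_prev)
print(h_t.shape, c_t.shape)   # (4,) (4,)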

2. The difference between LSTM and traditional RNN

a. Memory cells and hidden states:

  • LSTM: There is an additional state called a “memory cell” that can store information across multiple time steps.
  • RNN: There is only one hidden state.

b. Gate mechanism:

  • LSTM: Uses three gates (input, output, and forget gates) to control the flow of information.
  • RNN: Without these gates, information is simply passed and transformed at each time step.

c. Long-term dependencies:

  • LSTM: Thanks to its gate mechanism and memory cells, LSTM can handle long-term dependencies, remembering information over hundreds of time steps.
  • RNN: It is difficult to handle long-term dependencies because information is gradually lost or diluted at each time step.

d. Gradient problem:

  • LSTM: Designed to mitigate vanishing and exploding gradient problems.
  • RNN: More susceptible to vanishing or exploding gradient problems.

Summary: Although LSTM and SimpleRNN are both recurrent neural networks, LSTM's gate mechanism and memory cell design let it handle long-term dependencies far better and make it much less prone to vanishing or exploding gradients. A key reason is that the memory cell is updated additively (Ct = f * Ct-1 + i * C̃), so the gradient flowing through Ct is not repeatedly squashed by an activation function at every time step.

Internal structure: