NLP Recurrent Neural Network Sentiment Analysis (8)

Article directory

    • Code display
    • Code intent
    • Code interpretation
    • Introduction to knowledge points
    • Parameter calculation of model structure
      • 1. Embedding Layer
      • 2. SimpleRNN Layer
      • 3. Fully connected layer (Dense Layer)
      • 4. The second fully connected layer (Dense Layer)

Code display

import pandas as pd
import tensorflow as tf

# Build RNN neural network
tf.random.set_seed(1)
df = pd.read_csv("../data/Clothing Reviews.csv")
print(df.info())

df['Review Text'] = df['Review Text'].astype(str)
x_train = df['Review Text']
y_train = df['Rating']

from tensorflow.keras.preprocessing.text import Tokenizer

# Build a word index for the vocabulary, capped at the dict_size most frequent words
dict_size = 14848
tokenizer = Tokenizer(num_words=dict_size)
# (For Chinese text, a tokenizer such as jieba would handle stop words, punctuation, parts of speech, etc.)
tokenizer.fit_on_texts(x_train)
print(len(tokenizer.word_index), tokenizer.index_word)

# Encode the review text as integer sequences
x_train_tokenized = tokenizer.texts_to_sequences(x_train)

# Convert the variable-length sequences to a uniform length by padding/truncating
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_comment_length = 120
x_train = pad_sequences(x_train_tokenized, maxlen=max_comment_length)

for v in x_train[:10]:
    print(v, len(v))

# Build RNN neural network
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, SimpleRNN, Embedding
import tensorflow as tf

rnn = Sequential()
# For the RNN, first map word indices to word vectors
rnn.add(Embedding(input_dim=dict_size, output_dim=60, input_length=max_comment_length))

rnn.add(SimpleRNN(units=100)) # The second layer is a SimpleRNN with 100 units

rnn.add(Dense(units=10, activation=tf.nn.relu))

rnn.add(Dense(units=6, activation=tf.nn.softmax)) # Output the classification results
rnn.compile(loss='sparse_categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
print(rnn.summary())
result = rnn.fit(x_train, y_train, batch_size=64, validation_split=0.3, epochs=1)
print(result)
print(result.history)

Code intent

The main purpose of this code is to build a simple RNN (Recurrent Neural Network) to classify the reviews in “Clothing Reviews.csv”. Review text is converted into numerical sequences, and these sequences are then used to train an RNN model to predict the review’s rating.

Process description:

  1. Set a random seed: Ensure repeatable results.

    tf.random.set_seed(1)
    
  2. Read data: Use pandas to read the “Clothing Reviews.csv” file and print the data information.

    df = pd.read_csv("../data/Clothing Reviews.csv")
    print(df.info())
    
  3. Data preprocessing:

    • Convert ‘Review Text’ column to string type.
    • Extract training data x_train and labels y_train from the data frame.
      df['Review Text'] = df['Review Text'].astype(str)
      x_train = df['Review Text']
      y_train = df['Rating']
      
  4. Text Tokenization:

    • Use Tokenizer for text tokenization, creating a dictionary that maps each word to an integer value.

    • Learn the comment text by calling the fit_on_texts method.

      dict_size = 14848
      tokenizer = Tokenizer(num_words=dict_size)
      tokenizer.fit_on_texts(x_train)
      
    • Convert comment text to a sequence of integers.

      x_train_tokenized = tokenizer.texts_to_sequences(x_train)
      
  5. Sequence padding: To ensure that all sequences are of the same length, use pad_sequences to pad or truncate the sequences.

    max_comment_length = 120
    x_train = pad_sequences(x_train_tokenized, maxlen=max_comment_length)
    
  6. Build RNN model:

    • Initialize a sequential model Sequential.
    • Add an Embedding layer to map word indices to fixed-size dense vectors.
    • Add a SimpleRNN layer with 100 neurons.
    • Add two fully connected layers Dense for classification tasks.
    • Compile the model and set up the loss function, optimizer and evaluation criteria.
      rnn = Sequential()
      rnn.add(Embedding(input_dim=dict_size, output_dim=60, input_length=max_comment_length))
      rnn.add(SimpleRNN(units=100))
      rnn.add(Dense(units=10, activation=tf.nn.relu))
      rnn.add(Dense(units=6, activation=tf.nn.softmax))
      rnn.compile(loss='sparse_categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
      

    Model structure

    Model: "sequential"
    ______________________________________________________________
     Layer (type) Output Shape Param #
    ================================================== ===============
     embedding (Embedding) (None, 120, 60) 890880
                                                                     
     simple_rnn (SimpleRNN) (None, 100) 16100
                                                                     
     dense (Dense) (None, 10) 1010
                                                                     
     dense_1 (Dense) (None, 6) 66
    
  7. Model Summary: Prints a summary of the model, showing details of each layer.

    print(rnn.summary())
    
  8. Train the model: Use the extracted data to train the RNN model, and use 30% of the data as the validation set. Train for 1 epoch.

    result = rnn.fit(x_train, y_train, batch_size=64, validation_split=0.3, epochs=1)
    
  9. Print results: Display the training history of the model.

    print(result)
    print(result.history)
    

In summary, this code first performs data preprocessing and then builds and trains an RNN model to classify reviews.
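The original code stops after training, but it is worth noting how the trained model would be used. The sketch below is an illustrative addition, not part of the original code: it reuses the fitted tokenizer, the padding length, and the trained rnn to score a made-up review. Since the ratings are presumably 1-5 and sparse_categorical_crossentropy indexes classes from 0, the 6 softmax units cover class indices 0-5, with index 0 effectively unused.

import numpy as np

# Hypothetical usage sketch: reuses `tokenizer`, `max_comment_length`, and the
# trained `rnn` from the code above; the review text is invented.
new_reviews = ["Love this dress, it fits perfectly!"]
seq = tokenizer.texts_to_sequences(new_reviews)      # text -> integer ids
seq = pad_sequences(seq, maxlen=max_comment_length)  # pad to length 120
probs = rnn.predict(seq)                             # shape (1, 6): softmax scores
print(np.argmax(probs, axis=1))                      # index of the most likely rating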

Code interpretation

The following is a line-by-line interpretation of the code:

  1. tf.random.set_seed(1):

    • Set the global random seed to 1 to ensure the repeatability of the random process.
  2. df = pd.read_csv("../data/Clothing Reviews.csv"):

    • Use pandas to read the CSV file and assign its content to df.
  3. print(df.info()):

    • Print a brief summary of df, including the number of non-null values in each column, data type and other information.
  4. df['Review Text'] = df['Review Text'].astype(str):

    • Convert ‘Review Text’ column in DataFrame to string type.
  5. x_train = df['Review Text']:

    • Extract the contents of the ‘Review Text’ column from df and assign it to x_train.
  6. y_train = df['Rating']:

    • Extract the contents of the ‘Rating’ column from df and assign it to y_train.
  7. from tensorflow.keras.preprocessing.text import Tokenizer:

    • Import the text tokenization tool from the TensorFlow library.
  8. dict_size = 14848:

    • Set the vocabulary size to 14848.
  9. tokenizer = Tokenizer(num_words=dict_size):

    • Initialize a Tokenizer object with a maximum vocabulary size of dict_size.
  10. tokenizer.fit_on_texts(x_train):

    • Build a vocabulary for the tokenizer object based on the text content in x_train.
  11. print(len(tokenizer.word_index), tokenizer.index_word):

    • Print the vocabulary size and the index-to-word mapping.
  12. x_train_tokenized = tokenizer.texts_to_sequences(x_train):

    • Convert the text in x_train into integer sequences and assign them to x_train_tokenized.
  13. from tensorflow.keras.preprocessing.sequence import pad_sequences:

    • Import the sequence padding utility from the TensorFlow library.
  14. max_comment_length = 120:

    • Set the maximum comment length to 120.
  15. x_train = pad_sequences(x_train_tokenized, maxlen=max_comment_length):

    • Pad or truncate the sequences in x_train_tokenized to a uniform length (120).
  16. for v in x_train[:10]: print(v, len(v)):

    • Print the first 10 sequences in x_train and their lengths.
  17. from tensorflow.keras.models import Sequential and the other import statements:

    • Import the required model and layer classes from the TensorFlow library.
  18. rnn = Sequential():

    • Initialize a Sequential model object and assign it to rnn.
  19. rnn.add(Embedding(...)):

    • Add an Embedding layer to the model.
  20. rnn.add(SimpleRNN(units=100)):

    • Add a SimpleRNN layer with 100 neurons.
  21. rnn.add(Dense(units=10, activation=tf.nn.relu)):

    • Add a fully connected layer with 10 neurons and the ReLU activation function.
  22. rnn.add(Dense(units=6, activation=tf.nn.softmax)):

    • Add an output layer with 6 neurons and the Softmax activation function.
  23. rnn.compile(...):

    • Compile the model, specifying the loss function, optimizer, and evaluation metrics.
  24. print(rnn.summary()):

    • Print a summary of the model showing the number of parameters in each layer.
  25. result = rnn.fit(...):

    • Train the model with a 30% validation split and assign the returned History object to result.
  26. print(result):

    • Print the History object from training.
  27. print(result.history):

    • Print the recorded training history, such as loss and accuracy for each epoch.

The main purpose of this code is to use a recurrent neural network (RNN) to classify the reviews in “Clothing Reviews.csv”.

Introduction to knowledge points

Here is a detailed introduction to each important function:

  1. tf.random.set_seed(1):

    • Function: Set a global random seed to ensure the repeatability of the random process.
  2. pd.read_csv("../data/Clothing Reviews.csv"):

    • Function: Use the pandas library to read CSV files and return a DataFrame object.
  3. df['Review Text'].astype(str):

    • Function: Convert the ‘Review Text’ column in the DataFrame to string type.
  4. Tokenizer(num_words=dict_size)

    • Function: Initialize a Tokenizer object, which can convert text into a sequence of integers.
    • Parameters: num_words is the maximum number of words the Tokenizer will keep; only the dict_size most frequent words are considered (see the toy example at the end of this section).
  5. tokenizer.fit_on_texts(x_train):

    • Function: Build a vocabulary for the Tokenizer object based on the provided text data.
  6. tokenizer.texts_to_sequences(x_train):

    • Function: Convert a list of text into a sequence list of integers, where the integers are indices of words in the vocabulary.
  7. pad_sequences(x_train_tokenized, maxlen=max_comment_length)

    • Function: Convert a list of integer sequences into a 2D NumPy array. Sequences shorter than maxlen are padded and longer ones are truncated.
    • Parameter: maxlen defines the maximum length of the sequence.
  8. Sequential():

    • Function: Initializes a linear stacking model, allowing layers to be added sequentially.
  9. Embedding(input_dim=dict_size, output_dim=60, input_length=max_comment_length):

    • Function: Convert integer tokens to dense vectors.
    • Parameters: input_dim is the size of the vocabulary, output_dim is the dimension of the embedding vector, and input_length is the length of the input sequence.
  10. SimpleRNN(units=100):

    • Function: Add a SimpleRNN layer, the most basic form of recurrent layer.
    • Parameters: units defines the number of RNN units.
  11. Dense(units=10, activation=tf.nn.relu) and Dense(units=6, activation=tf.nn.softmax):

    • Function: Add fully connected layers.
    • Parameters: units defines the number of neurons in the layer, and activation is the activation function.
  12. rnn.compile(...):

    • Function: Compile the model and prepare it for training.
    • Parameters: loss defines the loss function, optimizer defines the optimization algorithm, and metrics defines the criteria for model evaluation.
  13. rnn.fit(x_train, y_train, batch_size=64, validation_split=0.3, epochs=1):

    • Function: Train the model.
    • Parameters: batch_size defines the number of samples used for each gradient update, validation_split defines the proportion of data used for validation, and epochs defines the number of training cycles.

These functions together complete the process of data processing, model building and training. Hopefully these detailed introductions will help you understand the code better!
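As a quick illustration of points 4-7, here is a minimal, self-contained sketch of the Tokenizer and pad_sequences pipeline; the toy sentences are invented for demonstration and are not from the dataset:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["great dress", "poor fit, great fabric"]  # toy corpus
tok = Tokenizer(num_words=50)
tok.fit_on_texts(texts)  # builds the word index: {'great': 1, 'dress': 2, ...}
seqs = tok.texts_to_sequences(texts)  # [[1, 2], [3, 4, 1, 5]]
print(tok.word_index)
print(pad_sequences(seqs, maxlen=4))
# [[0 0 1 2]
#  [3 4 1 5]]  <- zero-padded on the left to a uniform length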

Parameter calculation of model structure

rnn = Sequential()
rnn.add(Embedding(input_dim=dict_size, output_dim=60, input_length=max_comment_length))
rnn.add(SimpleRNN(units=100))
rnn.add(Dense(units=10, activation=tf.nn.relu))
rnn.add(Dense(units=6, activation=tf.nn.softmax))
rnn.compile(loss='sparse_categorical_crossentropy', optimizer="adam", metrics=['accuracy'])

Model structure

Model: "sequential"
______________________________________________________________
Layer (type) Output Shape Param #
================================================== ===============
embedding (Embedding) (None, 120, 60) 890880
\t                                                                 
simple_rnn (SimpleRNN) (None, 100) 16100
\t                                                                 
dense (Dense) (None, 10) 1010
\t                                                                 
dense_1 (Dense) (None, 6) 66

The Param # column in the summary above gives the number of parameters in each layer of the network. The count for each layer is determined by the layer's input and output sizes and the layer type. Here is how the parameters of each layer are calculated:

1. Embedding Layer

The number of parameters is the vocabulary size (input_dim) multiplied by the embedding dimension (output_dim). For example, if the vocabulary had 20,000 words and each word were mapped to a 60-dimensional space, the number of parameters would be 20,000 * 60 = 1,200,000. The summary shows 890,880, which matches the vocabulary size used in this code: dict_size * 60 = 14,848 * 60 = 890,880.
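This arithmetic is easy to sanity-check (numbers taken from the code above):

dict_size, output_dim = 14848, 60
print(dict_size * output_dim)  # 890880: one 60-dimensional vector per vocabulary entry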

(None, 120, 60) is the tensor passed to the RNN. None is the batch dimension (the number of comments). Comments originally have different lengths, but they were unified to 120 by padding, and each word is represented by 60 features.

(None, 120, 60) → [number of comments, length of each comment, word vector] → [number of comments, word vector]

A recurrent neural network processes a sequence step by step. Each comment has 120 words, and these 120 words are fed into the network one at a time, each carrying 60 features. (As an analogy: if the network processed one word per second, a 120-word comment would take 2 minutes.) At every time step, what actually enters the network is a matrix of shape [number of comments, word vector].
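The following minimal sketch (random data, shapes only) shows this behavior: the SimpleRNN reads the 120 time steps one by one and returns only the final hidden state for each comment.

import tensorflow as tf

x = tf.random.normal((2, 120, 60))  # 2 comments, 120 words each, 60 features per word
out = tf.keras.layers.SimpleRNN(units=100)(x)
print(out.shape)  # (2, 100): one final 100-dimensional state per comment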

2. SimpleRNN Layer

The number of parameters of a simple RNN layer consists of three parts: the input-to-hidden weights (input_dim * units), the hidden-to-hidden weights (units * units), and the bias terms (units). The formula is (input_dim + units) * units + units. With an input dimension of 60 (from the embedding layer) and 100 units, the number of parameters is (60 + 100) * 100 + 100 = 16,100.
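Checking the formula with the values from this model:

input_dim, units = 60, 100
w_input = input_dim * units   # input-to-hidden weights: 6000
w_recurrent = units * units   # hidden-to-hidden weights: 10000
biases = units                # one bias per unit: 100
print(w_input + w_recurrent + biases)  # 16100, matching the model summary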

3. Fully connected layer (Dense Layer)

The number of parameters of a fully connected layer is the number of input units multiplied by the number of output units, plus one bias per output unit. The previous layer has 100 units and this layer has 10 units, so the number of parameters is 100 * 10 + 10 = 1,010.

4. The second fully connected layer (Dense Layer)

Same as above, if the previous layer has 10 units and this layer has 6 units, the number of parameters is 10 * 6 + 6 = 66.
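The same check for both fully connected layers, and the overall total:

print(100 * 10 + 10)  # 1010: first Dense layer (100 inputs -> 10 units)
print(10 * 6 + 6)     # 66: second Dense layer (10 inputs -> 6 units)
print(890880 + 16100 + 1010 + 66)  # 908056 trainable parameters in total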

Parameter calculation is an important factor in considering computational cost and model capacity when building and training neural networks. Each parameter needs to be learned through training data, so the number of parameters directly affects the learning ability and training time of the model.