Keras CNN: training a convolutional neural network to recognize handwritten digits

Contents

1. Configuration

2. Model training

3. Test set accuracy

4. Generalization to applications

5. Improvements

The open-source projects on the Internet are a mixed bag, and many of them won't even run, often because of package version mismatches. So let's start with the configuration of this project.

1. Configuration

python

python 3.11.0

External Libs (mainly the following)

numpy 1.26.0
keras 2.14.0
tensorflow 2.14.0
opencv-python 4.8.0.76

MNIST dataset

Downloaded automatically through the mnist module integrated into Keras, so there is no need to fetch it manually from yann
(if you do download it manually, choose the http link, otherwise you need an account and password)

To reproduce it, see the repository below (Lec3 contains the corresponding code):

<https://github.com/Liyanhao1209/Machine_Vision_Practice.git>

2. Model training

2.0 Wheels

Whatever the network architecture looks like, we always need the ready-made "wheels" for the various layers (convolution, pooling, fully connected), so the first step is importing the packages.

import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D, Dense, Flatten
from keras.utils import to_categorical
import tensorflow.compat.v1 as tf  # public alias of the v1 compatibility API
import keras
from keras.datasets import mnist
tf.disable_v2_behavior()

tf.disable_v2_behavior() is called because TensorFlow exposes two versions of its API; we use the v1-style API here, so the v2 behavior is disabled.

  1. Sequential: my personal understanding is that the CNN framework itself is like a queue container: layers are enqueued one by one, and during training each layer's operations run in the order they entered the queue.
  2. Conv2D: convolutional layer
  3. MaxPool2D: pooling layer
  4. Flatten: flatten layer
  5. Dense: fully connected layer
  6. Others: utility functions

2.1 Data preprocessing

First, load the data (under the hood this downloads it):

(X_train, y_train), (X_test, y_test) = mnist.load_data()
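A quick shape check confirms what load_data() returns (these shapes are fixed by the MNIST dataset itself):

print(X_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)
print(X_test.shape, y_test.shape)    # (10000, 28, 28) (10000,)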

(Later I will read the data with OpenCV to see what these handwritten digits look like.)

A sample (image omitted) looks like this: white digits on a black background, 28 px × 28 px. The training set contains 60,000 images and the test set 10,000, all labeled (supervised).

Data used for training and testing is generally grayscale. The MNIST dataset is already grayscale (single channel), so no conversion is needed here; later, in the user program, we will have to convert RGB input to grayscale first.

However, the Keras API expects an explicit channel dimension, so we have to reshape the data to declare that it really is a single-channel grayscale image.

img_x, img_y = X_train.shape[1], X_train.shape[2]
X_train = X_train.reshape(X_train.shape[0], img_x, img_y, 1)
X_test = X_test.reshape(X_test.shape[0], img_x, img_y, 1)

Then normalize to [0, 1] (this helps accuracy).

X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

Then the scalar labels need to be vectorized (required by the API).

To put it simply: if the label is i, a vector of length k (specified by the caller; usually 10, though 9 in the example below) is generated with the i-th position set to 1 and the rest 0. One such vector is produced for each label, and together they form a matrix.

e.g.

from keras.utils import to_categorical
# Define the category vector
b = [0,1,2,3,4,5,6,7,8]
# Call to_categorical to convert b into 9 categories
b = to_categorical(b, 9)
print(b)
 
The execution results are as follows:
[[1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1.]]

This method is called one-hot encoding (one bit is "hot", i.e. effective, per vector).

Why one-hot encode? See Machine Learning Mastery.

Data preprocessing is now complete.

2.2 Network Architecture

The architecture is as follows (diagram omitted):

  1. Input layer: the dataset consists of 28×28 single-channel grayscale images
  2. First convolutional layer: kernel size 5×5
  3. First pooling layer: max pooling, pool size 2×2
  4. Second convolutional layer: kernel size 5×5
  5. Second pooling layer: max pooling, pool size 2×2
  6. Flatten layer
  7. First fully connected layer: ReLU activation
  8. Second fully connected layer: softmax activation

Add each layer to the Sequential model in the order listed above.

model = Sequential()
model.add(Conv2D(32, kernel_size=(5,5), activation='relu', input_shape=(img_x, img_y, 1)))
model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
model.add(Conv2D(64, kernel_size=(5,5), activation='relu'))
model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
model.add(Flatten())
model.add(Dense(1000, activation='relu'))
model.add(Dense(10, activation='softmax'))
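As a quick sanity check on the layer shapes (a sketch assuming Keras' default 'valid' padding for Conv2D, which applies here since no padding argument is given):

# Expected output shape after each layer:
#   input        (28, 28, 1)
#   Conv2D 5x5   (24, 24, 32)   # 28 - 5 + 1 = 24
#   MaxPool 2x2  (12, 12, 32)
#   Conv2D 5x5   (8, 8, 64)     # 12 - 5 + 1 = 8
#   MaxPool 2x2  (4, 4, 64)
#   Flatten      (1024,)        # 4 * 4 * 64
#   Dense        (1000,)
#   Dense        (10,)          # one probability per digit
model.summary()  # prints the actual shapes to verify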

2.3 Model training

First, compile the model, choosing a loss function and an optimizer:

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Then train (note that y_train must first be one-hot encoded, as described in section 2.1):

y_train = to_categorical(y_train, num_classes=10)  # one-hot, as described in 2.1
model.fit(X_train, y_train, batch_size=128, epochs=10)

Evaluate (training accuracy after 10 epochs was 99.02%; I forgot to take a screenshot, and re-running would take a long time):

score = model.evaluate(X_test, keras.utils.to_categorical(y_test, num_classes=10))
print('acc', score[1])  # score[0] is the loss, score[1] the accuracy

Save the model so it can be reused:

model.save('cnn.h5')

This part, including the network architecture above, is still a black box to me; all I can do is keep tuning parameters.

3. Test set accuracy

Predict on the test-set data, then compare the results with the ground-truth (supervised) labels:

import numpy as np
from keras.datasets import mnist
from keras.models import load_model

(X_train, y_train), (X_test, y_test) = mnist.load_data()
model = load_model('cnn.h5')

# Preprocess exactly as during training: single channel, float, normalized
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1).astype('float32') / 255

predictions = np.argmax(model.predict(X_test), axis=1)
correct = 0
for i in range(len(y_test)):
    if predictions[i] == y_test[i]:
        correct += 1
print("accuracy" + str(correct / len(predictions)))

Output result:

accuracy0.9912

(Screenshot of the intermediate variable omitted.) It is indeed 9912/10000 (9999 is the last index; indices start at 0, so there are 10,000 entries).

According to general requirements, an accuracy of 0.99 is considered sufficient.
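Incidentally, the counting loop above can be collapsed into one vectorized NumPy line (equivalent to the loop, since both arrays are 1-D integer arrays of the same length):

accuracy = np.mean(predictions == y_test)  # fraction of matching labels
print("accuracy" + str(accuracy))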

4. Generalization to applications

Read dataset images through OpenCV and write them to a designated folder. Later, the pictures in this folder serve as input for our user program (of course, you can also handwrite your own).

# getPics.py
import cv2
from keras.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Forward slashes avoid Python escape problems ("\U..." is an invalid escape)
path = "C:/Users/Administrator/Desktop/hwNums/"
for i in range(0, 100):
    cv2.imwrite(path + str(i) + ".png", X_train[i])

Here we write out 100 images.

One thing to note: the model we trained takes ndarrays of shape 28×28×1, so when reusing the model the input data must also be 28×28×1. The images written out above are 28×28, but the user may supply images that are not 28×28, so the image size has to be adapted.

The first thing that comes to mind is cv2.resize(), but a large one-step resize throws away many features during scaling, which is very detrimental to the model's predictions, so we need a different scaling method.

4.0 Image Pyramid

What is an image pyramid?

An image pyramid is a collection of sub-images of a single image at multiple resolutions, generated by repeatedly downsampling; the smallest image may contain just one pixel. (Example figure omitted.) The images are arranged like a pyramid, with resolution decreasing from bottom to top.

Typically, the base of the pyramid is the high-resolution image to be processed (the original), while the top is its low-resolution approximation. Moving toward the top, image size and resolution keep decreasing; typically, width and height are halved at each level.

The simplest image pyramid is obtained by repeatedly deleting the even rows and columns of an image. For example, an N×N image becomes (N/2)×(N/2) after its even rows and columns are deleted, a quarter of the original area. Repeating this process yields the image pyramid.
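In NumPy terms, this subsampling is just slicing with a step of 2; a minimal sketch on a hypothetical 8×8 array:

import numpy as np

img = np.arange(8 * 8).reshape(8, 8)  # a stand-in 8x8 "image"
smaller = img[::2, ::2]               # keep every other row and column
print(smaller.shape)                  # (4, 4): a quarter of the original area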

An approximation of the original image can also be obtained by first filtering the original and then deleting the even rows and columns of the approximation, which gives the downsampled result. Several filters can be used:

Neighborhood-averaging filter: uses neighborhood means to compute the approximation; this produces a mean pyramid.

Gaussian filter: filters the original image with a Gaussian kernel to produce a Gaussian pyramid. This is the method used by the OpenCV function cv2.pyrDown().

A Gaussian pyramid is generated by repeatedly applying Gaussian filtering and downsampling (process diagram omitted). The original image together with each downsampling result forms the pyramid: the original can be called layer 0, the first downsampled result layer 1, the second layer 2, and so on.
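A minimal sketch of building such a pyramid with cv2.pyrDown (the filename is a placeholder):

import cv2

img = cv2.imread("digit.png")       # hypothetical input image
pyramid = [img]                     # layer 0: the original
for _ in range(3):                  # three downsampling steps
    img = cv2.pyrDown(img)          # Gaussian filter, then drop even rows/columns
    pyramid.append(img)             # layers 1, 2, 3
for i, layer in enumerate(pyramid):
    print("layer", i, layer.shape)  # each layer is about half the previous size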

To put it simply, a good image pyramid (the Laplacian pyramid being a more powerful variant) keeps an image from losing as much information when it is scaled.

Here we choose the Gaussian pyramid and halve the image each time until it is no longer larger than the target size (28). This alone usually doesn't land exactly on 28, and halving again cannot fix that, so resize() is used for the final adjustment (over such a small change, resize no longer loses much information).

def adjustImg(e, target):
    # Halve via the Gaussian pyramid while the image is still larger than target
    while min(e.shape[0], e.shape[1]) > target:
        e = cv2.pyrDown(e)
    # Final adjustment to exactly target x target
    e = cv2.resize(e, (target, target), interpolation=cv2.INTER_CUBIC)
    return e

(The Gaussian pyramid converges gradually toward the target size.)
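Tracing a hypothetical 560×560 input through this function: pyrDown halves it (rounding up) 560 → 280 → 140 → 70 → 35 → 18, the loop exits once the size is no longer above 28, and the final cubic resize brings 18 back up to 28. A usage sketch with a placeholder filename:

small = adjustImg(cv2.imread("big_digit.png"), 28)
print(small.shape)  # (28, 28, 3) for a BGR input; grayscale conversion happens later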

4.1 Writing the user program

Requirement: the user provides a folder of pictures (only PNG format is supported), and the trained model identifies the handwritten digit in each one.

First, the user must provide the folder path (an absolute path) and ensure it contains only PNG images, with no files in any other format (and no subfolders).

path = input("Enter the path to the folder where the handwritten digit pictures are located:")

Then all PNG images in the folder are read and echoed back to the user (this is done because the reading order is not necessarily the order in which the files are stored in the folder).

Since OpenCV's window-display API blocks (nothing else can run until its event loop continues), I open a worker thread: the thread displays the images while the main thread does the computation, so display and prediction run in parallel.

files = readFiles(path)
images = getImages(files)
t_show = threading.Thread(target=showImages, args=(files,))
t_show.start()

Then, just as during training, make sure the data entering the model has the correct format (explicit single channel, normalized):

img_x, img_y = images.shape[1], images.shape[2]
images = images.reshape(images.shape[0], img_x, img_y, 1)
images = images.astype('float32')
images /= 255

Load the model:

model = load_model('cnn.h5')

Predict and print the results:

predictions = model.predict(images)
print('predictions')
print(np.argmax(predictions, axis=1))

If 100 images from the test set are used as input, the results are as follows (screenshot omitted): a hundred echoed results are too many to show, so I cut it to the first 6, which read 5, 0, 3, 5, 3, 6. All correct, basically no problems.

But as mentioned before, user-supplied data may be different. For example, I hand-drew several pictures of different sizes with a drawing pad and passed them in.

As you can see from the screenshot (omitted), the sizes differ. The smallest (the third one, a 7) is 28×28; the others vary, some square and some rectangular.

Let's see the result (screenshot omitted).

The correct answers are 2, 3, 7, 5, 5, 5, 7; the model predicts 1, 1, 7, 1, 3, 5, 7, a hit rate of 3/7, roughly 42%. That is very low.

What is the reason? Look at the two pictures of the digit 7: one is 28×28 and one is 56×56, both close to the 28×28 used for training, so they lose little key information when scaled, and the model recognizes them.

But the picture of the 2 has relatively thin strokes and measures 560×560; it loses too many features when scaled down, so it goes unrecognized.

Let's look at the intermediate variables again (this is another set; I used a white pen on a black background, the same as the training set, yet the results are still poor, which shows the main problem is not the color).

As you can see, the dot plot in the intermediate variable still clearly shows the digit 3, but our model was "spoiled" by the tidy handwritten digits of the training set, and even this relatively clear one went unrecognized.

This is overfitting: our model has absorbed so many features of the training set (60,000 images) that it cannot recognize slightly "deformed" data from outside it.

5. Improvements

  1. As just mentioned, our model may be overfitting its training set, so we can modify the training set (add handwritten digits in more varied styles), stop training at the right moment (so the model doesn't absorb too many training-set peculiarities), or add counterexamples: supervision need not be positive-example supervision only, and larger differences make it easier to train a model with a smaller generalization error. (See the sketch after this list.)
  2. Handle the input data better. Even using the image pyramid, a lot of information is still lost during scaling. I tried some proportional-scaling APIs, but the results were mediocre (plain resize was even worse than the current approach). This point needs more time to explore.
  3. Switch to a better-performing model. Bluntly, this means tuning not only the parameters but also the network architecture (convolution/deconvolution, pooling/unpooling, and so on) to build a CNN handwriting recognizer with stronger generalization.
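For point 1, here is a minimal sketch of what "more varied training data plus stopping at the right time" could look like in Keras. This is an assumed approach for illustration, not code from the project; it reuses model, X_train, and the one-hot-encoded y_train from section 2.

from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import EarlyStopping

# Randomly shift/rotate/zoom the training digits to simulate different writing styles
datagen = ImageDataGenerator(rotation_range=10, width_shift_range=0.1,
                             height_shift_range=0.1, zoom_range=0.1)

# Hold out part of the training set for validation (don't peek at the test set)
X_tr, y_tr = X_train[:-5000], y_train[:-5000]
X_val, y_val = X_train[-5000:], y_train[-5000:]

# Stop once validation loss stops improving, instead of always running 10 epochs
early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)

model.fit(datagen.flow(X_tr, y_tr, batch_size=128),
          epochs=30,
          validation_data=(X_val, y_val),
          callbacks=[early_stop])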

The full code is as follows:

#cnn_practice.py
import numpy as np
# from torchvision.datasets import MNIST#Get the MNIST data set
from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D, Dense, Flatten
from keras.utils import to_categorical
import tensorflow.compat.v1 as tf  # public alias of the v1 compatibility API
import keras
from keras.datasets import mnist
import matplotlib.pyplot as plt
import cv2
tf.disable_v2_behavior()

#-------------------------------- Preparing data --------------------------------
# trainData = MNIST(root="/MNIST_data", train=True, download=True)
# testData = MNIST(root="/MNIST_data",train=False,download=True)
# train_images, train_labels = trainData.train_data,trainData.train_labels
# test_images, test_labels = testData.test_data,testData.test_labels
(X_train, y_train), (X_test, y_test) = mnist.load_data()
img_x, img_y = X_train.shape[1], X_train.shape[2]


# plt.imshow(X_train[1])
# plt.show()


X_train = X_train.reshape(X_train.shape[0], img_x, img_y, 1)
X_test = X_test.reshape(X_test.shape[0], img_x, img_y, 1)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

y_train = keras.utils.to_categorical(y_train, num_classes=10)
#-------------------------------- Keras --------------------------------
#Build model
model = Sequential()
model.add(Conv2D(32, kernel_size=(5,5), activation='relu', input_shape=(img_x, img_y, 1)))
model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
model.add(Conv2D(64, kernel_size=(5,5), activation='relu'))
model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
model.add(Flatten())
model.add(Dense(1000, activation='relu'))
model.add(Dense(10, activation='softmax'))

#Model compilation
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

#train
model.fit(X_train, y_train, batch_size=128, epochs=10)

#evaluate model
score = model.evaluate(X_test, keras.utils.to_categorical(y_test, num_classes=10))
print('acc', score[1])
# Save the model to disk.
model.save('cnn.h5')

# Load the model from disk later using:
# from keras.models import load_model
# model = load_model('cnn.h5')

# Predict on the first 10 test images.
predictions = model.predict(X_test[:10])

# Print our model's predictions.
print("model's predictions")
print(np.argmax(predictions, axis=1))

# Check our predictions against the ground truths.
print('checker')
print(y_test[:10])
# testAccuracy.py
import numpy as np
from keras.datasets import mnist
from keras.models import load_model

(X_train, y_train), (X_test, y_test) = mnist.load_data()
model = load_model('cnn.h5')

# Preprocess exactly as during training: single channel, float, normalized
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1).astype('float32') / 255

predictions = np.argmax(model.predict(X_test), axis=1)
correct = 0
for i in range(len(y_test)):
    if predictions[i] == y_test[i]:
        correct += 1
print("accuracy" + str(correct / len(predictions)))
#getPics.py
import cv2
from keras.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()

path = "C:\Users\Administrator\Desktop\hwNums"
for i in range(0,100):
    cv2.imwrite(path + str(i) + ".png", X_train[i])
#handWritingNumberRecognizationApplication.py
import os
import threading
from glob import glob

import cv2
import numpy as np
from keras.models import load_model


def readFiles(f_path):
    # Collect all PNG files in the given folder
    return glob(os.path.join(f_path, "*.png"))


def showImages(eles):
    for i in range(len(eles)):
        img = cv2.imread(eles[i])
        cv2.imshow('img' + str(i), adjustImg(img, 640))          # enlarged for viewing
        cv2.imshow('img' + str(i) + str(0), adjustImg(img, 28))  # what the model sees
    cv2.waitKey(0)


def getImages(eles):
    l = []
    for i in range(len(eles)):
        l.append(cv2.imread(eles[i]))
    ans = np.empty((len(l), 28, 28))
    for i, e in enumerate(l):
        e = adjustImg(e, 28)                          # adapt to the model's 28x28 input
        ans[i] = cv2.cvtColor(e, cv2.COLOR_BGR2GRAY)  # BGR to single-channel grayscale
    return ans


def adjustImg(e, target):
    # Halve via the Gaussian pyramid while the image is still larger than target
    while min(e.shape[0], e.shape[1]) > target:
        e = cv2.pyrDown(e)
    # Final adjustment to exactly target x target
    e = cv2.resize(e, (target, target), interpolation=cv2.INTER_CUBIC)
    return e


# C:\workplace\Machine_Vision_Practice\Lec3\pic\hwNums
if __name__ == '__main__':
    path = input("Enter the path to the folder where the handwritten digit pictures are located:")
    files = readFiles(path)
    images = getImages(files)
    t_show = threading.Thread(target=showImages, args=(files,))
    t_show.start()
    img_x, img_y = images.shape[1], images.shape[2]
    images = images.reshape(images.shape[0], img_x, img_y, 1)
    images = images.astype('float32')
    images /= 255

    model = load_model('cnn.h5')

    predictions = model.predict(images)
    print('predictions')
    print(np.argmax(predictions, axis=1))
