Python Artificial Intelligence in Practice: Intelligent Monitoring

1. Background Introduction

Monitoring refers to the process of measuring, recording, and observing the characteristics, status, or behavior of an object. Traditional manual monitoring methods can no longer meet modern needs, and people are gradually turning to digital approaches such as photoelectric sensing. Computer vision, pattern recognition, data analysis, machine learning, and related technologies have become the basis for building intelligent monitoring systems.

In order to implement an intelligent monitoring system, the following key elements are required:

  1. Data collection by sensors: converting information about an object or environment into a form that a computer can process (signal processing). For example, RGB three-channel image data is streamed to a computer, and key features are extracted through color space conversion (a minimal sketch follows this list).
  2. Data storage and retrieval: store the collected raw data and summarize it over fixed time intervals. The retained information about objects or people can later be used to train machine learning models.
  3. Model establishment: model the collected data, performing classification, clustering, regression, and other processing to obtain a prediction model.
  4. Algorithm control: run the control algorithm automatically to carry out the monitoring tasks.
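
As a minimal sketch of the first element (assuming OpenCV and NumPy are installed; the file name is only a placeholder for a frame coming from a sensor), the following converts a captured frame into the HSV color space and extracts a simple color feature:

import cv2
import numpy as np

frame = cv2.imread("frame.jpg") # Placeholder; in practice the frame comes from a camera stream
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV) # Color space conversion (OpenCV reads images as BGR)
mean_hue = float(np.mean(hsv[:, :, 0])) # A simple feature extracted from the frame
print("mean hue:", mean_hue)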

Sensors, data processing, and algorithms are the basic building blocks of an intelligent monitoring system, but combining these technologies into a complete system is the hard part of the project. There is no universally accepted solution yet; the practical goal is to develop a system that reaches a workable level of capability. This article takes scene object detection as an example and demonstrates how to build a face detection system based on deep learning.

2. Core concepts and connections

Deep Learning: Deep learning is a machine learning method that uses multi-layer neural networks to learn hierarchical representations and can process high-dimensional input data. Recent deep learning research is increasingly applied to images, text, sound, and other domains.

Face detection: The goal is to let the computer find and understand the faces in a picture. Face detection is currently one of the most popular directions in image recognition. The principle is to have the computer identify face regions in a picture; the recognition results include the location of each face and its attributes, such as eyes, mouth, and eyebrows.

Convolutional Neural Network (CNN): A convolutional neural network is a deep learning model built from convolutional layers, pooling layers, and fully connected layers. A CNN can effectively extract the features of interest from an image.
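
As a rough sketch of this structure (not the pre-trained model used later in this article; the layer sizes here are assumptions), a minimal CNN in Keras could look like this:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Minimal CNN: convolutional layer -> pooling layer -> fully connected layers.
model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(128, 128, 3)), # convolution layer
    MaxPooling2D(pool_size=(2, 2)),                                   # pooling layer
    Flatten(),
    Dense(64, activation="relu"),                                     # fully connected layer
    Dense(2, activation="softmax"),                                   # face / not-face scores
])
model.compile(optimizer="adam", loss="categorical_crossentropy")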

Bounding-box Regression Network (BBOX-RNet): The bounding-box regression network is used to regress the bounding box, that is, to determine the position of the face. Its basic structure is two convolutional layers followed by two fully connected layers, with the convolutional layers in front and the fully connected layers behind. Its output is a regression value representing the relative position of the face region within the whole image.
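
The article does not give exact layer sizes, so the sketch below is only an illustration of that layout (the filter counts and unit counts are assumptions; the four sigmoid outputs are interpreted as relative box coordinates):

from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense

# Sketch of a bounding-box regression network: convolution in front, fully connected behind.
bbox_net = Sequential([
    Conv2D(16, (3, 3), activation="relu", input_shape=(128, 128, 3)),
    Conv2D(32, (3, 3), activation="relu"),
    Flatten(),
    Dense(128, activation="relu"),
    Dense(4, activation="sigmoid"), # relative (x, y, width, height) of the face region
])
bbox_net.compile(optimizer="adam", loss="mse")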

3. Detailed explanation of core algorithm principles, specific operation steps and mathematical model formulas

First, prepare the image to be detected and resize it to a uniform size, for example 128 x 128. Then load the pre-trained CNN model and set the corresponding parameters. Next, feed the image into the CNN model to obtain a feature map. We do not need to interpret the model's final output directly; we only use the feature map to extract features of the face region.

Assume that the output feature map has size $H \times W \times C$: a grid of $H \times W$ spatial positions, each composed of $C$ channels. If the input image has height $H_{img}$ and width $W_{img}$ and the network's total strides are $s_h$ and $s_w$, then the spatial size of the feature map is $H_{img}/s_h \times W_{img}/s_w$. For example, if the image width is 128 and the stride is 32, the feature map width is $128/32 = 4$.
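
A few lines of Python make the arithmetic concrete (the numbers are the example from the text):

img_h, img_w = 128, 128 # input image size
s_h, s_w = 32, 32 # total stride of the network
feat_h, feat_w = img_h // s_h, img_w // s_w # feature map size
print(feat_h, feat_w) # 4 4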

On the feature map we apply max pooling, which keeps the largest feature value within each small window and thereby filters out unimportant details. For example, a face region of the feature map may contain many small activations; where a particularly large value stands out, that location can be treated as part of the face region. For max pooling, the stride commonly defaults to 2, i.e. one value is taken for every 2 pixels.
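
As a small worked example (pure NumPy, independent of the TensorFlow code later in this article), 2 x 2 max pooling with stride 2 keeps only the largest value in each 2 x 2 block:

import numpy as np

feat = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 2],
                 [0, 1, 9, 8],
                 [2, 3, 7, 5]], dtype=float) # a 4x4 single-channel feature map

# Split into 2x2 blocks and take the maximum of each block.
pooled = feat.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled) # [[6. 2.]
              #  [3. 9.]]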

Now that we have the feature values of the face region, we can use other algorithms to match them against a real face. We still need to locate the face region on the original image, and for this we use BBOX-RNet for regression. The structure of BBOX-RNet is similar to a CNN, with one convolutional layer and three fully connected layers: the convolutional layer extracts features, and the fully connected layers regress the bounding box. The output is a regression value representing the relative position of the face region in the original image.

Finally, we need to draw the detected region and attach the corresponding text label; existing drawing or annotation utilities can be used for this step.

4. Specific code examples and detailed explanations

First, import the relevant libraries:

import cv2
from matplotlib import pyplot as plt
import numpy as np
import tensorflow as tf
from keras.models import load_model

Here, cv2 is used to read images; matplotlib is used to display them; numpy is used for matrix operations; tensorflow provides the max pooling operation used below; keras is used to load the trained CNN model.

Then, read the image and set the preprocessing parameters:

img = cv2.imread("test.jpg") # Read the image to be detected (placeholder file name)
size = (128,128) # Set a uniform size
img = cv2.resize(img, size) # Resize the image to the uniform size

Here, the image is read (the file name above is only a placeholder; substitute your own path) and resized to a uniform size for later processing.

Load the trained CNN model and set the corresponding parameters:

cnn = load_model("facenet_keras.h5") # Load the trained CNN model
input_shape = cnn.layers[0].output_shape[1:3] # Get the size of the input image

Here, we load the trained CNN model and read the expected input image size from it. Because the model is already trained, the input dimensions can be obtained directly; if you need a different model, you can retrain one or download another pre-trained model.

Input the image into the CNN model to get the feature map:

inputs = np.zeros((1, input_shape[0], input_shape[1], 3)) # Create input array
inputs[0] = img / 255 - 0.5 # Preprocess the image
outputs = cnn.predict(inputs)[0][:, :, :-1] # Take the first batch element of the output feature map and drop the last channel

Here, we create an input array, preprocess the image, and feed the preprocessed image into the CNN model. The result is a three-dimensional array: the first two dimensions correspond to the height and width of the feature map, and the third dimension corresponds to the number of channels. Indexing with [0] removes the batch dimension, and [:, :, :-1] drops the last channel.

Next, we need to find the feature values of the face region, so we also set the max pooling parameters. Because the CNN output is large and neighboring values are very similar, applying max pooling to the result filters out some unimportant detail.

pool_size = (7,7) # Set the max pooling window size
strides = pool_size # Set the stride equal to the window size
outputs = tf.nn.max_pool(outputs[np.newaxis], ksize=[1] + list(pool_size) + [1], strides=[1] + list(strides) + [1], padding='SAME')[0, :, :, :] # Add a batch dimension, pool, then remove it again

Here, we set the max pooling parameters and use TensorFlow's API to perform the pooling. The feature map is given a batch dimension before pooling, and indexing with [0, :, :, :] removes that batch dimension again afterwards.

Now that we have obtained the feature values of the face region, we can use other algorithms to match them against a real face. We first initialize candidate face regions, and each iteration moves them closer to the true face position. After each iteration, we check whether each candidate region still lies within the image before continuing to the next iteration.

Finally, we draw the detected areas and give corresponding text labels.

# `faces` is assumed to be defined elsewhere: a list of candidate face regions, each with a
# bounding box under the key 'bb' (interpreted below as center x, center y, width, height).
for i in range(10):
    bbox = []
    for j in range(len(faces)):
        y_min = faces[j]['bb'][1] - faces[j]['bb'][3]/2 * img.shape[0]
        y_max = faces[j]['bb'][1] + faces[j]['bb'][3]/2 * img.shape[0]
        x_min = faces[j]['bb'][0] - faces[j]['bb'][2]/2 * img.shape[1]
        x_max = faces[j]['bb'][0] + faces[j]['bb'][2]/2 * img.shape[1]
        height = y_max - y_min
        width = x_max - x_min
        center_y = (y_max + y_min)/2
        center_x = (x_max + x_min)/2
        y_shift = ((height//input_shape[0]) + 1)*center_y % img.shape[0]-height/2*img.shape[0]
        x_shift = ((width//input_shape[1]) + 1)*center_x % img.shape[1]-width/2*img.shape[1]
        bbox += [[y_min + y_shift, x_min + x_shift]]

    # `sess`, `logits` and `images_ph` are assumed to come from a separately built
    # TensorFlow graph/session that scores each candidate region.
    outputs = sess.run(logits, feed_dict={images_ph: inputs})
    face_score = outputs[:, :2]
    landmark_score = outputs[:, 2:]

    score = face_score + landmark_score
    label = np.argmax(score, axis=1).astype(np.int32)

    if not len(label): break

    idx = np.where(label==1)[0][0]
    bbox = bbox[idx]
    score = score[idx, :]

    # `drawBox` is assumed to be a helper that draws bounding boxes; `image` is the picture being annotated.
    image = drawBox(image, [bbox])
    text = 'face' if score[1]>score[0] else 'not face'
    color = (255,0,0) if score[1]>score[0] else (0,255,0)
    font = cv2.FONT_HERSHEY_SIMPLEX
    cv2.putText(image, text, (int(bbox[1]), int(bbox[0])), font, 0.5, color, thickness=1) # putText expects an (x, y) origin, so the coordinates are swapped

plt.imshow(image[...,::-1]); plt.show()

Here, each iteration refines the candidate face regions, scores them with the model, selects the best-scoring region, and draws it on the image together with its text label. Since matplotlib displays images in RGB order by default while OpenCV stores them as BGR, we reverse the channel order before displaying the result.