XOR and the nonlinearity of neural networks

In deep learning, we often use neural networks to solve various problems. However, not every problem can be solved by a single neuron or a linear model. A well-known example of this is the XOR problem.

Exclusive OR (XOR) is a logical operator that outputs 0 when its two inputs are the same and 1 when they are different. For example, 1 XOR 1 = 0, 1 XOR 0 = 1, 0 XOR 0 = 0, 0 XOR 1 = 1. Now consider the following XOR problem: given two inputs x and y, we want to train a model that outputs their XOR result z = x XOR y. If we treat this as a binary classification problem where the label is 1 when z = 1 and 0 when z = 0, it looks like a very simple classification task. However, if we try to solve it with a single neuron or a linear classifier, we find that we cannot achieve good results.
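Since the whole problem consists of only four input/label pairs, we can write them down directly. A minimal sketch, using Python's built-in ^ operator (which behaves like logical XOR on 0/1 values):

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    z = x ^ y  # bitwise XOR of 0/1 values is the same as logical XOR
    print(f"{x} XOR {y} = {z}")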

A single neuron takes several inputs $x_1, x_2, \dots, x_n$, and each input has a corresponding weight $w_1, w_2, \dots, w_n$. The neuron computes a weighted sum of these inputs and passes it through an activation function $f$ to obtain the final output $y = f(wx + b)$. If we choose a step function as the activation function, we can think of the neuron as a binary classifier.
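As a concrete illustration, here is a minimal sketch of such a neuron in Python (the function names and the example weights below are just illustrative choices, not part of the original text):

import numpy as np

def step(x):
    # Step activation: 1 if the input is positive, otherwise 0
    return np.where(x > 0, 1, 0)

def neuron(x, w, b):
    # A single neuron: weighted sum of the inputs plus a bias, then an activation
    return step(np.dot(w, x) + b)

# Example: with weights (1, 1) and bias -1.5 this neuron computes logical AND
w, b = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", neuron(np.array(x), w, b))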

So, let’s try to solve the XOR problem using a single neuron. For an input vector $(x_1, x_2)$, the target is the XOR result $x_1$ XOR $x_2$. We feed $x_1$ and $x_2$ into the neuron as its two inputs, apply the weights and the activation function, and obtain the output:

$$y = f(w_1 x_1 + w_2 x_2 + b)$$

However, we will find that no matter how we adjust the weights and the bias, the neuron cannot produce the correct output for all four cases. This is because the XOR problem is not linearly separable: on the two-dimensional plane, there is no straight line that completely separates the points labeled 0 from the points labeled 1. A single neuron's decision boundary is exactly such a straight line, so it cannot solve the XOR problem.
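We can convince ourselves of this numerically. The following sketch (an illustrative brute-force search over a coarse grid of weights and biases, not a proof) checks whether any step-activated neuron classifies all four XOR points correctly:

import itertools
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

values = np.linspace(-2, 2, 41)  # candidate weights and biases
found = False
for w1, w2, b in itertools.product(values, repeat=3):
    pred = (X @ np.array([w1, w2]) + b > 0).astype(int)
    if np.array_equal(pred, y):
        found = True
        break

print("Perfect linear separator found:", found)  # no such separator exists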

Next, let’s look at the structure of a linear classifier. Linear classifiers can also be viewed as a form of single neuron.

The structure of a linear classifier is very similar to that of a single neuron; only the activation function differs. A linear classifier usually uses the sign function, whose output is -1 or 1:

$$f(x) = \begin{cases} 1, & x > 0 \\ -1, & x \leq 0 \end{cases}$$

Therefore, we can think of a linear classifier as a special case of a single neuron. Likewise, a linear classifier cannot solve the XOR problem.
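As a quick sketch (the labels are assumed to be -1/1 here, as is conventional for the sign function), the only change relative to the step-function neuron above is the activation:

import numpy as np

def sign(x):
    # Sign activation: 1 for positive inputs, -1 otherwise
    return np.where(x > 0, 1, -1)

def linear_classifier(x, w, b):
    # Same weighted sum as before, just a different activation
    return sign(np.dot(w, x) + b)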

So why can’t a single neuron or a linear classifier solve the XOR problem? Because the XOR points cannot be completely separated by a hyperplane. On the two-dimensional plane, a curved boundary such as an ellipse can completely separate the points labeled 0 from the points labeled 1, but such a boundary is not a hyperplane. No matter how we choose the weights and the bias, a single hyperplane (a straight line in two dimensions) cannot split the two classes.
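For instance, one elliptical boundary that works (the specific coefficients below are just one hand-picked choice for illustration) keeps the points (0, 1) and (1, 0) inside and the other two outside:

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

def inside_ellipse(x1, x2):
    # Ellipse centred on the line x1 + x2 = 1, elongated along the x1 - x2 direction
    return (x1 + x2 - 1) ** 2 + 0.1 * (x1 - x2) ** 2 < 0.25

pred = np.array([int(inside_ellipse(x1, x2)) for x1, x2 in X])
print(pred, "matches labels:", np.array_equal(pred, y))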

There are many ways to solve this problem; the most common is to use a multi-layer neural network. A multi-layer neural network can be viewed as a collection of individual neurons (nodes). Each node receives the outputs of the previous layer as its inputs, performs a weighted sum, and applies an activation function to produce its own output. Stacking such layers one on top of another forms a multi-layer neural network. Through multi-layer networks we can learn more complex feature representations and thereby solve the XOR problem.

The figure below shows the process of a simple two-layer neural network solving the XOR problem:

[Figure: a simple two-layer neural network solving the XOR problem]

First, the input vector $(x_1, x_2)$ is fed into the two nodes of the first layer. Each node performs a weighted sum of the inputs and applies an activation function, producing the outputs $h_1$ and $h_2$. These two outputs are then fed into the single node of the second layer, which again performs a weighted sum and applies an activation function to obtain the final output $y$.

It is worth noting that each node in this simple two-layer network uses a step function as its activation function. Because the step function is discontinuous and non-differentiable, it is not suitable for training in practice; instead we usually use smoother activation functions such as the sigmoid, tanh, or ReLU.
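To make the idea concrete, here is a minimal sketch of such a two-layer step-activation network with hand-chosen weights (one common choice: the first hidden node computes OR, the second computes NAND, and the output node ANDs them together; these particular weights are just one valid assignment):

import numpy as np

def step(x):
    return np.where(x > 0, 1, 0)

def two_layer_xor(x1, x2):
    x = np.array([x1, x2])
    h1 = step(np.dot([1.0, 1.0], x) - 0.5)    # OR:   fires if x1 + x2 > 0.5
    h2 = step(np.dot([-1.0, -1.0], x) + 1.5)  # NAND: fires if x1 + x2 < 1.5
    y = step(1.0 * h1 + 1.0 * h2 - 1.5)       # AND of the two hidden outputs
    return int(y)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(f"{a} XOR {b} -> {two_layer_xor(a, b)}")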

Finally, let’s look at Python code showing how to train a multi-layer neural network to solve the XOR problem:

import numpy as np

# Define inputs and labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Define the neural network structure
input_size = 2
hidden_size = 2
output_size = 1

# Randomly initialize the weights and biases
W1 = np.random.randn(input_size, hidden_size)
b1 = np.zeros((1, hidden_size))
W2 = np.random.randn(hidden_size, output_size)
b2 = np.zeros((1, output_size))

# Define the sigmoid function as the activation function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Define the forward pass; the hidden activations are also returned
# because they are needed for backpropagation
def forward(X):
    # First layer: input to hidden layer
    z1 = np.dot(X, W1) + b1
    h1 = sigmoid(z1)

    # Second layer: hidden layer to output
    z2 = np.dot(h1, W2) + b2
    y_hat = sigmoid(z2)

    return h1, y_hat

# Define the loss function (mean squared error)
def loss(y, y_hat):
    return np.mean((y - y_hat) ** 2)

# Define the backpropagation process
def backward(X, y, h1, y_hat):
    # Second layer: output to hidden layer
    delta2 = (y - y_hat) * y_hat * (1 - y_hat)
    dW2 = np.dot(h1.T, delta2)
    db2 = np.sum(delta2, axis=0, keepdims=True)

    # First layer: hidden layer to input
    delta1 = np.dot(delta2, W2.T) * h1 * (1 - h1)
    dW1 = np.dot(X.T, delta1)
    db1 = np.sum(delta1, axis=0, keepdims=True)

    return dW1, db1, dW2, db2

# Train the neural network
learning_rate = 0.1
for i in range(5000):
    # Forward pass
    h1, y_hat = forward(X)

    # Compute the loss function
    l = loss(y, y_hat)

    # Backpropagation
    dW1, db1, dW2, db2 = backward(X, y, h1, y_hat)

    # Update weights and biases (the deltas already point in the descent direction)
    W1 += learning_rate * dW1
    b1 += learning_rate * db1
    W2 += learning_rate * dW2
    b2 += learning_rate * db2

    # Output the value of the loss function every 1000 iterations
    if i % 1000 == 0:
        print("Iteration {}: Loss = {}".format(i, l))

# Test the neural network
_, y_pred = forward(X)
print("Final predictions:")
print(y_pred)

In the above code, we first define the input matrix X and the corresponding labels y. Then we define the structure of the network: the input layer has 2 nodes, the hidden layer has 2 nodes, and the output layer has 1 node. Next, we randomly initialize the weights W1 and W2 (together with the biases b1 and b2) and define the sigmoid activation function. In the forward pass, we send the input X through the first and second layers to obtain the output y_hat. During backpropagation, we compute the error at each node and update the weights. Finally, we use the trained network to make predictions on the input data and print them out.

It can be seen that to solve the XOR problem with a neural network, we need at least two layers (that is, one hidden layer). This is related to the universal approximation theorem, which states that a neural network with a single hidden layer and enough nodes can approximate any continuous function. This is one of the reasons why multi-layer neural networks are more common than single-layer ones in practical applications.

To summarize, the XOR problem cannot be solved by a single neuron or a linear classifier because it is not linearly separable; it needs to be solved with a multi-layer neural network. By stacking multiple nodes, multi-layer neural networks can learn more complex feature representations and thus solve nonlinear problems.