Linear SVM: Mathematical Principles and Algorithm Implementation

1. Mathematical Principles

SVM is a supervised classification algorithm. Its general idea is: assume there are two classes of points in the sample space, as shown in the figure below. We want to find a separating hyperplane that divides these two classes of samples, and we want the margin between the two classes to be as large as possible, so that the model's generalization ability is as strong as possible.

As shown in the figure above, the positive hyperplane, negative hyperplane and decision hyperplane can be written as w·x + b = 1, w·x + b = -1 and w·x + b = 0 respectively. Assume there are two support vectors Xm and Xn lying on the positive and negative hyperplanes, so that w·Xm + b = 1 (①) and w·Xn + b = -1 (②). Subtracting ② from ① gives w1(X1m - X1n) + w2(X2m - X2n) = 2 (③), which can be written in vector form as w·(Xm - Xn) = 2 (④), where w = (w1, w2), Xm = (X1m, X2m) and Xn = (X1n, X2n).

Next, take any two points Xp = (X1p, X2p) and Xo = (X1o, X2o) on the decision hyperplane. Both satisfy w·x + b = 0, so subtracting the two equations gives w·(Xp - Xo) = 0. Since Xp - Xo is an arbitrary vector lying inside the decision hyperplane, w is perpendicular to the decision hyperplane, that is, w is the normal vector of the decision hyperplane.

Now go back to equation ④. Projecting Xm - Xn onto the unit normal w/||w|| gives the distance d between the positive and negative hyperplanes: d = w·(Xm - Xn)/||w|| = 2/||w||.

Now let's look at the constraints. All points on or above the positive hyperplane have yi = 1, and all points on or below the negative hyperplane have yi = -1. Therefore the constraints can be written as:

yi(w·xi + b) >= 1 for every sample i

So maximizing d = 2/||w|| is transformed into a typical convex optimization problem:

minimize (1/2)||w||^2 subject to yi(w·xi + b) >= 1 for all i

First, we use Lagrange multipliers to rewrite the loss function into a form that takes the constraints into account:

L(w, b, α) = (1/2)||w||^2 - Σi αi [yi(w·xi + b) - 1],  with αi >= 0

The formula above is called the Lagrange function, and αi is called a Lagrange multiplier. At this point we need to solve not only for the parameter vector w and the intercept b, but also for the Lagrange multipliers, while xi and yi are the known feature matrix and labels. Taking the partial derivatives of the Lagrange function with respect to w and b and setting them to zero, we get the following conclusions:

w = Σi αi yi xi  and  Σi αi yi = 0

Here we introduce the Lagrangian dual function. For any Lagrangian function L(x, α), there is a corresponding dual function g(α) whose only parameters are the Lagrange multipliers. If the optimal value of L(x, α) exists and can be written as min L(x, α), and the optimal value of the dual function also exists and can be written as max g(α), then we can define the duality gap, that is, the difference between the optimal value of the Lagrangian function and the optimal value of its dual function:

Δ = min L(x, α) - max g(α)

If Δ = 0, we say that a strong duality relationship holds between L(x, α) and its dual function. In that case we can solve for the optimal value of the dual function instead of the optimal value of the original function: the minimum of the original function is obtained by finding the maximum of the dual function. For strong duality to hold, the KKT conditions must be satisfied:

∂L/∂w = 0 and ∂L/∂b = 0;  αi >= 0;  yi(w·xi + b) - 1 >= 0;  and  αi [yi(w·xi + b) - 1] = 0 for every i

Once the KKT conditions are satisfied, we can find the values of α by maximizing the dual function. After finding α, we can solve for w and b using the expressions above, and thus obtain the expression of the decision boundary. Once the decision boundary is known, it and the positive and negative hyperplanes can be used for classification.
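
To make that last step concrete, the dual problem that is maximized, and the way w and b are recovered from the optimal α, can be summarized as follows. This is the standard hard-margin formulation written in LaTeX, consistent with the expressions above:

\begin{aligned}
\max_{\alpha}\quad & \sum_{i} \alpha_i - \frac{1}{2}\sum_{i}\sum_{j} \alpha_i \alpha_j y_i y_j \, x_i^{\top} x_j \\
\text{s.t.}\quad & \alpha_i \ge 0, \qquad \sum_{i} \alpha_i y_i = 0 \\
& w = \sum_{i} \alpha_i y_i x_i, \qquad b = y_k - w^{\top} x_k \ \text{ for any support vector } x_k
\end{aligned}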

2. Algorithm implementation

Let's first import the required modules:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC
import matplotlib.pyplot as plt
import numpy as np

Use the make_blobs function to generate the sample coordinates, and plot them with plt.scatter:

X,y = make_blobs(n_samples=50, centers=2, random_state=0,cluster_std=0.6)
plt.scatter(X[:,0],X[:,1],c=y,s=50,cmap="rainbow") # color the two classes with the rainbow colormap
plt.xticks([])
plt.yticks([])
plt.show()
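
A quick sanity check, run right after the block above, confirms what make_blobs generated:

print(X.shape)       # (50, 2): 50 samples, each with 2 features
print(np.unique(y))  # [0 1]: the two class labels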

Create an axes (subplot) object for subsequent operations:

ax = plt.gca() # get the current axes; if none exists, a new one is created

To draw the decision hyperplane, we first build a grid of points with the following code block:

#Get the maximum and minimum values of the two coordinate axes on the plane
xlim = ax.get_xlim()
ylim = ax.get_ylim()
 
#Create 30 evenly spaced values between the minimum and the maximum
axisx = np.linspace(xlim[0],xlim[1],30)
axisy = np.linspace(ylim[0],ylim[1],30)
 
axisy,axisx = np.meshgrid(axisy,axisx)
#The two-dimensional arrays formed here will be used as X and Y in the contour function
#meshgrid broadcasts the two one-dimensional vectors against each other,
#giving the x and y coordinates of a grid of len(axisy) * len(axisx) points
 
xy = np.vstack([axisx.ravel(), axisy.ravel()]).T
#ravel() flattens an array to one dimension; vstack stacks one-dimensional arrays of the same length row by row
#xy is the finished grid: a dense collection of points spread across the entire canvas
 
plt.scatter(xy[:,0],xy[:,1],s=1) # visualize the grid points
 
#To understand what meshgrid and vstack do:
a = np.array([1,2,3])
b = np.array([7,8])
#How many coordinate pairs are obtained by combining a and b pairwise?
#The answer is 6, namely (1,7),(2,7),(3,7),(1,8),(2,8),(3,8)
 
v1,v2 = np.meshgrid(a,b)
 
print(v1) # [[1 2 3]
          #  [1 2 3]]
print(v2) # [[7 7 7]
          #  [8 8 8]]
 
v = np.vstack([v1.ravel(), v2.ravel()]).T
print(v)  # the six coordinate pairs listed above

Next draw the decision boundary with the following code block:

#Fit the model; fit computes the corresponding decision boundary
clf = SVC(kernel = "linear").fit(X,y)
Z = clf.decision_function(xy).reshape(axisx.shape)
#The important interface decision_function returns the signed distance of each input sample to the decision boundary
#The distances are then reshaped to match axisx, because the contour function requires Z to have the same shape as X and Y

#First there must be a scatter plot
plt.scatter(X[:,0],X[:,1],c=y,s=50,cmap="rainbow")
ax = plt.gca() # get the current axes; if none exists, a new one is created
#Draw the decision boundary and the hyperplane parallel to the decision boundary
ax.contour(axisx,axisy,Z
           ,colors="k"
           ,levels=[-1,0,1] # draw three contour lines, at Z = -1, Z = 0 and Z = 1
           ,alpha=0.5 # transparency
           ,linestyles=["--","-","--"])
 
ax.set_xlim(xlim) # restore the x-axis limits
ax.set_ylim(ylim) # restore the y-axis limits
plt.show()

You can see that the decision boundary and positive and negative hyperplanes have been drawn.

We can package the above process into a function:

#Wrap the above process into a function:
def plot_svc_decision_function(model,ax=None):
    if ax is None:
        ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    
    x = np.linspace(xlim[0],xlim[1],30)
    y = np.linspace(ylim[0],ylim[1],30)
    Y,X = np.meshgrid(y,x)
    xy = np.vstack([X.ravel(), Y.ravel()]).T
    P = model.decision_function(xy).reshape(X.shape)
    
    ax.contour(X, Y, P,colors="k",levels=[-1,0,1],alpha=0.5,linestyles=["--","-","--"])
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
 
#Then the entire drawing process can be written as:
plt.scatter(X[:,0],X[:,1],c=y,s=50,cmap="rainbow") # Draw a scatter plot
clf = SVC(kernel = "linear").fit(X,y) # Calculate the decision boundary
plot_svc_decision_function(clf) # Draw the decision boundary
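
As a small optional extension (not part of the packaged function above), the support vectors themselves can be circled on the same figure using the model's support_vectors_ attribute; a minimal sketch, continuing directly from the three lines above:

# circle the support vectors returned by the fitted model with hollow markers
plt.gca().scatter(clf.support_vectors_[:,0], clf.support_vectors_[:,1],
                  s=200, facecolors="none", edgecolors="k")
plt.show()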

The following are some useful methods and attributes of clf:

clf.predict(X)
#Classify the samples in X according to the decision boundary; the returned array has shape (n_samples,)
 
clf.score(X,y)
#Return the mean accuracy on the given test data and labels
 
clf.support_vectors_
#Return the coordinates of the support vectors
 
clf.n_support_ # e.g. array([2, 1])
#Return the number of support vectors in each class
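
Finally, for a linear kernel the fitted model also exposes the quantities from the derivation in section 1: coef_ is the normal vector w, intercept_ is b, and dual_coef_ holds the products αi·yi for the support vectors, so the relation w = Σi αi yi xi can be checked numerically. A minimal sketch, assuming clf is the linear SVC fitted above:

w = clf.coef_[0]        # the normal vector w of the decision hyperplane (linear kernel only)
b = clf.intercept_[0]   # the intercept b
print("w =", w, " b =", b)  # parameters of the decision hyperplane w·x + b = 0
 
# dual_coef_ stores alpha_i * y_i for each support vector,
# so w should equal dual_coef_ @ support_vectors_
print(np.allclose(w, clf.dual_coef_ @ clf.support_vectors_))  # expected: True
 
# the margin width d = 2/||w|| from the derivation above
print("margin width:", 2 / np.linalg.norm(w))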