logistic and softmax linear classification

logistic regression

A hyperplane w^T x + b = 0 divides the feature space into two parts: the region where w^T x + b > 0 is the positive half-space, and the region where w^T x + b < 0 is the negative half-space.

Here w is the normal vector of the hyperplane (usually normalized to unit length), which determines the orientation of the hyperplane; b is the bias, which determines the distance between the hyperplane and the origin. Gradient descent is used to update w and b.

Hyperplane equation: w^T x + b = 0; distance from a point x to the hyperplane: |w^T x + b| / ||w||, which reduces to |w^T x + b| when w has unit length.
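As a quick illustration (not from the original text), the sign of w^T x + b tells which half-space a point falls in; the values of w, b and x below are arbitrary assumptions:

import numpy as np

# Arbitrary example hyperplane: w is the normal vector (unit length), b the bias
w = np.array([0.6, 0.8])
b = -1.0

# An arbitrary sample point
x = np.array([2.0, 1.0])

score = np.dot(w, x) + b                   # signed value of w^T x + b
distance = abs(score) / np.linalg.norm(w)  # distance from x to the hyperplane
side = "positive half-space" if score > 0 else "negative half-space"
print(score, distance, side)               # 1.0 1.0 positive half-space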

softmax function

The softmax (normalized exponential) function expresses the result of multi-class classification as probabilities. It applies a weight matrix to the feature vector and finally yields the probability of the sample belonging to each label.

$$
w = \begin{bmatrix}
w_{1,1} & w_{1,2} & \cdots & w_{1,n} \\
w_{2,1} & w_{2,2} & \cdots & w_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
w_{m,1} & w_{m,2} & \cdots & w_{m,n}
\end{bmatrix}
\qquad
w^T x_i = \begin{bmatrix}
z_1 = w_1^T x_i \\
z_2 = w_2^T x_i \\
\vdots \\
z_n = w_n^T x_i
\end{bmatrix}
$$
The entries of the vector z are the scores of the sample x_i for each category; the following formula converts z into the probability y of each label.

$$
y_k^i = p(y = k \mid x^i) = \frac{\exp(z_k)}{\sum_{j=1}^{n} \exp(z_j)}
\qquad
Y = \begin{bmatrix}
y_1 \\ y_2 \\ \vdots \\ y_n
\end{bmatrix}
$$
The benefits of this are:

1) The predicted probabilities are non-negative: the exp function maps every score to a positive value.

2) The predicted probabilities sum to 1: each term is divided by the sum of all terms (a small numerical check is sketched below).
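A minimal numpy sketch of the softmax computation (illustrative only; the score vector z below is an arbitrary assumption):

import numpy as np

# Arbitrary score vector z for three classes
z = np.array([2.0, 1.0, 0.1])

# Softmax: exponentiate, then normalize by the sum
y = np.exp(z) / np.sum(np.exp(z))

print(y)        # every entry is non-negative
print(y.sum())  # sums to 1 (up to floating-point rounding)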

Experiment

1. Load the Iris dataset

To train the model, the dataset is loaded first. Preprocessing is skipped here; strictly speaking, the data should also be split into a training set and a validation set.

# -*- coding: utf-8 -*-
# Import the datasets package
from sklearn import datasets

# Load the iris dataset
iris = datasets.load_iris()
# Data: sepal length, sepal width, petal length, petal width
xi = iris.data
yi = iris.target  # labels
target_names = iris.target_names  # species names
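As a quick check (not in the original code), the shapes of the loaded arrays can be printed; the iris dataset contains 150 samples with 4 features and 3 species:

# Inspect the loaded data (continues the script above)
print(xi.shape)        # (150, 4)
print(yi.shape)        # (150,)
print(target_names)    # ['setosa' 'versicolor' 'virginica']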

2. Logistic regression

A logistic-regression-based multi-class classifier is implemented with numpy to solve the iris classification problem. Since logistic regression only handles binary classification, three binary classifiers are trained and their results are combined.
Three logistic regression models are defined, distinguishing the label pairs 0/1, 0/2 and 1/2 respectively:

1) If the sample label is 0: model 0 predicts 0, model 1 predicts 0, model 2 predicts 1 or 2;

2) If the sample label is 1: model 0 predicts 1, model 1 predicts 0 or 2, model 2 predicts 1;

3) If the sample label is 2: model 0 predicts 0 or 1, model 1 predicts 2, and model 2 predicts 2.

Therefore it suffices to combine the predictions of the three models and take the label that appears most often as the prediction for the sample (a voting sketch is given after the training code below).

import numpy as np

# Random initialization; xi_0 is assumed to be the feature matrix of the samples
# used by one of the binary classifiers (e.g. the samples with labels 0 and 1)
model = np.random.random(size=xi_0.shape[1])
# Model parameter transposition
wt = model.transpose()

# Encapsulate the calculation of predicted probability as a function
# x prediction sample, wt model parameter transposition
def predict(x, wt):
    # Multiply the feature matrix with wt to get the prediction vector
    target_result = np.matmul(x,wt)

    # Use the sigmoid activation function to map to the 0-1 interval
    p = []
    for i in range(len(target_result)):
        p.append(1/(1 + np.exp(-target_result[i])))
    return p

# learning rate
a = 0.01
# Update the model weights once (vectorized gradient step)
def update(wt, p, xi, y):
    wt = wt - a * 2 / np.array(xi).shape[0] * np.matmul(np.array(xi).transpose(), (np.array(p) - y))
    return wt

# x training samples, y sample labels, epoch training times
def model_train(x, y, epoch):
    # Randomly initialize the model (one weight per feature)
    model = np.random.random(size=x.shape[1])
    # Model parameter transposition
    wt = model.transpose()
    
    # training loop
    for i in range(epoch):
        # Make predictions based on current parameters
        p = predict(x,wt)
        # gradient update
        wt = update(wt, p, x, y)
    return wt
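The combination of the three binary classifiers described above is not shown in the original code. The following is a minimal sketch under some assumptions: the pairwise subsets are built with boolean masks, the epoch count of 1000 is arbitrary, a sigmoid output above 0.5 is read as the second label of each pair, and all names introduced here (mask_01, wt_01, vote, ...) are illustrative:

# Pairwise training subsets (assumed construction; pair labels mapped to 0/1)
mask_01 = (yi == 0) | (yi == 1)
mask_02 = (yi == 0) | (yi == 2)
mask_12 = (yi == 1) | (yi == 2)

wt_01 = model_train(xi[mask_01], (yi[mask_01] == 1).astype(float), 1000)  # model 0: 0 vs 1
wt_02 = model_train(xi[mask_02], (yi[mask_02] == 2).astype(float), 1000)  # model 1: 0 vs 2
wt_12 = model_train(xi[mask_12], (yi[mask_12] == 2).astype(float), 1000)  # model 2: 1 vs 2

# Majority vote over the three pairwise predictions for a single sample x
def vote(x):
    votes = [
        1 if predict([x], wt_01)[0] > 0.5 else 0,  # 0 vs 1
        2 if predict([x], wt_02)[0] > 0.5 else 0,  # 0 vs 2
        2 if predict([x], wt_12)[0] > 0.5 else 1,  # 1 vs 2
    ]
    return max(set(votes), key=votes.count)        # most frequent label wins

# Example: predicted labels for the whole dataset
# pred = [vote(x) for x in xi]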

3. Softmax function

Given a sample, w is updated column by column (one weight vector per class). For the cross-entropy loss with softmax, the gradient with respect to the weight vector of the true class k is (y_k - 1)·x, and for every other class j it is y_j·x. So if the given sample label is 0, the column w_0 is updated as w_0 ← w_0 - a·(y_0 - 1)·x, and each remaining column w_j as w_j ← w_j - a·y_j·x, which is exactly what update_2 below implements.

import math

# Compute a 3x1 vector of class probabilities for one sample
def predict_2(wt, x):
    # Raw scores: z = w^T x
    vec = np.matmul(wt, x)
    # Convert the scores to a probability for each category (softmax)
    y = []
    for i in range(len(vec)):
        y.append(math.exp(vec[i]) / (math.exp(vec[0]) + math.exp(vec[1]) + math.exp(vec[2])))
    return y

# wt is the transpose of the weight matrix, y is the predicted probability vector,
# xi is a single sample's features, and yi is the true class of the sample
def update_2(wt, y, xi, yi):
    a = 0.01
    # yi takes the value 0, 1 or 2; update the weight vector of the true class
    wt[yi] = wt[yi] - a * (y[yi] - 1) * xi
    # Update the weight vectors that do not correspond to the true class
    for i in range(3):
        if i != yi:
            wt[i] = wt[i] - a * y[i] * xi
    return wt

def model_train_2(epoch):
    # Create a random (4, 3) weight matrix: 4 features, 3 classes
    w = np.random.random(size=(4, 3))
    # Matrix transpose, so each row of wt is the weight vector of one class
    wt = w.transpose()
    # Training loop: each epoch uses every sample once
    for i in range(epoch):
        for j in range(len(xi)):
            y = predict_2(wt, xi[j].transpose())
            wt = update_2(wt, y, xi[j].transpose(), yi[j])
    return wt.transpose()

4. Visualization

  • For visualizing the experimental data, since the features have 4 dimensions, PCA is first applied to reduce the data to 2 dimensions, and the points are drawn in different colors according to their sample labels.

    For visualizing the model's prediction results, the predicted labels are collected into a list after feeding in the features, and the same data_view() method can be called on them (a sketch follows the code below).

    # Import pandas library, data visualization
    import pandas as pd
    import matplotlib.pyplot as plt
    
    # import numpy package
    import numpy as np
    
    # PCA dimensionality reduction
    from sklearn.decomposition import PCA  # Load the PCA algorithm package
    pca = PCA(n_components=2)  # Set the number of principal components after reduction to 2
    reduced_x = pca.fit_transform(xi)  # Reduce the dimensionality of the samples
    
    # Enter the label of the predicted sample and draw the image
    def data_view(target):
        # Draw scatterplots in batches
        color = ['red', 'green', 'blue']  # one color per class
        # Prevent the incoming target from being a list
        target = np.array(target)
        for i in range(3):
            plt.scatter(reduced_x[target==i][:,0],reduced_x[target==i][:,1], c=color[i], label=target_names[i])
    
        # place the legend
        plt.legend(loc='best')
        # add the title
        plt.suptitle("iris data")
        # display the image
        plt.show()
        
    data_view(yi)
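
As mentioned above, the prediction results can be passed to data_view() as well. A minimal sketch, assuming the softmax model from section 3 has been trained with model_train_2 (the epoch count of 100 is arbitrary):

# Train the softmax classifier and collect its predicted labels
w = model_train_2(100)
wt = w.transpose()
pred = [int(np.argmax(predict_2(wt, x))) for x in xi]

# Plot the PCA-reduced data colored by the predicted label
data_view(pred)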
    

5. Existing problems

The data was not split into training, validation and test sets; the models are trained and visualized on the full dataset.
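A minimal sketch of how such a split could be added with scikit-learn's train_test_split (the 60/20/20 proportions and random_state are arbitrary choices):

from sklearn.model_selection import train_test_split

# Hold out 20% as the test set, then split the remainder into train/validation
x_trainval, x_test, y_trainval, y_test = train_test_split(xi, yi, test_size=0.2, random_state=0)
# 0.25 of the remaining 80% equals 20% of the full data, giving a 60/20/20 split
x_train, x_val, y_train, y_val = train_test_split(x_trainval, y_trainval, test_size=0.25, random_state=0)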