CS231n-Course Note01-Image Classification with Linear Classifiers

Core Concepts:

  • K-Nearest Neighbor
  • Linear classifiers: SVM
  • Linear classifiers: Softmax
  • Two-layer neural network
  • Image features

As a core task in Computer Vision, Image Classification has two basic data-driven approaches: the K-nearest neighbor classifier and the linear classifier.

In Machine Learning, the Data-Driven approach can be summarized into three key steps:

  1. Collect a dataset of images and labels
  2. Use Machine Learning algorithms to train a classifier
  3. Evaluate the classifier on new images

Image Classification

Let’s talk about the Nearest Neighbor Classifier first.

Nearest Neighbor Classifier

Suppose we have a set of images containing different classes of objects, and we want to find the images of cats.

In the first step of the data-driven approach to machine learning, we need to label all the images.

Here, we have 5 classes for training, which are labeled.

Training data and target image example

To find the target images, the core idea of the Nearest Neighbor Classifier is to compute a distance between the target (test) image and every training image, and predict the label of the closest one.

To calculate the distance, we can work directly with matrices.

Subtracting a training image from the test image, channel by channel and pixel by pixel, gives the differences from which the distance is computed. Both the test image and the training images are stored as matrices: the training data consists of examples and their labels, where the examples have shape N x D (N is the number of training examples, D is the number of features) and the labels form a 1-dimensional array of size N.

import numpy as np

class NearestNeighbor:
    def __init__(self):
        pass

    def train(self, X, y):
        # X is (N, D), y is (N,); the nearest neighbor classifier
        # simply memorizes all of the training data
        self.Xtr = X
        self.ytr = y

    def predict(self, X):
        num_test = X.shape[0]

        # make sure the output type matches the label type
        Ypred = np.zeros(num_test, dtype=self.ytr.dtype)

        for i in range(num_test):
            # L1 distance: sum of absolute differences along the
            # feature axis (axis=1)
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            # index of the smallest distance: the smaller, the closer
            min_index = np.argmin(distances)
            # predict the label of the nearest training example
            Ypred[i] = self.ytr[min_index]

        return Ypred
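As a minimal usage sketch (the random arrays below are placeholders standing in for flattened CIFAR-10 data), the classifier could be run like this:

import numpy as np

# placeholder data: each image flattened into 3072 features
Xtr = np.random.randn(500, 3072)           # 500 training examples
ytr = np.random.randint(0, 10, size=500)   # 10 class labels
Xte = np.random.randn(50, 3072)            # 50 test examples
yte = np.random.randint(0, 10, size=50)

nn = NearestNeighbor()
nn.train(Xtr, ytr)                  # memorize the training data
yte_pred = nn.predict(Xte)          # predict a label for every test example
accuracy = np.mean(yte_pred == yte) # fraction of correctly predicted labels
print('accuracy: %f' % accuracy)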

Intuitively, let’s use a diagram to explain the calculation of distance.

Here, each image has size 3x32x32 (color channels, height in pixels, width in pixels). The X_{tr} shape is NxD, as discussed above: N is the number of examples and D is the number of features per image, which is 3x32x32 = 3072 in this example.

X_{tr} - X[i,:] subtracts X[i,:] from each row of X_{tr} (broadcasting the test example against every training example); we then take the absolute value of each element and sum along the horizontal (feature) axis to get the distances. The argmin() method returns the index of the smallest distance, and that index is used to look up the label in y_{tr}, which becomes the predicted label of the i-th example.

Example image classification dataset: CIFAR-10.

This dataset consists of 10 classes. The images of each class are shown below:

Suppose now we are given the CIFAR-10 dataset for training and the images on the right side for prediction. In the nearest neighbor classifier, we calculate the distance between each training image and each image on the right side using the L1 distance:

d_1(I_1,I_2) = \sum_p|I_1^p-I_2^p|

We then find the smallest distance; the label of the training image with the smallest distance becomes the label of the predicted image.

The choice of distance:

There are many ways to compute the distance between two vectors.

Another common choice is L2 distance.

d_2(I_1,I_2) = \sqrt{\sum_p(I_1^p-I_2^p)^2}

distances = np.sqrt(np.sum(np.square(self.Xtr - X[i,:]), axis = 1))

K-Nearest Neighbor

Instead of using only the label of the single nearest image, the K-Nearest Neighbor classifier finds the top K closest images and lets them vote on the label of the test image.

Intuitively, higher values of k have a smoothing effect that makes the classifier more resistant to outliers [1].

Here’s an example comparing an NN classifier with a 5-NN classifier. There are 3 classes (red, blue, green), and the colored regions show the decisions induced by these data. In the NN classifier, every data point carves out its own colored region, even the outlier points. The 5-NN classifier smooths over these irregularities: each region’s color is decided by a vote among the 5 closest data points.

def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    # Compute the L2 distance between all test points and all training
    # points without using any explicit loops, and store the result in dists.
    #
    # Output: sqrt((x - y)^2), where (x - y)^2 = x^2 + y^2 - 2xy
    test_sum = np.sum(np.square(X), axis=1)               # shape (num_test,)
    train_sum = np.sum(np.square(self.X_train), axis=1)   # shape (num_train,)
    inner_product = np.dot(X, self.X_train.T)             # shape (num_test, num_train)
    # broadcast the squared norms against the inner products
    dists = np.sqrt(-2 * inner_product + test_sum.reshape(-1, 1) + train_sum)
    return dists

def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Input:
    dists - A num_test x num_train array where dists[i, j] gives the distance
            between the ith test point and the jth training point.

    Output:
    y - A vector of length num_test where y[i] is the predicted label for the
        ith test point.
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in range(num_test):
        # labels of the k nearest neighbors of the ith test point
        y_indices = np.argsort(dists[i, :], axis=0)
        closest_y = self.y_train[y_indices[:k]]
        # majority vote; labels must be non-negative integers for bincount
        y_pred[i] = np.argmax(np.bincount(closest_y))
    return y_pred

But what value of k is the best choice? And what is the best distance to use? (L1, L2)

These are all called hyperparameters.

Validation sets for Hyperparameter tuning

In the previous sections, we talked about training data and test data. The classifier is learned from the training data and is then evaluated on the test data. The test data cannot be used to tweak the hyperparameters.

So we introduce a new split called the validation set, which is carved out of the training data and is smaller.

If the training data is small, cross-validation might be the better way to tune the hyperparameters.

Cross-validation

The idea of cross-validation is to split the training data into folds, try each fold in turn as the validation set, and average the results, as sketched below.
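A minimal sketch of 5-fold cross-validation for choosing k (the random data, the candidate k values, and the L1-distance k-NN below are illustrative choices, not the course code):

import numpy as np

# placeholder data standing in for flattened CIFAR-10 images
Xtr = np.random.randn(200, 3072)
ytr = np.random.randint(0, 10, size=200)

num_folds = 5
k_choices = [1, 3, 5, 8, 10]
X_folds = np.array_split(Xtr, num_folds)
y_folds = np.array_split(ytr, num_folds)

for k in k_choices:
    accuracies = []
    for i in range(num_folds):
        # fold i is the validation set, the rest is the training set
        X_val, y_val = X_folds[i], y_folds[i]
        X_train = np.concatenate(X_folds[:i] + X_folds[i + 1:])
        y_train = np.concatenate(y_folds[:i] + y_folds[i + 1:])

        y_pred = np.zeros(X_val.shape[0], dtype=y_train.dtype)
        for j in range(X_val.shape[0]):
            # L1 distances from the j-th validation example to all training examples
            dists = np.sum(np.abs(X_train - X_val[j, :]), axis=1)
            closest_y = y_train[np.argsort(dists)[:k]]     # labels of the k nearest
            y_pred[j] = np.argmax(np.bincount(closest_y))  # majority vote
        accuracies.append(np.mean(y_pred == y_val))
    print('k = %d, mean accuracy = %f' % (k, np.mean(accuracies)))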

Linear Classifier

In the previous section, we talked about using distances to predict labels. kNN has two drawbacks: first, we need space to store all of the training data for future comparison with the test data, and the larger the dataset, the more memory this costs; second, comparing a test example against every training example is computationally expensive at prediction time.

To develop a more powerful approach, two major components are needed: 1. a score function, which maps the raw data to class scores; 2. a loss function, which quantifies the difference between the predicted scores and the actual labels.

Parameterized mapping from images to label scores

Let’s take a concrete example. In the CIFAR-10 dataset, each image has size 3x32x32, so D = 3072. The training dataset consists of images x_i\in R^D, each associated with a label y_i,(i=1...N), y_i\in 1...K. In CIFAR-10, N = 50,000 and K = 10.

In the linear classifier, we introduce two new parameters: a weight matrix W and a bias vector b, which shifts the scores independently of the input.

The score function can be defined as:

f(x_i,W,b) = Wx_i + b

In this function, the shape of x_i is (3072, 1), W is (10, 3072), and b is (10, 1), so the shape of f(x_i, W, b) is (10, 1).
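A minimal sketch of these shapes (the random arrays below are placeholders for learned parameters and a real image):

import numpy as np

# placeholder parameters and input with the CIFAR-10 shapes discussed above
W = np.random.randn(10, 3072) * 0.0001  # one row of weights per class
b = np.random.randn(10, 1)              # one bias per class
x_i = np.random.randn(3072, 1)          # a flattened 3x32x32 image as a column vector

scores = W.dot(x_i) + b                 # f(x_i, W, b) = W x_i + b
print(scores.shape)                     # (10, 1): one score per class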

The following diagram shows an example of mapping an image to class scores.

In this example, we assume the image only has 4 pixels, not considering the color channels. And we have 3 classes (red (cat), green (dog), blue (ship) class). The shape of W is 3×4, b is 3×1, x_i is 4×1.

Notice:

1. When we calculate Wx_i in an implementation, images are often stored as rows of a data matrix, so the shapes of W and x_i may not be aligned for matrix multiplication. In that case we use transpose (.T) on one of the matrices to make the dimensions match.

2. The images are stretched into high-dimensional column vectors.

As we saw above, each row of the weight matrix is a classifier for one of the classes. We can interpret the score function as a linear equation:

  • In a linear equation, W plays the role of the slope. In the score function, where W is a matrix and b is a vector, each classifier can be seen as a hyperplane; as W changes, the hyperplane rotates in different directions.
  • In a linear equation, the parameter b represents the offset from the x-axis (the intercept y = b); if b changes, the new line is parallel to the original one. In the context of the score function, the bias b decides how far the hyperplane is translated, without changing its orientation.

Bias trick

It’s a little bit redundant to compute weight and bias separately. The common trick is to combine these two parameters into one. Meanwhile, we need to add one more dimension to x_i, which is constant 1. The new score function will be like this:

f(x_i, W) = Wx_i

With the CIFAR-10 example, x_i is now [3073×1] instead of [3072×1], and W is now [10×3073] instead of [10×3072]. The illustration below shows this more clearly:

We append the bias vector as an extra column of W. Meanwhile, x_i gains one extra dimension holding the constant 1.
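A minimal sketch of the bias trick (the random W, b, and x_i below are placeholders):

import numpy as np

W = np.random.randn(10, 3072)
b = np.random.randn(10, 1)
x_i = np.random.randn(3072, 1)

# bias trick: append b as an extra column of W, and a constant 1 to x_i
W_ext = np.hstack([W, b])          # shape (10, 3073)
x_ext = np.vstack([x_i, [[1.0]]])  # shape (3073, 1)

# both forms give the same scores
print(np.allclose(W.dot(x_i) + b, W_ext.dot(x_ext)))  # True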

Image preprocessing

In the example above, we use the raw pixel values, which range from [0…255]. In Machine Learning, we normally normalize the input features; here, each pixel can be thought of as a feature.

We subtract the mean from each feature, after which the pixel values range approximately from [-127, 127].

A further common preprocessing step is to scale each input feature so that its values range from [-1, 1].
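A minimal sketch of this preprocessing, assuming a data matrix X of raw pixel values in [0, 255] (the random X below is a placeholder):

import numpy as np

# placeholder raw data: N images, each flattened to 3072 pixel values in [0, 255]
X = np.random.randint(0, 256, size=(500, 3072)).astype(np.float64)

mean_image = np.mean(X, axis=0)   # per-feature (per-pixel) mean over the training set
X_centered = X - mean_image       # values now roughly in [-127, 127]
X_scaled = X_centered / 127.0     # values now roughly in [-1, 1]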

There are, however, some cases that are hard for a linear classifier, for example classes that cannot be separated by a single hyperplane.

Loss function

In the previous section, we defined a score function to calculate the class scores. Unfortunately, we received a terrible score for the class cat.

For the cat image, the cat class got a score of 2.9, which is smaller than the score for frog.

The bad score for the cat reminds us that we need to find a good W. The loss function helps us measure how good the scores are: the loss will be high if we’re doing a poor job, and low otherwise.

Let’s define this loss. Suppose there are N training examples. The dataset is \{(x_i, y_i)\}_{i=1}^N, where x_i is the i-th example and y_i is its label. The loss can be written as:

L = \frac{1}{N} \sum_i L_i(f(x_i, W), y_i)

First we compute the scores of x_i using the score function f(x_i, W). Then we compute each example’s loss L_i and sum them. Finally, the average of this sum is the overall loss.

Multiclass Support Vector Machine loss

The first loss we will discuss is the Multiclass Support Vector Machine loss. The SVM loss wants the correct class to score higher than every incorrect class by at least a margin \Delta.

Recall the process for computing L_i: we obtain the class scores first and store them in a vector s, so that s_j is the score of the j-th class and s_{y_i} is the score of the actual (correct) class. The formula is:

L_i = \sum_{j\neq y_i}max(0, s_j - s_{y_i} + \Delta )

Let’s take three classes, for example:

Here, we have 3 classes and 3 training examples. The scores in the first column, which corresponds to a cat image, are 3.2, 5.1, -1.7. The margin \Delta is set to 1.

L_1 = max(0, 5.1-3.2+1) + max(0, -1.7-3.2+1) = max(0, 2.9) + max(0, -3.9) = 2.9 + 0 = 2.9

L_2 = 0, L_3 = 12.9

L = (2.9 + 0 + 12.9)/3 = 5.27

From these results, we can see that the example that is classified correctly incurs a smaller loss than the other two.

The diagram shows the relation between the score of the correct class and the scores of the other classes.

In the SVM loss, we want the correct class’s score to be higher than every incorrect class’s score by at least the margin \Delta. If any other class’s score falls inside the red region, loss accumulates; otherwise, the loss is 0.

In this section, we use f(x_i;W) = Wx_i as the score function, so the loss function can be rewritten as:

L_i = \sum_{j\neq y_i} max(0, w_j^Tx_i - w_{y_i}^Tx_i + \Delta )

where w_j is the j-th row of W reshaped as a column vector, as discussed before.
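A minimal sketch of this per-example SVM loss (the random W and x below are placeholders; delta = 1 follows the example above):

import numpy as np

def svm_loss_i(x, y, W, delta=1.0):
    """Multiclass SVM loss for a single example x (D,) with correct label y."""
    scores = W.dot(x)                                    # (num_classes,)
    margins = np.maximum(0, scores - scores[y] + delta)  # hinge on every class
    margins[y] = 0                                       # the correct class contributes no loss
    return np.sum(margins)

# placeholder parameters and input
W = np.random.randn(10, 3073) * 0.001  # 10 classes, bias-trick dimension included
x = np.random.randn(3073)
print(svm_loss_i(x, y=3, W=W))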

There is a subtlety when we set the parameter W. Suppose there is a W that classifies every example in the dataset correctly, which means all L_i = 0. This W is not unique.

One easy way to see this is to multiply W by a parameter \lambda, where \lambda > 1: the loss is still zero. In other words, we want to add something to remove this ambiguity of W.

We can do this by extending the loss with a regularization penalty R(W).

1. The most common penalty is the L2 norm, which discourages large weights:

R(W) = \sum_k\sum_lW_{k,l}^2

2. If we want a sparse model, which means more zeros in W, the L1 norm might be the better choice:

R(W) = \sum_k\sum_l|W_{k,l}|

With the expressions above, we can build the new loss with a regularization penalty:

L = \underbrace{\frac{1}{N}\sum_iL_i}_{data\ loss} + \underbrace{\lambda R(W)}_{regularization\ loss}

or

L = \frac{1}{N}\sum_i\sum_{j\neq y_i}[max(0, f(x_i;W)_j - f(x_i;W)_{y_i} + \Delta)] + \lambda\sum_k\sum_lW_{k,l}^2

Let’s take an example to explain this more clearly.

Suppose the input vector is x = [1,1,1,1] and the weights are w_1 = [1,0,0,0], w_2 = [0.25,0.25,0.25,0.25]; then w_1^Tx = w_2^Tx = 1. Both weights give the same dot product, but the L2 penalty of w_1 is 1.0 while that of w_2 is 0.25. The penalty of w_2 is smaller, which means w_2 is preferred. Intuitively, the values in w_2 are smaller and more diffuse, which encourages the classifier to take all input dimensions into account rather than relying heavily on a few of them.
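A quick check of this example in code:

import numpy as np

x  = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])
w2 = np.array([0.25, 0.25, 0.25, 0.25])

print(w1.dot(x), w2.dot(x))          # 1.0 1.0  -> same score
print(np.sum(w1**2), np.sum(w2**2))  # 1.0 0.25 -> w2 has the smaller L2 penalty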

Hyperparameter choice of \Delta and \lambda

We haven’t talked about \Delta yet. In practice, it is safe to set \Delta = 1.0 in all cases.

Here, we got the formula of loss, next, we need to find the weight W that can minimize loss.

Softmax classifier

As a multi-class classifier, Softmax interprets the raw classifier scores as probabilities.

In the example above, the score function calculates the scores for the cat image. The softmax interprets these scores as probabilities using the formula below:

P(Y=k|X=x_i) = \frac{e^{s_k}}{\sum_je^{s_j}}

where s_j is the score of the j-th class and s_k is the score of class k, whose probability we are computing.

Then we use this formula to convert the scores we computed before into probabilities.

As we can see, the raw values are unnormalized: we need the class probabilities to sum to 1, so we normalize them. The per-example loss L_i is then:

L_i = -log(\frac{e^{s_{y_i}}}{\sum_je^{s_j}}) \ or \ L_i = -s_{y_i} + log\sum_je^{s_j}

The full loss L would be:

L = \frac{1}{N} \sum_i[-log(\frac{e^{s_{y_i}}}{\sum_je^{s_j}})] + \lambda R(W)

The normalized probabilities will then be:

After normalization, the probabilities sum to 1.
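A minimal sketch of the per-example cross-entropy loss (the scores below come from the earlier example; for clarity this version omits the numeric-stability shift discussed next):

import numpy as np

def softmax_loss_i(scores, y):
    """Cross-entropy loss for one example, given its raw class scores and correct label y."""
    probs = np.exp(scores) / np.sum(np.exp(scores))  # normalized class probabilities
    return -np.log(probs[y])                         # L_i = -log P(Y = y | X = x_i)

scores = np.array([3.2, 5.1, -1.7])  # the cat-image scores from the example above
print(softmax_loss_i(scores, y=0))   # loss when the correct class is the first one (cat)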

Practical issues: Numeric stability

When we compute them, the intermediate terms e^{s_{y_i}} and \sum_je^{s_j} might be very large due to the exponentials. We can multiply both the numerator and the denominator by a constant C:

\frac{e^{s_{y_i}}}{\sum_je^{s_j}} = \frac{Ce^{s_{y_i}}}{C\sum_je^{s_j}}=\frac{e^{s_{y_i} + logC}}{\sum_je^{s_j + logC}}

The common choice is to set logC = -max_js_j, which in code looks like:

s = np.array([123, 456, 789]) # the scores of 3 classes
probabilities = np.exp(s) / np.sum(np.exp(s)) # numerically unstable: the exponentials blow up

"""
instead: shift the values so the highest is 0
"""
s -= np.max(s) # s becomes [-666, -333, 0]
probabilities = np.exp(s) / np.sum(np.exp(s)) # safe to compute now


Name conventions:

1. SVM classifier uses hinge loss (max-margin loss).

L_i = \sum_{j\neq y_i}max(0, s_j-s_{y_i} + 1)

2. Softmax classifier uses cross-entropy loss.

L_i = -log (\frac{e^{s_{y_i}}}{\sum_je^{s_j}} )

SVM vs. Softmax

Here is an example of the difference between the SVM and Softmax classifiers for one datapoint.

Briefly, the Softmax classifier computes a probability for each class. How peaked or diffuse these probabilities are depends on the regularization strength: with stronger regularization the weights shrink, and the normalized output probabilities become more diffuse, approaching a uniform distribution.
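A small sketch comparing the two losses on the same (placeholder) score vector:

import numpy as np

scores = np.array([2.0, 4.0, 3.0])  # placeholder class scores for one datapoint
y = 2                               # assume the correct class is the last one

# SVM (hinge) loss with margin 1
margins = np.maximum(0, scores - scores[y] + 1.0)
margins[y] = 0
svm_loss = np.sum(margins)

# Softmax (cross-entropy) loss, with the numeric-stability shift
shifted = scores - np.max(scores)
probs = np.exp(shifted) / np.sum(np.exp(shifted))
softmax_loss = -np.log(probs[y])

print(svm_loss, softmax_loss)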

Reference:

1. Image Classification Problem

2. Linear Classification