CS231n-Lecture Note02-Optimization

Recall:

In the last lecture, we learned about image classifiers: nearest neighbor and k-NN, and used cross-validation to tune hyperparameters. Because of the drawbacks of nearest neighbor and k-NN, we implemented a linear classifier, which reduces the cost of computation. To calculate each class's score, we implemented a score function. The loss function compares the predicted class with the actual class. Two different losses were discussed: hinge loss (SVM) and cross-entropy loss (softmax). The difference between the two is that cross-entropy loss interprets the class scores as probabilities.

When searching for the best W, we also add a penalty term to the loss, which is called regularization. The common penalties are the L1 norm and the L2 norm.
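As a refresher, here is a minimal sketch (not code from the lecture) of how those pieces fit together for a single example. The shapes match CIFAR-10 with the bias trick (10 classes, 3073-dimensional input); the margin Delta = 1.0 and the regularization strength are illustrative assumptions:

import numpy as np

W = np.random.randn(10, 3073) * 0.001 # small random weights
x = np.random.randn(3073) # one flattened image (bias dimension included)
y = 3 # assumed ground-truth class index

scores = W.dot(x) # score function: one score per class

# hinge loss (SVM) with margin Delta = 1.0
margins = np.maximum(0, scores - scores[y] + 1.0)
margins[y] = 0 # the correct class contributes no margin
svm_loss = np.sum(margins)

# cross-entropy loss (softmax): scores -> probabilities -> negative log-likelihood
shifted = scores - np.max(scores) # shift for numeric stability
probs = np.exp(shifted) / np.sum(np.exp(shifted))
softmax_loss = -np.log(probs[y])

# L2 regularization penalty on W (strength 0.1 is an arbitrary choice here)
reg_loss = 0.1 * np.sum(W * W)

print('SVM loss: %f, softmax loss: %f' % (svm_loss + reg_loss, softmax_loss + reg_loss))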

Optimization

In the recall above, we used the loss function to quantify the quality of W. The next step is to minimize the loss.

Briefly:

Loss function: quantify the quality of W

Optimization: minimize the loss

The main thing we are doing is finding the best ‘W’.

We often use going downhill as a metaphor for the search for W. When walking down a mountain, we want the fastest way down, so the question is which direction to step in. There are two strategies: one is random search, and the other is to follow the slope.

Random search is apparently a bad idea for this problem: it wastes a lot of time, because we would need an enormous number of trials before stumbling on a good W. It's expensive.
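To make the cost concrete, here is a minimal sketch of random search; it assumes the loss function L(X_train, Y_train, W) and the training data from the CIFAR-10 example later in this note:

bestloss = float('inf') # best loss seen so far
bestW = None
for num in range(1000):
  W = np.random.randn(10, 3073) * 0.0001 # a completely random guess
  loss = L(X_train, Y_train, W) # evaluate how good this guess is
  if loss < bestloss:
    bestloss = loss
    bestW = W
# every guess ignores local information about which direction lowers
# the loss, so even thousands of trials make very little progress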

Let's discuss how to search for W using the follow-the-slope idea.

Computing the gradient

Numeric gradient

In one dimension, the derivative of a function is:

\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}

[Figure: a 1-D function with a tangent line at a point, illustrating the slope definition above.]

Here W is a vector; we add h to one dimension at a time and recompute the loss to get each component of the gradient dW.

For the first dimension, we choose h = 0.0001; the loss after adding h is 1.25322, and the formula gives dW = -2.5 for that dimension (the original loss is 1.25347, so (1.25322 - 1.25347) / 0.0001 = -2.5). We then loop over all dimensions to compute the rest of dW.

In code, this looks like:

import numpy as np

def eval_numerical_gradient(f, x):
  """
  a naive implementation of numerical gradient of f at x
  - f should be a function that takes a single argument
  - x is the point (numpy array) to evaluate the gradient at
  """

  fx = f(x) # evaluate function value at original point
  grad = np.zeros(x.shape)
  h = 0.00001

  # iterate over all indexes in x
  it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
  while not it.finished:

    # evaluate function at x + h
    ix = it.multi_index
    old_value = x[ix]
    x[ix] = old_value + h # increment by h
    fxh = f(x) # evaluate f(x + h)
    x[ix] = old_value # restore to previous value (very important!)

    #compute the partial derivative
    grad[ix] = (fxh - fx) / h # the slope
    it.iternext() # step to next dimension

  return grad

In practice, using the centered difference formula \frac{f(x + h) - f(x - h)}{2h} works better: its error shrinks as O(h^2) rather than O(h).
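As an illustration (not code from the lecture), here is a centered-difference variant of the helper above with the same interface:

def eval_numerical_gradient_centered(f, x, h=1e-5):
  """
  centered-difference numerical gradient of f at x
  """
  grad = np.zeros(x.shape)
  it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
  while not it.finished:
    ix = it.multi_index
    old_value = x[ix]
    x[ix] = old_value + h
    fxph = f(x) # evaluate f(x + h)
    x[ix] = old_value - h
    fxmh = f(x) # evaluate f(x - h)
    x[ix] = old_value # restore to previous value (very important!)
    grad[ix] = (fxph - fxmh) / (2 * h) # the centered slope
    it.iternext() # step to next dimension
  return grad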

Let's use the CIFAR-10 loss to see the numerical gradient in action.

# to use the generic code above we want a function that takes a single argument
# (the weights in our case) so we close over X_train and Y_train
def CIFAR10_loss_fun(W):
  return L(X_train, Y_train, W)

W = np.random.rand(10, 3073) * 0.001 # random weight matrix
df = eval_numerical_gradient(CIFAR10_loss_fun, W) # get the gradient

loss_original = CIFAR10_loss_fun(W) # the original loss
print('original loss: %f' % (loss_original, ))

# lets see the effect of multiple step sizes
for step_size_log in [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1]:
  step_size = 10 ** step_size_log
  W_new = W - step_size * df # new position in the weight space
  loss_new = CIFAR10_loss_fun(W_new)
  print('for step size %e new loss: %f' % (step_size, loss_new))

# prints:
# original loss: 2.200718
# for step size 1.000000e-10 new loss: 2.200652
# for step size 1.000000e-09 new loss: 2.200057
# for step size 1.000000e-08 new loss: 2.194116
# for step size 1.000000e-07 new loss: 2.135493
# for step size 1.000000e-06 new loss: 1.647802
# for step size 1.000000e-05 new loss: 2.844355
# for step size 1.000000e-04 new loss: 25.558142
# for step size 1.000000e-03 new loss: 254.086573
# for step size 1.000000e-02 new loss: 2539.370888
# for step size 1.000000e-01 new loss: 25392.214036

In this code, we try 10 different step sizes and compare the resulting losses.

Notice:

  • The weight update goes in the negative gradient direction, because we want the loss function to decrease, not increase.
  • The step size (learning rate) determines how fast we descend: too small and progress is slow, too large and the update overshoots so the loss blows up, as the output above shows.

Computing the gradient numerically is too slow and costly: it requires one loss evaluation per parameter, per update. The alternative is the analytic gradient.

Analytic gradient

With the analytic gradient, we compute the gradient directly with calculus.

Two things to note about this approach:

1. It is exact and fast to compute, because calculus gives us a direct formula.

2. It is more error-prone to implement. In practice, we therefore run a gradient check: compute the analytic gradient and compare it with the numerical one.
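For instance, here is a minimal gradient check on a toy function whose gradient we know from calculus (f(x) = sum of squares, gradient 2x); this is an illustration, not the lecture's code:

def toy_loss(x):
  return np.sum(x ** 2) # analytic gradient is 2x

x = np.random.randn(10, 5)
grad_analytic = 2 * x # from calculus
grad_numeric = eval_numerical_gradient(toy_loss, x) # the helper from above

# compare with a relative error; a small value means the two agree
rel_error = np.max(np.abs(grad_analytic - grad_numeric) /
                   np.maximum(1e-8, np.abs(grad_analytic) + np.abs(grad_numeric)))
print('max relative error: %e' % rel_error) # forward difference gives O(h) error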

Let's take the SVM loss as an example:

L_i = \sum_{j \neq y_i} \max(0, w_j^T x_i - w_{y_i}^T x_i + \Delta)

After differentiating, the gradient with respect to the correct class's row w_{y_i} is:

\nabla_{w_{y_i}} L_i = - \left( \sum_{j \neq y_i} \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0) \right) x_i

where \mathbb{1}(\cdot) is the indicator function that is one if the condition inside is true and zero otherwise.

Notice:

This gradient only applies to the row of W that corresponds to the correct class. For the other rows, where j \neq y_i, the gradient is:

\nabla_{w_j} L_i = \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0) \, x_i
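To connect the formulas back to code, here is a minimal per-example sketch (Delta = 1.0; a plain loop version under the same shape assumptions as the CIFAR-10 example, not an optimized implementation):

def svm_loss_and_grad_single(W, x, y, delta=1.0):
  """
  SVM loss L_i and its gradient with respect to W for one example (x, y)
  """
  scores = W.dot(x) # class scores, shape (num_classes,)
  loss = 0.0
  dW = np.zeros(W.shape)
  for j in range(W.shape[0]):
    if j == y:
      continue # the sum skips the correct class
    margin = scores[j] - scores[y] + delta
    if margin > 0: # the indicator condition from the formulas
      loss += margin
      dW[j] += x # row j gets +x_i for each violated margin
      dW[y] -= x # row y_i accumulates -(number of violated margins) * x_i
  return loss, dW

The row for the correct class receives one -x_i contribution per violated margin, which is exactly the summed indicator in the first gradient formula.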
Gradient Descent

The procedure of repeatedly evaluating the gradient and performing a parameter update is called gradient descent.

Vanilla Gradient Descent

while True:
  weights_grad = evaluate_gradient(loss_fun, data, weights)
  weights += - step_size * weights_grad # perform parameter update
Mini-batch Gradient Descent

With a huge dataset, it is wasteful to use every training example for each single parameter update. A common approach is to compute the gradient over batches of the training data.

For example, CIFAR-10 has 50,000 training samples; we can take 256 samples as a batch.

while True:
  data_batch = sample_training_data(data, 256) # sample 256 examples
  weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
  weights += - step_size * weights_grad # perform parameter update

There is an extreme case where the mini-batch contains only a single example (batch size 1 in the code above). This is called Stochastic Gradient Descent (SGD), though in practice the term SGD is often used for mini-batch gradient descent as well.

As for other optimization methods, I have covered them in a separate article.

