
cs231n - assignment1 - softmax gradient derivation

Softmax exercise

Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. For more details see the assignments page on the course website.

This exercise is analogous to the SVM exercise. You will:

- implement a fully-vectorized loss function for the Softmax classifier

- implement the fully-vectorized expression for its analytic gradient

- check your implementation with numerical gradient

- use a validation set to tune the learning rate and regularization strength

- optimize the loss function with SGD

- visualize the final learned weights

As with linear_svm, the main difficulty is deriving the gradient, but the softmax gradient is a bit simpler.

First, the loss formula:

$$L = \frac{1}{N}\sum_i L_i + \lambda R(W) \tag{1}$$

There are $N$ samples in total, and sample $i$ contributes the loss $L_i$:

$$L_i = -\log p_{y_i} = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right) = -f_{y_i} + \log\sum_j e^{f_j} \tag{2}$$
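As a quick numerical illustration of (2) (a minimal sketch with made-up scores, not part of the assignment code), note that subtracting the maximum score before exponentiating leaves $L_i$ unchanged but avoids overflow:

import numpy as np

# hypothetical scores f for one sample over C = 3 classes; assume the true label is y_i = 0
f = np.array([3.2, 5.1, -1.7])
y_i = 0

f_shifted = f - np.max(f)   # shifting all scores by a constant leaves L_i unchanged
L_i = -f_shifted[y_i] + np.log(np.sum(np.exp(f_shifted)))
print(L_i)                  # equals -log(e^{f_{y_i}} / sum_j e^{f_j}), computed stably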

For each sample $X_i$, because the softmax denominator sums over all the $f_j$, the derivative of $L_i$ with respect to $W$ has a contribution in every column of $W$; that is, $\partial L_i / \partial W_j$ is nonzero for all $j$:

When $j \neq y_i$:

$$\frac{\partial L_i}{\partial W_j} = \frac{e^{f_j}}{\sum_j e^{f_j}}\,\frac{\partial f_j}{\partial W_j} = \frac{e^{f_j}}{\sum_j e^{f_j}}\,X_i^T \tag{3}$$

When $j = y_i$:

$$\frac{\partial L_i}{\partial W_j} = \left(\frac{e^{f_j}}{\sum_j e^{f_j}} - 1\right)\frac{\partial f_j}{\partial W_j} = \frac{e^{f_j}}{\sum_j e^{f_j}}\,X_i^T - X_i^T \tag{4}$$

Computing the loss for every sample, summing over all of them, and adding the regularization term gives the final loss.
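As a sanity check on (3) and (4), the following sketch (hypothetical tiny $W$ and $X_i$, not part of the assignment code) assembles the per-sample gradient column by column and compares it against a numerical gradient:

import numpy as np

# hypothetical tiny problem: D = 4 features, C = 3 classes, one sample x with label y_i
np.random.seed(0)
W = np.random.randn(4, 3) * 0.01
x = np.random.randn(4)
y_i = 1

def loss_i(W):
    f = x.dot(W)
    f -= np.max(f)
    return -f[y_i] + np.log(np.sum(np.exp(f)))

# analytic gradient from (3) and (4)
f = x.dot(W)
f -= np.max(f)
p = np.exp(f) / np.sum(np.exp(f))
dW = np.outer(x, p)     # column j gets (e^{f_j} / sum_j e^{f_j}) * x, eq. (3)
dW[:, y_i] -= x         # the correct-class column gets an extra -x, eq. (4)

# central-difference numerical gradient
h = 1e-5
num_dW = np.zeros_like(W)
for idx in np.ndindex(*W.shape):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[idx] += h
    W_minus[idx] -= h
    num_dW[idx] = (loss_i(W_plus) - loss_i(W_minus)) / (2 * h)

print(np.max(np.abs(dW - num_dW)))   # should be on the order of 1e-9 or smaller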

The derivation above writes the loss explicitly in terms of $W$ and differentiates with respect to $W$ directly. That works in this simple example, but once the network gets more complicated it becomes hard to write down the derivative of the loss with respect to the quantity of interest in closed form. A better approach is to apply the chain rule and differentiate stage by stage:

$$p_k = \frac{e^{f_k}}{\sum_j e^{f_j}}, \qquad L_i = -\log p_{y_i} \tag{5}$$

Here $f_k$ is the class score fed into the softmax (the output of the last fully connected layer). From formula (2) above, the derivative of the loss with respect to $f_k$ is:

$$\frac{\partial L_i}{\partial f_k} = p_k - \mathbb{1}(y_i = k) \tag{6}$$

This says that the derivative of the loss with respect to the score $f_k$ is $p_k$, with an extra 1 subtracted when $k = y_i$.

Rewriting (6) in vector form:

$$\frac{\partial L_i}{\partial f} = p - [0, \dots, 1, \dots, 0] \quad \text{(1 in the } y_i\text{-th position)} \tag{6a}$$
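For example, if $p = (0.2, 0.5, 0.3)$ and the correct class is the second one, then (6a) gives $\partial L_i / \partial f = (0.2, 0.5, 0.3) - (0, 1, 0) = (0.2, -0.5, 0.3)$.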

Now consider the second fully connected layer, the one that feeds directly into the softmax. Its input is the hidden-layer output hidden_layer ($1 \times H$), so the softmax input is f = hidden_layer.dot(W2) + b2. Checking dimensions: $f$ is a $C$-dimensional vector, $W_2$ is an $H \times C$ matrix, and $b_2$ is a $C$-dimensional vector, so everything is consistent. We can now compute the derivative of $f$ with respect to $W_2$:

$$\frac{\partial f}{\partial W} = \text{hidden\_layer}^T \quad \text{(dimension } H \times 1\text{)} \tag{7}$$

As you can see, $\partial f / \partial W$ is simply the input vector of the fully connected layer.

Combining the results above:

$$\frac{\partial L_i}{\partial W} = \frac{\partial f}{\partial W}\,\frac{\partial L_i}{\partial f} \tag{8}$$

Finally, written in matrix form for all $N$ samples:

$$\frac{\partial L}{\partial f} = p_{[N\times C]} - \text{MaskMat}_{[N\times C]} \tag{6m}$$

$$\frac{\partial f}{\partial W} = \text{hidden\_layer}^T \quad (\text{hidden\_layer is } N \times H) \tag{7m}$$

$$\frac{\partial L}{\partial W} = \frac{\partial f}{\partial W}\,\frac{\partial L}{\partial f} \quad [H \times C] \tag{8m}$$
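Putting (6m)-(8m) together, the backward pass through this fully connected layer is only a few lines. The sketch below uses hypothetical sizes and random data (hidden_layer, W2, b2, y are stand-ins, not the assignment's variables) just to make the shapes concrete:

import numpy as np

# hypothetical sizes: N samples, H hidden units, C classes
N, H, C = 5, 10, 3
np.random.seed(0)
hidden_layer = np.random.randn(N, H)       # output of the hidden layer
W2 = np.random.randn(H, C) * 0.01
b2 = np.zeros(C)
y = np.random.randint(C, size=N)

# forward pass: class scores and softmax probabilities
f = hidden_layer.dot(W2) + b2                                  # (N, C)
f -= np.max(f, axis=1, keepdims=True)                          # numerical stability
probs = np.exp(f) / np.sum(np.exp(f), axis=1, keepdims=True)

# backward pass following (6m)-(8m)
dscores = probs.copy()
dscores[np.arange(N), y] -= 1        # p - MaskMat, eq. (6m)
dscores /= N                         # average over the N samples
dW2 = hidden_layer.T.dot(dscores)    # (H, N) x (N, C) = (H, C), eq. (8m)
db2 = np.sum(dscores, axis=0)        # gradient with respect to the bias b2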

MaskMat in (6m) is stacked from the $N$ vectors of (6a), one row per sample; its concrete form can be seen in the following Python code:

# compute the gradient on scores
  dscores = probs
  dscores[range(num_examples), y] -= 1
           
# softmax.py
import numpy as np
from random import shuffle

def softmax_loss_naive(W, X, y, reg):
  """
  Softmax loss function, naive implementation (with loops)

  Inputs have dimension D, there are C classes, and we operate on minibatches
  of N examples.

  Inputs:
  - W: A numpy array of shape (D, C) containing weights.
  - X: A numpy array of shape (N, D) containing a minibatch of data.
  - y: A numpy array of shape (N,) containing training labels; y[i] = c means
    that X[i] has label c, where 0 <= c < C.
  - reg: (float) regularization strength

  Returns a tuple of:
  - loss as single float
  - gradient with respect to weights W; an array of same shape as W
  """
  # Initialize the loss and gradient to zero.
  loss = 0.0
  dW = np.zeros_like(W)

  #############################################################################
  # TODO: Compute the softmax loss and its gradient using explicit loops.     #
  # Store the loss in loss and the gradient in dW. If you are not careful     #
  # here, it is easy to run into numeric instability. Don't forget the        #
  # regularization!                                                           #
  #############################################################################
  num_train = X.shape[0]
  num_classes = W.shape[1]
  for i in xrange(num_train):
    scores = X[i].dot(W) 
    scores -= np.max(scores) #prevents numerical instability
    correct_class_score = scores[y[i]]

    exp_sum = np.sum(np.exp(scores))
    loss += np.log(exp_sum) - correct_class_score

    dW[:, y[i]] -= X[i]
    for j in xrange(num_classes):
      dW[:,j] += (np.exp(scores[j]) / exp_sum) * X[i]

  loss /= num_train
  loss += 0.5 * reg * np.sum(W * W)
  dW /= num_train
  dW += reg * W

  #############################################################################
  #                          END OF YOUR CODE                                 #
  #############################################################################

  return loss, dW


def softmax_loss_vectorized(W, X, y, reg):
  """
  Softmax loss function, vectorized version.

  Inputs and outputs are the same as softmax_loss_naive.
  """
  # Initialize the loss and gradient to zero.
  loss = 0.0
  dW = np.zeros_like(W)

  #############################################################################
  # TODO: Compute the softmax loss and its gradient using no explicit loops.  #
  # Store the loss in loss and the gradient in dW. If you are not careful     #
  # here, it is easy to run into numeric instability. Don't forget the        #
  # regularization!                                                           #
  #############################################################################
  num_train = X.shape[0]
  num_classes = W.shape[1]

  scores = X.dot(W)
  scores -= np.max(scores, axis=1)[:, np.newaxis]
  exp_scores = np.exp(scores)
  sum_exp_scores = np.sum(exp_scores, axis=1)
  correct_class_score = scores[range(num_train), y]

  loss = np.sum(np.log(sum_exp_scores)) - np.sum(correct_class_score)

  exp_scores = exp_scores / sum_exp_scores[:,np.newaxis]

  # this loop could be replaced by a single matrix multiply (see the sketch after this listing)
  for i in xrange(num_train):
    dW += exp_scores[i] * X[i][:,np.newaxis]
    dW[:, y[i]] -= X[i]

  loss /= num_train
  loss += 0.5 * reg * np.sum(W * W)
  dW /= num_train
  dW += reg * W
  #############################################################################
  #                          END OF YOUR CODE                                 #
  #############################################################################

  return loss, dW
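The per-sample loop in softmax_loss_vectorized above can be collapsed into a single matrix multiply, exactly as in (6m)-(8m) with X playing the role of the hidden layer. Below is a minimal standalone sketch of that replacement, using hypothetical small X, y, W rather than the assignment's data:

import numpy as np

# hypothetical small problem: N = 5 samples, D = 4 features, C = 3 classes
np.random.seed(0)
X = np.random.randn(5, 4)
y = np.random.randint(3, size=5)
W = np.random.randn(4, 3) * 0.01
num_train = X.shape[0]

scores = X.dot(W)
scores -= np.max(scores, axis=1, keepdims=True)
probs = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)

dscores = probs.copy()
dscores[range(num_train), y] -= 1     # p - MaskMat, eq. (6m)
dW = X.T.dot(dscores) / num_train     # (D, N) x (N, C) = (D, C), no Python loop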
           
# softmax.ipynb
# Use the validation set to tune hyperparameters (regularization strength and
# learning rate). You should experiment with different ranges for the learning
# rates and regularization strengths; if you are careful you should be able to
# get a classification accuracy of over 0.35 on the validation set.
from cs231n.classifiers import Softmax
results = {}
best_val = -1
best_softmax = None
learning_rates = [1e-7, 5e-7, 1e-6]           # example search range; tune as needed
regularization_strengths = [1e4, 5e4, 1e5]    # example search range; tune as needed

################################################################################
# TODO:                                                                        #
# Use the validation set to set the learning rate and regularization strength. #
# This should be identical to the validation that you did for the SVM; save    #
# the best trained softmax classifer in best_softmax.                          #
################################################################################
params = [(x,y) for x in learning_rates for y in regularization_strengths ]
for lrate, regular in params:
    softmax = Softmax()
    loss_hist = softmax.train(X_train, y_train, learning_rate=lrate, reg=regular,
                              num_iters=1500, verbose=True)
    y_train_pred = softmax.predict(X_train)
    accuracy_train = np.mean( y_train == y_train_pred)
    y_val_pred = softmax.predict(X_val)
    accuracy_val = np.mean(y_val == y_val_pred)
    results[(lrate, regular)] = (accuracy_train, accuracy_val)
    if(best_val < accuracy_val):
        best_val = accuracy_val
        best_softmax = softmax
################################################################################
#                              END OF YOUR CODE                                #
################################################################################

# Print out results.
for lr, reg in sorted(results):
    train_accuracy, val_accuracy = results[(lr, reg)]
    print 'lr %e reg %e train accuracy: %f val accuracy: %f' % (
                lr, reg, train_accuracy, val_accuracy)

print 'best validation accuracy achieved during cross-validation: %f' % best_val
           
