CS224N Problem Practice: Assignment 1.1 & 1.2 (Softmax & Neural Network Basics)

Assignment #1

1. Softmax

(a) Prove that softmax is invariant to constant offsets in the input, i.e., for any input vector $x$ and any constant $c$,

$$\mathrm{softmax}(x) = \mathrm{softmax}(x + c),$$

where $x + c$ means adding the constant $c$ to every dimension of $x$. Recall that

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}.$$

Note: in practice, we exploit this property for numerical stability by choosing $c = -\max_i x_i$ when computing softmax probabilities (i.e., subtracting the maximum element of $x$ from all of its elements).
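
One way to write out the proof, as a short sketch that uses only the definition $\mathrm{softmax}(x)_i = e^{x_i} / \sum_j e^{x_j}$: the factor $e^{c}$ appears in both the numerator and the denominator and cancels.

```latex
\begin{aligned}
\mathrm{softmax}(x + c)_i
  &= \frac{e^{x_i + c}}{\sum_j e^{x_j + c}}
   = \frac{e^{c}\, e^{x_i}}{e^{c} \sum_j e^{x_j}}
   = \frac{e^{x_i}}{\sum_j e^{x_j}}
   = \mathrm{softmax}(x)_i .
\end{aligned}
```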

(b) Given an input matrix with N rows and D columns, compute the softmax prediction for each row using the optimization from part (a).

import numpy as np


def softmax(x):
    """Compute the softmax function for each row of the input x.

    It is crucial that this function is optimized for speed because
    it will be used frequently in later code. You might find numpy
    functions np.exp, np.sum, np.reshape, np.max, and numpy
    broadcasting useful for this task.

    Numpy broadcasting documentation:
    http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html

    You should also make sure that your code works for a single
    D-dimensional vector (treat the vector as a single row) and
    for N x D matrices. This may be useful for testing later. Also,
    make sure that the dimensions of the output match the input.

    You must implement the optimization in problem 1(a) of the
    written assignment!

    Arguments:
    x -- A D dimensional vector or N x D dimensional numpy matrix.

    Return:
    x -- You are allowed to modify x in-place
    """
    orig_shape = x.shape
    
    # YOUR CODE HERE
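    if len(x.shape) > 1:
        # Matrix: subtract each row's maximum (axis=1) for numerical stability
        x -= np.max(x, axis=1, keepdims=True)
        exp_x = np.exp(x)  # compute the exponentials only once
        x = exp_x / np.sum(exp_x, axis=1, keepdims=True)
    else:
        # Vector: subtract the single maximum element
        x -= np.max(x)
        exp_x = np.exp(x)
        x = exp_x / np.sum(exp_x)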
    # END YOUR CODE
    
    assert x.shape == orig_shape
    return x
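
A quick way to see why the max-subtraction matters is to check the shift invariance numerically. The sketch below is not the assignment's own test code. It uses a hypothetical helper, `stable_softmax`, that works on a copy of its input, and the sample values are chosen only for illustration.

```python
import numpy as np


def stable_softmax(x):
    """Row-wise softmax with the max-subtraction trick (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x, axis=-1, keepdims=True)  # largest entry becomes 0
    exp_x = np.exp(shifted)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)


# Shifting every entry by a constant leaves the result unchanged.
x = np.array([1.0, 2.0, 3.0])
assert np.allclose(stable_softmax(x), stable_softmax(x + 100.0))

# Large inputs do not overflow, and each row still sums to 1.
big = np.array([[1001.0, 1002.0], [3.0, 4.0]])
probs = stable_softmax(big)
assert np.allclose(probs.sum(axis=1), 1.0)
assert np.allclose(probs[0], probs[1])  # the rows differ only by a constant offset
```

Without the shift, `np.exp(1001.0)` overflows to `inf` and the division produces `nan`, which is the failure the trick is meant to avoid.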

2. Neural Network Basics

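(a) Derive the gradient of the sigmoid function and show that it can be rewritten as a function of the function value (i.e., an expression containing only $\sigma(x)$, not $x$). Assume the input $x$ is a scalar for this problem. Recall that the sigmoid function is

$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$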
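
A sketch of the derivation, starting from $\sigma(x) = 1 / (1 + e^{-x})$ and using the identity $e^{-x} / (1 + e^{-x}) = 1 - \sigma(x)$:

```latex
\begin{aligned}
\sigma'(x)
  &= \frac{e^{-x}}{(1 + e^{-x})^2}
   = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}
   = \sigma(x)\,\bigl(1 - \sigma(x)\bigr).
\end{aligned}
```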

(b) Derive the gradient of the softmax function when cross-entropy loss is used for evaluation, i.e., find the gradient with respect to the softmax input vector $\theta$ when the prediction is

$$\hat{y} = \mathrm{softmax}(\theta).$$

Recall that the cross-entropy function is

$$CE(y, \hat{y}) = -\sum_i y_i \log(\hat{y}_i),$$

where $y$ is the one-hot label vector and $\hat{y}$ is the predicted probability vector over all classes.
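
One way to work this out, as a sketch: with $\hat{y} = \mathrm{softmax}(\theta)$ and $CE(y, \hat{y}) = -\sum_i y_i \log(\hat{y}_i)$, let $k$ be the index of the true class, so that $y_k = 1$ and all other entries of $y$ are 0 (the index name $k$ is introduced here only for the derivation).

```latex
\begin{aligned}
CE(y, \hat{y}) &= -\log \hat{y}_k = -\theta_k + \log \sum_j e^{\theta_j}, \\
\frac{\partial CE}{\partial \theta_i}
  &= -\mathbb{1}[i = k] + \frac{e^{\theta_i}}{\sum_j e^{\theta_j}}
   = \hat{y}_i - y_i, \\
\frac{\partial CE}{\partial \theta} &= \hat{y} - y .
\end{aligned}
```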

(c) Derive the gradient with respect to the input $x$ of a one-hidden-layer neural network (that is, find $\frac{\partial J}{\partial x}$, where $J = CE(y, \hat{y})$ is the cost function of the network). The network uses the sigmoid activation function in the hidden layer and softmax in the output layer. Assume that $y$ is a one-hot label vector and that the cross-entropy cost is used. (You may use $\sigma'(x)$ as shorthand for the sigmoid gradient.)

Recall that forward propagation is as follows:

$$h = \mathrm{sigmoid}(x W_1 + b_1), \qquad \hat{y} = \mathrm{softmax}(h W_2 + b_2).$$
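
A sketch of one way to carry out the derivation, assuming $x$, $h$, and $\hat{y}$ are row vectors with $h = \mathrm{sigmoid}(x W_1 + b_1)$ and $\hat{y} = \mathrm{softmax}(h W_2 + b_2)$. The intermediate names $z_1 = x W_1 + b_1$ and $z_2 = h W_2 + b_2$ are introduced here and are not part of the problem statement.

```latex
\begin{aligned}
\frac{\partial J}{\partial z_2} &= \hat{y} - y, \\
\frac{\partial J}{\partial h} &= \frac{\partial J}{\partial z_2}\, W_2^\top, \\
\frac{\partial J}{\partial z_1} &= \frac{\partial J}{\partial h} \circ \sigma'(z_1), \\
\frac{\partial J}{\partial x} &= \frac{\partial J}{\partial z_1}\, W_1^\top .
\end{aligned}
```

Here $\circ$ is the elementwise product, and $\sigma'(z_1) = \sigma(z_1)\,(1 - \sigma(z_1))$.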

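(d) How many parameters are there in this neural network, assuming the input is $D_x$-dimensional, the output is $D_y$-dimensional, and there are $H$ hidden units?

Counting the entries of $W_1$ ($D_x \times H$), $b_1$ ($H$), $W_2$ ($H \times D_y$), and $b_2$ ($D_y$) gives

$$(D_x + 1) \cdot H + (H + 1) \cdot D_y.$$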
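
The count can be checked against the flat parameter vector that `forward_backward_prop` unpacks in the order W1, b1, W2, b2. The sketch below uses a hypothetical helper, `count_params`, and illustrative sizes that are not taken from the original post.

```python
import numpy as np


def count_params(Dx, H, Dy):
    """Entries in W1 (Dx x H), b1 (H), W2 (H x Dy), and b2 (Dy)."""
    return (Dx + 1) * H + (H + 1) * Dy


# Hypothetical sizes, chosen only for illustration.
Dx, H, Dy = 10, 5, 10
params = np.random.randn(count_params(Dx, H, Dy))

# Slice the flat vector in the order W1, b1, W2, b2.
ofs = 0
W1 = params[ofs:ofs + Dx * H].reshape(Dx, H)
ofs += Dx * H
b1 = params[ofs:ofs + H].reshape(1, H)
ofs += H
W2 = params[ofs:ofs + H * Dy].reshape(H, Dy)
ofs += H * Dy
b2 = params[ofs:ofs + Dy].reshape(1, Dy)
ofs += Dy

# Every entry of the flat vector is used exactly once.
assert ofs == params.size
```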

(e) Implement the sigmoid activation function and its gradient.

def sigmoid(x):
    """
    Compute the sigmoid function for the input here.

    Arguments:
    x -- A scalar or numpy array.

    Return:
    s -- sigmoid(x)
    """

    # YOUR CODE HERE
    s = 1 / (1 + np.exp(-x))
    # END YOUR CODE
    
    return s


def sigmoid_grad(s):
    """
    Compute the gradient for the sigmoid function here. Note that
    for this implementation, the input s should be the sigmoid
    function value of your original input x.

    Arguments:
    s -- A scalar or numpy array.

    Return:
    ds -- Your computed gradient.
    """
    
    # YOUR CODE HERE
    ds = s * (1 - s)
    # END YOUR CODE
    
    return ds
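
Note that `sigmoid_grad` takes the sigmoid output `s`, not the original input `x`. The following sketch restates both functions so that it runs on its own and compares the analytic gradient with a centered finite difference. The sample input and the tolerance are assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def sigmoid_grad(s):
    return s * (1 - s)


x = np.array([[1.0, 2.0], [-1.0, -2.0]])
s = sigmoid(x)
analytic = sigmoid_grad(s)  # pass sigmoid(x), not x itself

# Centered difference: (sigmoid(x + h) - sigmoid(x - h)) / 2h
h = 1e-4
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

assert np.allclose(analytic, numeric, atol=1e-7)
# The sigmoid gradient is symmetric: sigma'(x) == sigma'(-x).
assert np.allclose(analytic[0], analytic[1])
```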

(f) To make debugging easier, implement a gradient checker.

import random

import numpy as np


# First implement a gradient checker by filling in the following functions
def gradcheck_naive(f, x):
    """ Gradient check for a function f.

    Arguments:
    f -- a function that takes a single argument and outputs the
         cost and its gradients
    x -- the point (numpy array) to check the gradient at
    """

    # Snapshot the random generator's internal state
    rndstate = random.getstate()
    # Restore a state previously obtained from getstate(), so that every
    # call to f sees the same randomness
    random.setstate(rndstate)
    fx, grad = f(x) # Evaluate function value at original point
    h = 1e-4        # Do not change this!

    # Iterate over all indexes ix in x to check the gradient.
    # nditer provides a flexible way to iterate over one or more arrays
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index

        # Try modifying x[ix] with h defined above to compute numerical
        # gradients (numgrad).

        # Use the centered difference of the gradient.
        # It has smaller asymptotic error than forward / backward difference
        # methods. If you are curious, check out here:
        # https://math.stackexchange.com/questions/2326181/when-to-use-forward-or-central-difference-approximations

        # Make sure you call random.setstate(rndstate)
        # before calling f(x) each time. This will make it possible
        # to test cost functions with built in randomness later.

        # YOUR CODE HERE
        x[ix] += h  # move to x + h
        random.setstate(rndstate)
        f1 = f(x)[0]  # f(x + h)

        x[ix] -= 2 * h  # move to x - h
        random.setstate(rndstate)
        f2 = f(x)[0]  # f(x - h)

        numgrad = (f1 - f2) / (2 * h)  # (f(x + h) - f(x - h)) / 2h

        x[ix] += h  # restore x[ix] to its original value
        # END YOUR CODE
        
        # Compare gradients
        reldiff = abs(numgrad - grad[ix]) / max(1, abs(numgrad), abs(grad[ix]))
        if reldiff > 1e-5:
            print("Gradient check failed.")
            print("First gradient error found at index %s" % str(ix))
            print("Your gradient: %f \t Numerical gradient: %f" % (
                grad[ix], numgrad))
            return

        it.iternext() # Step to next dimension

    print("Gradient check passed!")
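
The centered-difference idea can be tried on a function whose gradient is known in closed form. The sketch below is a simplified stand-in rather than the checker above: `numerical_gradient` is a hypothetical helper that returns the numerical gradient instead of printing a verdict, and it is applied to f(x) = sum(x^2), whose gradient is 2x.

```python
import numpy as np


def numerical_gradient(f, x, h=1e-4):
    """Centered-difference gradient of a scalar function f at x (hypothetical helper)."""
    grad = np.zeros_like(x, dtype=float)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h
        f_plus = f(x)   # f(x + h) along this coordinate
        x[ix] = old - h
        f_minus = f(x)  # f(x - h) along this coordinate
        x[ix] = old     # restore the entry before moving on
        grad[ix] = (f_plus - f_minus) / (2 * h)
        it.iternext()
    return grad


# f(x) = sum(x^2) has the analytic gradient 2x.
x = np.random.randn(4, 5)
numeric = numerical_gradient(lambda v: np.sum(v ** 2), x)
assert np.allclose(numeric, 2 * x, atol=1e-6)
```

Saving and restoring each entry exactly (instead of adding and subtracting `h`) avoids accumulating rounding error in `x` across iterations.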

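(g) Implement the forward and backward passes for a neural network with one sigmoid hidden layer.

Derivation:

For a single example (a row $x$ with one-hot label $y$), let $z_1 = x W_1 + b_1$, $h = \sigma(z_1)$, $z_2 = h W_2 + b_2$, and $\hat{y} = \mathrm{softmax}(z_2)$. Applying the chain rule from the output back toward the input:

$$\delta_2 = \frac{\partial J}{\partial z_2} = \hat{y} - y, \qquad \frac{\partial J}{\partial W_2} = h^\top \delta_2, \qquad \frac{\partial J}{\partial b_2} = \delta_2,$$

$$\delta_1 = \frac{\partial J}{\partial z_1} = (\delta_2 W_2^\top) \circ \sigma'(z_1), \qquad \frac{\partial J}{\partial W_1} = x^\top \delta_1, \qquad \frac{\partial J}{\partial b_1} = \delta_1.$$

For a batch, stacking the examples as rows of $X$ turns the weight gradients into matrix products and the bias gradients into sums over the rows.

Code: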

def forward_backward_prop(X, labels, params, dimensions):
    """
    Forward and backward propagation for a two-layer sigmoidal network

    Compute the forward propagation and for the cross entropy cost,
    the backward propagation for the gradients for all parameters.

    Notice the gradients computed here are different from the gradients in
    the assignment sheet: they are w.r.t. weights, not inputs.

    Arguments:
    X -- M x Dx matrix, where each row is a training example x.
    labels -- M x Dy matrix, where each row is a one-hot vector.
    params -- Model parameters, these are unpacked for you.
    dimensions -- A tuple of input dimension, number of hidden units
                  and output dimension
    """

    # Unpack network parameters (do not modify)
    ofs = 0
    Dx, H, Dy = (dimensions[0], dimensions[1], dimensions[2])

    W1 = np.reshape(params[ofs:ofs + Dx * H], (Dx, H))
    ofs += Dx * H
    b1 = np.reshape(params[ofs:ofs + H], (1, H))
    ofs += H
    W2 = np.reshape(params[ofs:ofs + H * Dy], (H, Dy))
    ofs += H * Dy
    b2 = np.reshape(params[ofs:ofs + Dy], (1, Dy))

    # YOUR CODE HERE
    # Note: compute cost based on `sum` not `mean`.
    # Forward propagation
    z1 = np.dot(X, W1) + b1  # (M, H)
    h = sigmoid(z1)  # (M, H)
    z2 = np.dot(h, W2) + b2  # (M, Dy)
    y_hat = softmax(z2)  # (M, Dy)
    cost = -np.sum(labels * np.log(y_hat))  # summed over all M examples

    # Backward propagation
    # (M, Dy) gradient of the cross-entropy cost w.r.t. z2 is y_hat - y
    gradz2 = y_hat - labels
    # (H, Dy) dJ/dW2 = h.T . dz2, with h.T (H, M) and dz2 (M, Dy)
    gradW2 = np.dot(h.T, gradz2)
    # (1, Dy) dJ/db2: sum dz2 (M, Dy) over the examples (axis 0)
    gradb2 = np.sum(gradz2, axis=0, keepdims=True)
    # (M, H) dJ/dz1 = (dz2 . W2.T) * sigmoid'(z1), an elementwise (Hadamard)
    # product; sigmoid'(z1) is computed from h = sigmoid(z1)
    gradz1 = np.dot(gradz2, W2.T) * sigmoid_grad(h)
    # (Dx, H) dJ/dW1 = X.T . dz1, with X.T (Dx, M) and dz1 (M, H)
    gradW1 = np.dot(X.T, gradz1)
    # (1, H) dJ/db1: sum dz1 (M, H) over the examples (axis 0)
    gradb1 = np.sum(gradz1, axis=0, keepdims=True)
    # END YOUR CODE
    
    # Stack gradients (do not modify)
    grad = np.concatenate((gradW1.flatten(), gradb1.flatten(), gradW2.flatten(), gradb2.flatten()))

    return cost, grad
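
To check the backward pass numerically, one option is to compare the analytic gradient with centered differences over every parameter. The sketch below is a compact, self-contained restatement rather than the function above: `cost_and_grad` and `softmax_rows` are hypothetical names, the cost is assumed to be summed over examples as the starter-code note asks, and the sizes, random data, and tolerances are illustrative.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def softmax_rows(x):
    e = np.exp(x - np.max(x, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)


def cost_and_grad(X, labels, params, dims):
    """Summed cross-entropy cost and flat gradient (order: W1, b1, W2, b2)."""
    Dx, H, Dy = dims
    ofs = Dx * H + H
    W1 = params[:Dx * H].reshape(Dx, H)
    b1 = params[Dx * H:ofs].reshape(1, H)
    W2 = params[ofs:ofs + H * Dy].reshape(H, Dy)
    b2 = params[ofs + H * Dy:].reshape(1, Dy)

    # Forward pass
    h = sigmoid(X @ W1 + b1)
    y_hat = softmax_rows(h @ W2 + b2)
    cost = -np.sum(labels * np.log(y_hat))

    # Backward pass
    dz2 = y_hat - labels
    dz1 = (dz2 @ W2.T) * h * (1 - h)
    grad = np.concatenate([(X.T @ dz1).ravel(), dz1.sum(axis=0),
                           (h.T @ dz2).ravel(), dz2.sum(axis=0)])
    return cost, grad


# Hypothetical sizes and random data, for illustration only.
rng = np.random.default_rng(0)
M, dims = 20, (10, 5, 10)
X = rng.standard_normal((M, dims[0]))
labels = np.eye(dims[2])[rng.integers(0, dims[2], size=M)]  # one-hot rows
params = rng.standard_normal((dims[0] + 1) * dims[1] + (dims[1] + 1) * dims[2])

cost, grad = cost_and_grad(X, labels, params, dims)

# Centered differences on every parameter.
eps = 1e-4
numeric = np.zeros_like(params)
for i in range(params.size):
    step = np.zeros_like(params)
    step[i] = eps
    numeric[i] = (cost_and_grad(X, labels, params + step, dims)[0]
                  - cost_and_grad(X, labels, params - step, dims)[0]) / (2 * eps)

assert np.allclose(grad, numeric, rtol=1e-4, atol=1e-5)
```

Because the cost and the gradient are scaled the same way, a check like this passes for either a summed or an averaged cost, so it cannot tell which convention a solution uses.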
