Part 5: "Attention Is All You Need"

Reading notes on "Attention Is All You Need"

- Link to the original paper
- Paper walkthrough
  - Introduction to sequence models
- Different kinds of attention mechanisms
  - Multi-layer perceptron (Bahdanau et al. 2015)
  - Bilinear (Luong et al. 2015)
  - Dot Product (Luong et al. 2015)
  - Scaled Dot Product (Vaswani et al. 2017)
- The paper's Scaled Dot Product in self-attention, in detail
- The paper's Multi-Head Attention, in detail
- The paper's Position-wise Feed-Forward Networks, in detail
- The paper's Positional Encoding, in detail
- Code reproduction, detailed explanation, and my GitHub address

Link to the original paper

"Attention Is All You Need"

Paper walkthrough

Introduction to sequence models

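NLP has many sequence problems, for example speech recognition, machine translation, sentiment classification, image captioning, summarization, and question answering.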

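So what is a sequence model? In my view, a sequence model is a model whose input, output, or both are sequences; it transforms an input into an output sequence-wise. Sequence models can be divided into one-to-many, many-to-one, and many-to-many models. In a one-to-many model the input is a single item and the output is a sequence, as in image captioning. In a many-to-one model the input is a sequence and the output is a single label, as in sentiment classification. Many-to-many models come in two kinds. In the first, the input and output sequences have the same length, as in NER and other sequence labeling tasks, where every token of the input gets its own label. In the second, the input and output sequences have different lengths, as in machine translation.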

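At present, many-to-many sequence transduction models are mostly built on RNNs, CNNs, or their variants. RNN sequence models can handle variable-length input, but they have two drawbacks: they cannot be computed in parallel across time steps, and information gets lost when the sequence is too long. CNNs can be computed in parallel and capture local dependencies well in the encoder, but long-range dependencies require many stacked convolution layers; in other words, they are limited by the CNN's receptive field. Neural GPU, ByteNet, and ConvS2S are among the better-known CNN-based sequence models.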

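For how the attention mechanism works in RNNs, see the article I wrote on that topic.

So how can we solve both the long-range dependency problem and the parallel computation problem? The Transformer addresses both.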

Different kinds of attention mechanisms

Multi-layer perceptron (Bahdanau et al. 2015)

Let q be the query, k the key, and a the similarity. The similarity between q and k can then be written as:

$$a(q,k)=W_{2}^{T}\tanh(W_{1}[q;k]) \tag{1}$$

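Here $[q;k]$ is the concatenation of q and k, and $W_{1}$ and $W_{2}^{T}$ are trainable parameters. The advantage of this method is that it works relatively well on large datasets.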
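As a rough illustration of equation (1), here is a minimal sketch in plain Python. The function name `additive_score` and the toy values of `W1` and `w2` are assumptions made up for this example (in a real model they are learned), and `w2` plays the role of $W_2$ as a vector.

```python
import math

def additive_score(q, k, W1, w2):
    """Compute a(q, k) = w2^T tanh(W1 [q; k]) on plain Python lists."""
    qk = q + k  # list concatenation plays the role of [q; k]
    hidden = [math.tanh(sum(w * x for w, x in zip(row, qk))) for row in W1]
    return sum(w * h for w, h in zip(w2, hidden))

# Hypothetical toy parameters: q has 2 dims, k has 3 dims, hidden size is 2.
q = [1.0, 0.0]
k = [0.0, 1.0, 0.0]
W1 = [[1.0, 0.0, 0.0, 1.0, 0.0],   # shape: hidden size x (len(q) + len(k))
      [0.0, 1.0, 0.0, 0.0, 1.0]]
w2 = [1.0, 1.0]

score = additive_score(q, k, W1, w2)
print(score)
```

Because q and k are concatenated before the first layer, they do not need to have the same dimension here.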

Bilinear (Luong et al. 2015)

Let q be the query, k the key, and a the similarity. The similarity between q and k can then be written as:

$$a(q,k)=q^{T}Wk \tag{2}$$

Here $W$ is a trainable parameter matrix.

Dot Product (Luong et al. 2015)

Let q be the query, k the key, and a the similarity. The similarity between q and k can then be written as:

$$a(q,k)=q^{T}k \tag{3}$$

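The advantage of this method is that it needs no parameters, but q and k must have the same dimension, otherwise the product cannot be computed. This is where it differs from Bilinear: because Bilinear inserts a parameter matrix W in the middle, q and k do not need to have the same dimension.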
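The difference between equations (2) and (3) can be sketched in a few lines of plain Python. The helper names `dot_score` and `bilinear_score` and all numbers are hypothetical and chosen only for illustration.

```python
def dot_score(q, k):
    """a(q, k) = q^T k; only defined when q and k have the same dimension."""
    if len(q) != len(k):
        raise ValueError("dot product needs q and k of the same dimension")
    return sum(a * b for a, b in zip(q, k))

def bilinear_score(q, W, k):
    """a(q, k) = q^T W k; W has len(q) rows and len(k) columns."""
    Wk = [sum(w * x for w, x in zip(row, k)) for row in W]
    return dot_score(q, Wk)

q = [1.0, 2.0]                # 2-dimensional query
k3 = [1.0, 0.0, -1.0]         # 3-dimensional key
W = [[1.0, 0.0, 0.0],         # hypothetical 2 x 3 parameter matrix
     [0.0, 1.0, 1.0]]

bilinear = bilinear_score(q, W, k3)   # works although the dimensions differ
same_dim = dot_score(q, [3.0, 4.0])   # works because both have 2 dimensions
print(bilinear, same_dim)
```

Calling `dot_score(q, k3)` raises a `ValueError`, which mirrors the dimension restriction described above.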

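Scaled Dot Product (Vaswani et al. 2017)

Dot Product brings the following problem. Picture the S-shaped curve that softmax traces in the two-class case: its slope first increases and then decreases, that is, the second derivative is positive at first and negative afterwards. When the dimension is large, the dot product tends to be large in magnitude, which pushes softmax into the region where the second derivative is negative and the curve is flat. (The paper's argument: if the components of q and k are independent with mean 0 and variance 1, their dot product has mean 0 and variance $d_k$.) What does this cause? During backpropagation the gradients in that region are very small, so training slows down. Scaled Dot Product therefore divides by $\sqrt{d_k}$, where $d_k$ is the dimension of k, and the score becomes:

$$a(q,k)=\frac{q^{T}k}{\sqrt{d_k}} \tag{4}$$

This pulls the value of $a(q,k)$ back toward the symmetric point of the softmax curve, where the gradient is larger.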
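The saturation effect can be made concrete with a small sketch. The three keys below are hypothetical toy vectors (they are not zero-mean, so this only illustrates the size effect, not the paper's variance argument), and the quantity $p_i(1-p_i)$ is the diagonal term of the softmax gradient.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

d_k = 512
q = [1.0] * d_k
keys = [[1.0] * d_k, [0.9] * d_k, [0.8] * d_k]  # hypothetical toy keys

raw = [dot(q, k) for k in keys]                 # roughly 512, 460.8, 409.6
scaled = [s / math.sqrt(d_k) for s in raw]      # divided by sqrt(d_k)

p_raw = softmax(raw)
p_scaled = softmax(scaled)

# Diagonal of the softmax Jacobian: d p_i / d s_i = p_i * (1 - p_i)
grad_raw = [p * (1.0 - p) for p in p_raw]
grad_scaled = [p * (1.0 - p) for p in p_scaled]

print(max(grad_raw), max(grad_scaled))
```

With the raw scores the softmax output is practically one-hot and every gradient term is vanishingly small; after scaling, the distribution is softer and the gradient terms are of a usable size.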

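The paper's Scaled Dot Product in self-attention, in detail

Since this is the attention mechanism the Transformer uses, I will walk through a more detailed example here. The paper illustrates the self-attention mechanism with a flow diagram:

[Figure: flow diagram of the self-attention mechanism, from the paper]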

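Note that the attention formula in the paper is not quite the same as equation (4). The paper writes it in matrix form, with the softmax and the multiplication by V included:

$$Attention(Q,K,V)=softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V \tag{5}$$

Suppose Query = "我已經閱讀了這篇博客" ("I have already read this blog post"), Key = "我已經閱讀了這篇博客", and Value = Key = "我已經閱讀了這篇博客". In equation (5), $\sqrt{d_k}$ is not the square root of the vector or matrix k. In the paper, $d_k$ in the denominator is the dimension of each token's vector, so $\sqrt{d_k}$ is the square root of that dimension.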

We set Query and Key to the same sentence for a reason: the Transformer uses self-attention, and in self-attention the Query and Key are the same. Treating each character as one token, we then have

$$Key_0=\text{我},\ Key_1=\text{已},\ Key_2=\text{經},\ Key_3=\text{閱},\ Key_4=\text{讀}$$

$$Key_5=\text{了},\ Key_6=\text{這},\ Key_7=\text{篇},\ Key_8=\text{博},\ Key_9=\text{客}$$

Suppose each character is represented by a 512-dimensional vector, so $\sqrt{d_k}=\sqrt{512}$. We also know that Query is a matrix of size $10\times 512$, as follows:

$$
Query=\begin{bmatrix}
w_{1,1} & w_{1,2} & w_{1,3} & \cdots & w_{1,512} \\
w_{2,1} & w_{2,2} & w_{2,3} & \cdots & w_{2,512} \\
w_{3,1} & w_{3,2} & w_{3,3} & \cdots & w_{3,512} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
w_{10,1} & w_{10,2} & w_{10,3} & \cdots & w_{10,512}
\end{bmatrix}_{10\times 512} \tag{6}
$$

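Note that we represent each token's vector as a row vector here. So in the matrix above, $w_{1,1}$ is the value of the 1st dimension of the vector for "我", $w_{1,2}$ is the value of its 2nd dimension, and $w_{1,512}$ is the value of its 512th dimension. Continuing in the same way, $w_{10,512}$ is the value of the 512th dimension of the vector for "客". If you prefer to represent token vectors as column vectors, that is fine too, and the derivation below goes through in the same way; the only difference is that rows and columns swap their meaning in the results.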

Similarly,

$$
Key=\begin{bmatrix}
w_{1,1} & w_{1,2} & w_{1,3} & \cdots & w_{1,512} \\
w_{2,1} & w_{2,2} & w_{2,3} & \cdots & w_{2,512} \\
w_{3,1} & w_{3,2} & w_{3,3} & \cdots & w_{3,512} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
w_{10,1} & w_{10,2} & w_{10,3} & \cdots & w_{10,512}
\end{bmatrix}_{10\times 512} \tag{7}
$$

$$
Value=\begin{bmatrix}
w_{1,1} & w_{1,2} & w_{1,3} & \cdots & w_{1,512} \\
w_{2,1} & w_{2,2} & w_{2,3} & \cdots & w_{2,512} \\
w_{3,1} & w_{3,2} & w_{3,3} & \cdots & w_{3,512} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
w_{10,1} & w_{10,2} & w_{10,3} & \cdots & w_{10,512}
\end{bmatrix}_{10\times 512} \tag{8}
$$

From equation (5) we know that

$$
\begin{aligned}
&\frac{QK^T}{\sqrt{d_k}} =\frac{QK^T}{\sqrt{512}}=\frac{Query*Key^{T}}{\sqrt{512}}\\
&=\frac{
\begin{bmatrix}
w_{1,1} & w_{1,2} & w_{1,3} & \cdots & w_{1,512} \\
w_{2,1} & w_{2,2} & w_{2,3} & \cdots & w_{2,512} \\
w_{3,1} & w_{3,2} & w_{3,3} & \cdots & w_{3,512} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
w_{10,1} & w_{10,2} & w_{10,3} & \cdots & w_{10,512}
\end{bmatrix}_{10\times 512}
*
\begin{bmatrix}
w_{1,1} & w_{2,1} & w_{3,1} & \cdots & w_{10,1} \\
w_{1,2} & w_{2,2} & w_{3,2} & \cdots & w_{10,2} \\
w_{1,3} & w_{2,3} & w_{3,3} & \cdots & w_{10,3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
w_{1,512} & w_{2,512} & w_{3,512} & \cdots & w_{10,512}
\end{bmatrix}_{512\times 10}
}{\sqrt{512}} \\
&=\begin{bmatrix}
(w_{1,1}*w_{1,1}+w_{1,2}*w_{1,2}+\cdots+w_{1,512}*w_{1,512})/\sqrt{512} & \cdots & (w_{1,1}*w_{10,1}+w_{1,2}*w_{10,2}+\cdots+w_{1,512}*w_{10,512})/\sqrt{512} \\
\vdots & \ddots & \vdots \\
(w_{10,1}*w_{1,1}+w_{10,2}*w_{1,2}+\cdots+w_{10,512}*w_{1,512})/\sqrt{512} & \cdots & (w_{10,1}*w_{10,1}+w_{10,2}*w_{10,2}+\cdots+w_{10,512}*w_{10,512})/\sqrt{512}
\end{bmatrix}_{10\times 10}
\end{aligned} \tag{9}
$$

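In the result above, the 1st element of the 1st row is the product of the row vector for "我" with the transpose of the row vector for "我", so it represents the similarity between "我" and "我". The last element of the 1st row is the dot product of "我" and "客", so it represents the similarity between "我" and "客". The first row therefore holds the inner products (similarities) of "我" with every character of the sentence "我已經閱讀了這篇博客". Likewise, every row holds the inner products (similarities) of the corresponding character with all the characters.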
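To see the shape and meaning of this score matrix on something small, here is a sketch with 3 tokens of 4 dimensions each instead of 10 tokens of 512 dimensions. The embedding values in `X` are made up for the example.

```python
import math

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

# Hypothetical toy embeddings: one row per token, d_k = 4.
X = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 2.0],
     [1.0, 1.0, 1.0, 1.0]]
d_k = len(X[0])

Q, K = X, X  # Query and Key are the same, as in the example above
scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, transpose(K))]

for row in scores:
    print(row)
```

Row i of `scores` holds the scaled dot products of token i with every token, and the matrix has one row and one column per token.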

Next, we apply softmax to the result above. Note that softmax normalizes row by row. Suppose the result of applying softmax to equation (9) is

$$
A=\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,10}\\
a_{2,1} & a_{2,2} & \cdots & a_{2,10}\\
\vdots & \vdots & \ddots & \vdots \\
a_{10,1} & a_{10,2} & \cdots & a_{10,10}
\end{bmatrix}_{10\times 10}
$$

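This makes the result clearer:

$a_{1,1}$ is the normalized similarity between "我" and "我"

$a_{1,2}$ is the normalized similarity between "我" and "已"

$a_{1,10}$ is the normalized similarity between "我" and "客"

$a_{10,1}$ is the normalized similarity between "客" and "我"

$a_{10,2}$ is the normalized similarity between "客" and "已"

$a_{10,10}$ is the normalized similarity between "客" and "客"

Here is something to think about: are $a_{1,10}$ and $a_{10,1}$ equal? In general they are not. Because Query and Key are identical in this example, the scores before softmax are symmetric, so the two corresponding entries of equation (9) are equal. But softmax normalizes each row by that row's own sum, and rows 1 and 10 generally have different sums, so $a_{1,10}$ and $a_{10,1}$ generally differ.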
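The question about $a_{1,10}$ and $a_{10,1}$ can be checked numerically. The sketch below uses a hypothetical symmetric $3\times 3$ score matrix `S` (standing in for the $10\times 10$ matrix of equation (9)) and applies softmax row by row.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical symmetric scores, as produced when Query and Key are identical.
S = [[1.0, 0.0, 1.0],
     [0.0, 4.0, 2.0],
     [1.0, 2.0, 2.0]]

A = [softmax(row) for row in S]  # normalize each row independently

print(S[0][1], S[1][0])  # equal before softmax
print(A[0][1], A[1][0])  # different after row-wise softmax
```

Each row of `A` sums to 1, but because every row is divided by its own sum, the symmetry of `S` does not carry over to `A`.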

With A in hand, we multiply A by Value (a matrix multiplication).

$$
\begin{aligned}
&Attention(Q,K,V) =A*Value\\
&=\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,10}\\
a_{2,1} & a_{2,2} & \cdots & a_{2,10}\\
\vdots & \vdots & \ddots & \vdots \\
a_{10,1} & a_{10,2} & \cdots & a_{10,10}
\end{bmatrix}_{10\times 10}
*
\begin{bmatrix}
w_{1,1} & w_{1,2} & w_{1,3} & \cdots & w_{1,512} \\
w_{2,1} & w_{2,2} & w_{2,3} & \cdots & w_{2,512} \\
w_{3,1} & w_{3,2} & w_{3,3} & \cdots & w_{3,512} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
w_{10,1} & w_{10,2} & w_{10,3} & \cdots & w_{10,512}
\end{bmatrix}_{10\times 512}\\
&=\begin{bmatrix}
(a_{1,1}*w_{1,1}+a_{1,2}*w_{2,1}+\cdots+a_{1,10}*w_{10,1}) & \cdots & (a_{1,1}*w_{1,512}+a_{1,2}*w_{2,512}+\cdots+a_{1,10}*w_{10,512}) \\
\vdots & \ddots & \vdots \\
(a_{10,1}*w_{1,1}+a_{10,2}*w_{2,1}+\cdots+a_{10,10}*w_{10,1}) & \cdots & (a_{10,1}*w_{1,512}+a_{10,2}*w_{2,512}+\cdots+a_{10,10}*w_{10,512})
\end{bmatrix}_{10\times 512}
\end{aligned} \tag{10}
$$

In the $Attention(Q,K,V)$ result above, take for example the first element of the first row,

$$(a_{1,1}*w_{1,1}+a_{1,2}*w_{2,1}+\cdots+a_{1,10}*w_{10,1})$$

In the term $a_{1,1}*w_{1,1}$, $a_{1,1}$ is the similarity between "我" and "我", and $w_{1,1}$ is the value of the 1st dimension of "我" in Value. Likewise, in the term $a_{1,10}*w_{10,1}$, $a_{1,10}$ is the similarity between "我" and "客", and $w_{10,1}$ is the value of the 1st dimension of "客" in Value. So the first element of the first row of $Attention(Q,K,V)$ is the value of the 1st dimension of the output vector for the Query token "我" after it attends over Value, that is, a similarity-weighted sum of the 1st dimension of every token's Value vector. More broadly, the first row of $Attention(Q,K,V)$ is the output representation of the Query token "我" after attention over Value.
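Putting equations (5), (9), and (10) together, a minimal sketch of scaled dot-product attention in plain Python could look like this. The helper names and the $3\times 4$ toy matrix `X` are assumptions for illustration; there is no masking, batching, or learned projection here.

```python
import math

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with softmax applied row by row."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    A = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(A, V), A

# Hypothetical toy embeddings: 3 tokens, 4 dimensions each.
X = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 2.0],
     [1.0, 1.0, 1.0, 1.0]]

output, A = attention(X, X, X)  # self-attention: Query = Key = Value
print(output[0])                # new representation of the first token
```

Each output row is a weighted average of the rows of Value, with the weights taken from the matching row of `A`, which is the interpretation given above for the first row.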

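That completes the paper's self-attention mechanism.

The paper's Multi-Head Attention, in detail

The paper uses not only Scaled Dot Product Attention but also Multi-Head Attention. The paper illustrates its principle with a diagram:

[Figure: diagram of Multi-Head Attention, from the paper]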

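In the previous section we computed self-attention with $d_k=512$. In fact, neither the paper nor the code released with it computes self-attention directly at $d_k=512$. The computation proceeds as follows:

① First, linearly project Q, K, and V 8 times each (the "h times" in the paper, with h=8), mapping each character's vector from 512 dimensions to 64 dimensions. This produces 8 sets of Q, K, V, and $d_k$ becomes 64.

② Then, run the self-attention operation from the previous section on each of the 8 sets of Q, K, V, which gives 8 results.

③ Then, concatenate the 8 results from ②, so that each character's vector is back to 512 dimensions.

④ Finally, apply one more linear projection to the concatenated result. This ends the Multi-Head Attention operation.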

Let us illustrate this with a concrete example. Suppose the input sentence is "我已經閱讀了這篇博客", that is, Query = "我已經閱讀了這篇博客", Key = "我已經閱讀了這篇博客", and Value = "我已經閱讀了這篇博客".

$$
Query=\begin{bmatrix}
w_{1,1} & w_{1,2} & w_{1,3} & \cdots & w_{1,512} \\
w_{2,1} & w_{2,2} & w_{2,3} & \cdots & w_{2,512} \\
w_{3,1} & w_{3,2} & w_{3,3} & \cdots & w_{3,512} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
w_{10,1} & w_{10,2} & w_{10,3} & \cdots & w_{10,512}
\end{bmatrix}_{10\times 512} \tag{11}
$$

$$
Key=\begin{bmatrix}
w_{1,1} & w_{1,2} & w_{1,3} & \cdots & w_{1,512} \\
w_{2,1} & w_{2,2} & w_{2,3} & \cdots & w_{2,512} \\
w_{3,1} & w_{3,2} & w_{3,3} & \cdots & w_{3,512} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
w_{10,1} & w_{10,2} & w_{10,3} & \cdots & w_{10,512}
\end{bmatrix}_{10\times 512} \tag{12}
$$

$$
Value=\begin{bmatrix}
w_{1,1} & w_{1,2} & w_{1,3} & \cdots & w_{1,512} \\
w_{2,1} & w_{2,2} & w_{2,3} & \cdots & w_{2,512} \\
w_{3,1} & w_{3,2} & w_{3,3} & \cdots & w_{3,512} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
w_{10,1} & w_{10,2} & w_{10,3} & \cdots & w_{10,512}
\end{bmatrix}_{10\times 512} \tag{13}
$$

① First, project Q, K, and V 8 times. Because every projection has its own parameter matrix, we obtain 8 sets of Q, K, V matrices, as follows, with i running from 1 to 8:

$$
\begin{aligned}
Q_{i}^{P}&=Q*W_i^Q \\
&=\begin{bmatrix}
w_{1,1} & w_{1,2} & w_{1,3} & \cdots & w_{1,512} \\
w_{2,1} & w_{2,2} & w_{2,3} & \cdots & w_{2,512} \\
w_{3,1} & w_{3,2} & w_{3,3} & \cdots & w_{3,512} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
w_{10,1} & w_{10,2} & w_{10,3} & \cdots & w_{10,512}
\end{bmatrix}_{10\times 512}
*
\begin{bmatrix}
t_{1,1}^{Q,i} & t_{1,2}^{Q,i} & t_{1,3}^{Q,i} & \cdots & t_{1,64}^{Q,i} \\
t_{2,1}^{Q,i} & t_{2,2}^{Q,i} & t_{2,3}^{Q,i} & \cdots & t_{2,64}^{Q,i} \\
t_{3,1}^{Q,i} & t_{3,2}^{Q,i} & t_{3,3}^{Q,i} & \cdots & t_{3,64}^{Q,i} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
t_{512,1}^{Q,i} & t_{512,2}^{Q,i} & t_{512,3}^{Q,i} & \cdots & t_{512,64}^{Q,i}
\end{bmatrix}_{512\times 64} \\
&=\begin{bmatrix}
w_{1,1}*t_{1,1}^{Q,i}+\cdots+w_{1,512}*t_{512,1}^{Q,i} & \cdots & w_{1,1}*t_{1,64}^{Q,i}+\cdots+w_{1,512}*t_{512,64}^{Q,i} \\
\vdots & \ddots & \vdots \\
w_{10,1}*t_{1,1}^{Q,i}+\cdots+w_{10,512}*t_{512,1}^{Q,i} & \cdots & w_{10,1}*t_{1,64}^{Q,i}+\cdots+w_{10,512}*t_{512,64}^{Q,i}
\end{bmatrix}_{10\times 64}
\end{aligned}
$$

$$
\begin{aligned}
K_{i}^{P}&=K*W_i^K \\
&=\begin{bmatrix}
w_{1,1} & w_{1,2} & w_{1,3} & \cdots & w_{1,512} \\
w_{2,1} & w_{2,2} & w_{2,3} & \cdots & w_{2,512} \\
w_{3,1} & w_{3,2} & w_{3,3} & \cdots & w_{3,512} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
w_{10,1} & w_{10,2} & w_{10,3} & \cdots & w_{10,512}
\end{bmatrix}_{10\times 512}
*
\begin{bmatrix}
t_{1,1}^{K,i} & t_{1,2}^{K,i} & t_{1,3}^{K,i} & \cdots & t_{1,64}^{K,i} \\
t_{2,1}^{K,i} & t_{2,2}^{K,i} & t_{2,3}^{K,i} & \cdots & t_{2,64}^{K,i} \\
t_{3,1}^{K,i} & t_{3,2}^{K,i} & t_{3,3}^{K,i} & \cdots & t_{3,64}^{K,i} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
t_{512,1}^{K,i} & t_{512,2}^{K,i} & t_{512,3}^{K,i} & \cdots & t_{512,64}^{K,i}
\end{bmatrix}_{512\times 64} \\
&=\begin{bmatrix}
w_{1,1}*t_{1,1}^{K,i}+\cdots+w_{1,512}*t_{512,1}^{K,i} & \cdots & w_{1,1}*t_{1,64}^{K,i}+\cdots+w_{1,512}*t_{512,64}^{K,i} \\
\vdots & \ddots & \vdots \\
w_{10,1}*t_{1,1}^{K,i}+\cdots+w_{10,512}*t_{512,1}^{K,i} & \cdots & w_{10,1}*t_{1,64}^{K,i}+\cdots+w_{10,512}*t_{512,64}^{K,i}
\end{bmatrix}_{10\times 64}
\end{aligned}
$$

$$
\begin{aligned}
V_{i}^{P}&=V*W_i^V \\
&=\begin{bmatrix}
w_{1,1} & w_{1,2} & w_{1,3} & \cdots & w_{1,512} \\
w_{2,1} & w_{2,2} & w_{2,3} & \cdots & w_{2,512} \\
w_{3,1} & w_{3,2} & w_{3,3} & \cdots & w_{3,512} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
w_{10,1} & w_{10,2} & w_{10,3} & \cdots & w_{10,512}
\end{bmatrix}_{10\times 512}
*
\begin{bmatrix}
t_{1,1}^{V,i} & t_{1,2}^{V,i} & t_{1,3}^{V,i} & \cdots & t_{1,64}^{V,i} \\
t_{2,1}^{V,i} & t_{2,2}^{V,i} & t_{2,3}^{V,i} & \cdots & t_{2,64}^{V,i} \\
t_{3,1}^{V,i} & t_{3,2}^{V,i} & t_{3,3}^{V,i} & \cdots & t_{3,64}^{V,i} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
t_{512,1}^{V,i} & t_{512,2}^{V,i} & t_{512,3}^{V,i} & \cdots & t_{512,64}^{V,i}
\end{bmatrix}_{512\times 64} \\
&=\begin{bmatrix}
w_{1,1}*t_{1,1}^{V,i}+\cdots+w_{1,512}*t_{512,1}^{V,i} & \cdots & w_{1,1}*t_{1,64}^{V,i}+\cdots+w_{1,512}*t_{512,64}^{V,i} \\
\vdots & \ddots & \vdots \\
w_{10,1}*t_{1,1}^{V,i}+\cdots+w_{10,512}*t_{512,1}^{V,i} & \cdots & w_{10,1}*t_{1,64}^{V,i}+\cdots+w_{10,512}*t_{512,64}^{V,i}
\end{bmatrix}_{10\times 64}
\end{aligned}
$$

In other words, we obtain 8 sets of Q, K, V, namely $(Q_{1}^{P}, K_{1}^{P}, V_{1}^{P})$, $(Q_{2}^{P}, K_{2}^{P}, V_{2}^{P})$, …, $(Q_{8}^{P}, K_{8}^{P}, V_{8}^{P})$. Next, we run self-attention on each of these 8 sets. Using the formula $Attention(Q,K,V)=softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, we get 8 matrices, each of size $10\times 64$, as follows, with i running from 1 to 8:

$$
Attention_i=\begin{bmatrix}
attention_{1,1}^i & attention_{1,2}^i & attention_{1,3}^i & \cdots & attention_{1,64}^i \\
attention_{2,1}^i & attention_{2,2}^i & attention_{2,3}^i & \cdots & attention_{2,64}^i \\
attention_{3,1}^i & attention_{3,2}^i & attention_{3,3}^i & \cdots & attention_{3,64}^i \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
attention_{10,1}^i & attention_{10,2}^i & attention_{10,3}^i & \cdots & attention_{10,64}^i
\end{bmatrix}_{10\times 64} \tag{14}
$$

The 10 in the matrix size $10\times 64$ above is the length of the sentence "我已經閱讀了這篇博客".

Next, we concatenate all the $Attention_i$, that is, $Attention_1$, $Attention_2$, …, $Attention_8$. The concatenated $Attention$ result is as follows:

$$
Attention=\begin{bmatrix}
attention_{1,1}^1 & \cdots & attention_{1,64}^1 & attention_{1,1}^2 & \cdots & attention_{1,64}^2 & \cdots & attention_{1,1}^8 & \cdots & attention_{1,64}^8 \\
attention_{2,1}^1 & \cdots & attention_{2,64}^1 & attention_{2,1}^2 & \cdots & attention_{2,64}^2 & \cdots & attention_{2,1}^8 & \cdots & attention_{2,64}^8 \\
attention_{3,1}^1 & \cdots & attention_{3,64}^1 & attention_{3,1}^2 & \cdots & attention_{3,64}^2 & \cdots & attention_{3,1}^8 & \cdots & attention_{3,64}^8 \\
\vdots & & \vdots & \vdots & & \vdots & & \vdots & & \vdots \\
attention_{10,1}^1 & \cdots & attention_{10,64}^1 & attention_{10,1}^2 & \cdots & attention_{10,64}^2 & \cdots & attention_{10,1}^8 & \cdots & attention_{10,64}^8
\end{bmatrix}_{10\times 512} \tag{15}
$$

Next, we apply one linear projection to the $Attention$ result above. Its weight matrix (the paper's $W^O$) has size $512\times 512$, and its bias vector has size 512. The projected matrix has size $10\times 512$.

That completes the walkthrough of the paper's Multi-Head Attention module.
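The whole procedure (project, attend per head, concatenate, project again) can be sketched in plain Python. To keep it readable, this example assumes a much smaller setup than the paper: 3 tokens, a model dimension of 8, and 2 heads of 4 dimensions each, with random matrices standing in for learned weights. All helper names are hypothetical, and the bias of the output projection is omitted.

```python
import math
import random

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    A = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(A, V)

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    # Steps 1 and 2: project X once per head, then attend within each head.
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Step 3: concatenate the head outputs token by token.
    concat = [[v for head in head_outputs for v in head[i]] for i in range(len(X))]
    # Step 4: final linear projection (bias omitted in this sketch).
    return matmul(concat, W_o), head_outputs, concat

rng = random.Random(0)
seq_len, d_model, h = 3, 8, 2   # the paper uses d_model = 512 and h = 8
d_k = d_model // h              # 4 here, 64 in the paper

X = random_matrix(rng, seq_len, d_model)
heads = [tuple(random_matrix(rng, d_model, d_k) for _ in range(3)) for _ in range(h)]
W_o = random_matrix(rng, d_model, d_model)

output, head_outputs, concat = multi_head_attention(X, heads, W_o)
print(len(output), len(output[0]))  # seq_len x d_model
```

The shapes follow the walkthrough: each head yields a `seq_len x d_k` matrix, concatenation restores `seq_len x d_model`, and the final projection keeps that size.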

The paper's Position-wise Feed-Forward Networks, in detail

The paper already explains this fairly clearly, so I will only give my own brief take here.

The function of this layer is:

$$FFN(x)=\max(0, xW_1+b_1)W_2+b_2$$

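Because the paper stacks 6 encoder layers and 6 decoder layers, the $W_1$, $b_1$, $W_2$, $b_2$ of this sub-layer differ from layer to layer, while within one layer the same parameters are applied at every position. The sub-layer is equivalent to two convolutions with kernel size 1 (1*1). The output of the first one has 2048 dimensions, that is, $d_{ff}=2048$. This part is simple, so I will not go into more detail.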
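A minimal sketch of the position-wise idea, with made-up small sizes (input dimension 2, inner dimension 3) instead of 512 and 2048; the function name `ffn` and all weight values are assumptions for illustration.

```python
def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position x."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(hi * w for hi, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

# Hypothetical toy parameters: d_model = 2, d_ff = 3.
W1 = [[1.0, 0.0, -1.0],
      [0.0, 1.0, 1.0]]          # shape d_model x d_ff
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 2.0],
      [3.0, 4.0],
      [5.0, 6.0]]               # shape d_ff x d_model
b2 = [0.5, 0.5]

sequence = [[1.0, -1.0], [2.0, 0.0], [1.0, -1.0]]   # 3 positions

# "Position-wise": the same parameters are applied to every position independently.
outputs = [ffn(x, W1, b1, W2, b2) for x in sequence]
print(outputs)
```

Positions 1 and 3 have identical inputs and therefore identical outputs, since nothing in this layer looks at neighbouring positions.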

The paper's Positional Encoding, in detail

So that the model can learn position information within the sentence, a Positional Encoding is added to the input embeddings of the encoder (and likewise of the decoder). The positional encoding has dimension $d_{model}$, that is, 512. Its formulas are:

$$
\begin{aligned}
PE_{pos,2i}&=\sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right) \\
&=\sin\left(\frac{pos}{10000^{\frac{2i}{512}}}\right)
\end{aligned} \tag{16}
$$

$$
\begin{aligned}
PE_{pos,2i+1}&=\cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right) \\
&=\cos\left(\frac{pos}{10000^{\frac{2i}{512}}}\right)
\end{aligned} \tag{17}
$$

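Here pos is the position of the character in the sentence, and i indexes the dimension pairs of the encoding ($0 \le i < d_{model}/2$). Counting dimensions from 0, the even dimensions $2i$ are computed with equation (16) and the odd dimensions $2i+1$ with equation (17). (Counting from 1 instead, these are the odd-numbered and even-numbered dimensions respectively.)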
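Equations (16) and (17) translate almost directly into code. The sketch below is a plain-Python illustration; the function name `positional_encoding` and the choice of 10 positions are assumptions for the example.

```python
import math

def positional_encoding(max_len, d_model=512):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            pe[pos][2 * i] = math.sin(angle)      # even dimension, equation (16)
            pe[pos][2 * i + 1] = math.cos(angle)  # odd dimension, equation (17)
    return pe

pe = positional_encoding(10)   # one row per position of a 10-token sentence
print(pe[1][:4])
```

Each row of the table has the same size as a token embedding, so it can be added to the embedding of the token at that position.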

Code reproduction, detailed explanation, and my GitHub address

Full code: https://github.com/haitaifantuan/nlp_paper_understand
