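Deep learning notes 5: sequence models

Week 1: Recurrent sequence models

1.1 Why sequence models

The figure below lists some scenarios where sequence models can be applied. All of them are supervised learning problems:

speech recognition: {X: an audio clip, y: its text transcript}

music generation: {X: may be the empty set, y: the generated music}

sentiment classification: {X: a review, y: the writer's sentiment score}

DNA sequence analysis: {X: a DNA segment, y: whether the segment corresponds to a protein, ...}

machine translation: {X: a sentence in one language, y: the sentence in another language}

video activity recognition: {X: a sequence of video frames, y: the person's activity}

named entity recognition: {X: a sentence, y: the names in it}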

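1.2 Notation

This section defines the symbols used in sequence models and explains how a word is represented.

notation

In a sequence model:

x<t>: the t-th word of a sentence

y<t>: the t-th output

x(i): the i-th training example

x(i)<t>: the t-th word of the i-th example

y(i)<t>: the t-th output of the i-th example

Tx: the number of words in an example

Tx(i): the number of words in the i-th example

Ty: the output length of an example

Ty(i): the output length of the i-th example

Representing words in a sequence model: one-hot

A sequence model first builds a dictionary, for example dictionary = {a, harry, potter, and, hermione, granger, invented, new, spell}. Each word of the sentence x on the slide is then encoded as a vector whose length equals the dictionary size: the entry at the word's position in the dictionary is 1 and all other entries are 0. For example, "a" becomes [1,0,0,0,0,0,0,0,0].

One problem with one-hot encoding is that a word in the sentence may be missing from the dictionary. Such words are mapped to a special unknown-word token, <UNK>.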
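The one-hot encoding described above can be sketched in a few lines of NumPy. This is only an illustration: the helper name one_hot, the token name <UNK> and the extra dictionary slot reserved for it are assumptions, so the vectors here have 10 entries instead of 9.

```python
import numpy as np

# Toy dictionary from the example above, plus an assumed "<UNK>" slot
# for words that are not in the dictionary.
VOCAB = ["a", "harry", "potter", "and", "hermione", "granger",
         "invented", "new", "spell", "<UNK>"]
WORD_TO_INDEX = {word: i for i, word in enumerate(VOCAB)}


def one_hot(word):
    """Return the one-hot vector of `word`, falling back to <UNK>."""
    index = WORD_TO_INDEX.get(word.lower(), WORD_TO_INDEX["<UNK>"])
    vector = np.zeros(len(VOCAB))
    vector[index] = 1.0
    return vector
```

For instance, one_hot("a") has its single 1 in position 0, while any out-of-dictionary word lands on the last position.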

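1.3 Recurrent neural network model

This section introduces the recurrent neural network (RNN).

Drawbacks of a standard neural network on the name recognition task

Suppose we want to solve a name recognition task: given a sentence, identify the names in it. With a standard neural network the setup is as shown in the figure below:

input: a sentence x(i) = (x<1>, x<2>, ..., x<Tx>), where every word is one-hot encoded.

output: y(i) = (y<1>, y<2>, ..., y<Ty>), where y<k> is the probability that the k-th word is part of a name.

Fitting a standard neural network to this task has 2 drawbacks:

1) In real applications the input length Tx differs from sentence to sentence, so the input dimension of the network cannot be fixed; the same holds for the output. With one-hot words the input dimension would also be enormous.

2) A standard network does not share what it learns across positions. If the model decides that the first "Harry" in a passage is a name, it cannot reuse that knowledge when it meets the second "Harry" at another position; it has to learn from scratch whether Harry is a name there.

A recurrent neural network addresses both drawbacks.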

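Structure of a recurrent neural network

The figure below shows the structure of an RNN.

standard neural network: the whole input is fed to the network at once and the whole output is produced at the end; nothing learned at one position is carried over to another.

recurrent neural network: the words x<t> of a sentence x(i) are fed in one at a time, one per time step, and the outputs y<t> are also produced one per time step. Each time step receives the activation of the previous time step, so information from earlier words is carried forward.

A basic (unidirectional) RNN has one weakness: at a given word it can only use information from the words before it, not from the words after it. In the two sentences on the slide, when deciding whether "Teddy" is a name, the RNN can draw on the preceding words only and cannot use the words that follow "Teddy".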

The forward computation of an RNN is as follows.

First, the notation:

Wax: the first subscript, a, means this W is used to compute an a-like quantity; the second subscript, x, means it is multiplied by an x-like quantity.

An RNN has 3 kinds of weight matrices: Wax for the input, Waa for the previous activation, and Wya for the output.

The activation function of the hidden state is usually tanh (sometimes ReLU); the output activation is usually sigmoid (binary classification) or softmax (multi-class classification).

At time step t the activation reuses the activation of time step t-1:

a<t> = g(Waa a<t-1> + Wax x<t> + ba)

The output at time step t is:

yhat<t> = g(Wya a<t> + by)

a<0> is usually initialized to the zero vector.

note that: the parameters Waa, Wax, Wya (and the biases ba, by) are shared by all time steps.
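The two formulas can be turned into a small forward pass. This is a minimal sketch, assuming tanh for the hidden activation and softmax for the output; the function names and the column-vector shapes are choices made for this illustration.

```python
import numpy as np


def softmax(z):
    """Column-wise softmax, shifted for numerical stability."""
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)


def rnn_forward(xs, Waa, Wax, Wya, ba, by):
    """Run a basic RNN over `xs`, a list of (n_x, 1) column vectors.

    The same Waa, Wax, Wya, ba, by are reused at every time step.
    Returns the list of outputs yhat<t> and the last activation.
    """
    a = np.zeros((Waa.shape[0], 1))          # a<0> = 0
    outputs = []
    for x in xs:
        a = np.tanh(Waa @ a + Wax @ x + ba)  # a<t>
        outputs.append(softmax(Wya @ a + by))  # yhat<t>
    return outputs, a
```

Because the weights are shared, the loop works for sequences of any length Tx without changing the parameter shapes.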

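To keep later descriptions short, the following sections use a simplified notation:

Wa = [Waa | Wax], so that a<t> = g(Wa[a<t-1>, x<t>] + ba), where [a<t-1>, x<t>] stacks the two vectors

Wy = Wya, so that yhat<t> = g(Wy a<t> + by)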

1.4 Backpropagation through time

The previous section covered forward propagation in an RNN; this section covers backward propagation.

recap: backward propagation reuses quantities cached during forward propagation, such as a and z.

The most important step is to define the objective function. The loss at time step t is the cross-entropy:

L<t>(yhat<t>, y<t>) = -y<t> log yhat<t> - (1 - y<t>) log(1 - yhat<t>), where y<t> is the target value and yhat<t> is the prediction.

The loss of the whole sequence is the sum of the per-step losses:

L(yhat, y) = sum over t = 1..Ty of L<t>(yhat<t>, y<t>)

With the objective defined, the parameter updates are obtained by differentiating it and propagating the gradients backward through the time steps, from right to left, in the same way as backpropagation in a standard neural network.
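As a small numeric check of the loss, the per-step cross-entropy and its sum over time steps can be written as follows. The function name and the eps clipping (which only guards against log(0)) are assumptions made for this sketch.

```python
import numpy as np


def sequence_loss(y_true, y_pred, eps=1e-12):
    """Sum over time steps of the binary cross-entropy L<t>."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions so that log() never receives exactly 0 or 1.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    per_step = -(y_true * np.log(y_pred)
                 + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(per_step.sum())
```

A sequence whose predictions are closer to the targets gets a smaller loss, which is what gradient descent exploits.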

The RNNs in the previous sections have inputs and outputs of the same length. This section covers other forms of RNN.

1.5 Different types of RNNs

The figure below summarizes some common RNN architectures:

1) one to one: a standard neural network;

2) one to many: sequence generation, e.g. music generation;

3) many to one: e.g. sentiment classification;

4) many to many with Tx = Ty: e.g. name recognition;

5) many to many with Tx != Ty: e.g. machine translation.

Sequence generation involves several further details, which are covered in the next section.

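1.6 Language model and sequence generation

A language model takes a sentence and returns the probability of that sentence.

Outline of training:

The training data is a corpus of text. A sentence is written as (y<1>, y<2>, ..., y<Ty>), where each y<t> is the one-hot vector of a word.

The words of a training sentence are fed to the RNN one per time step: word t is the input of time step t+1, and each time step outputs a softmax distribution over the dictionary, from which we read off the probability p<t> it assigns to the actual word y<t>.

The objective is maximum likelihood: find the parameters W that maximize the likelihood p<1> p<2> ... p<Ty> of the training sentences, or equivalently the sum of the log-probabilities.

note that: the parameters are shared across all time steps.

Once trained, the language model can estimate the probability P(sentence) of any input sentence.

Applications of a language model:

1. machine translation: the first half of a machine translation model only reads input and the second half only produces output (it is a language model). The first half feeds the input words x<t> in one per time step; the second half outputs the predicted words one per time step together with their probabilities yhat<t> (a translation model outputs the word with the highest softmax probability);

2. given one sentence, predicting the probability that a particular sentence follows it (personal understanding: the model is similar to a machine translation model, with an input RNN followed by an output RNN).

The training steps of a language model are as follows.

Step 1: tokenize the training sequence. For example, the first sentence on the slide (cat ... day) is tokenized into (y<1>, ..., y<Ty>), where each y<t> is a one-hot vector.

Two points to note: 1) the end of a sentence is marked with an extra <EOS> (end of sentence) token; 2) a word that is not in the dictionary is replaced by <UNK> (unknown word).

Tokenization is illustrated on the slide below.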

After tokenization, the sequence is fed to the RNN for training, as shown on the slide below.

At the first time step the inputs are a<0> = 0 and x<1> = 0. The activation function gives a<1>, and the softmax output layer gives yhat<1>, a vector of probabilities over the dictionary; with 10,000 words in the dictionary it has 10,000 entries. At the next time step x<2> = y<1> (the true first word); computing a<2> gives the softmax output yhat<2>. Continuing in the same way gives the outputs of all time steps: yhat<1>, yhat<2>, ..., yhat<Ty>.

With these softmax outputs we can build the cost function. It is the maximum-likelihood (cross-entropy) cost, which finds the parameters W that make the training sequences most probable:

L<t>(yhat<t>, y<t>) = - sum over i of y_i<t> log yhat_i<t>

L = sum over t of L<t>(yhat<t>, y<t>)

Here yhat<t> is a vector of word probabilities with the same dimension as the dictionary, and y<t> is the one-hot vector of the training word at time t (word_t). Their inner product is the probability that the model outputs word_t at time t.

Backward propagation on this cost function yields the parameters W of the RNN language model.

The weights are computed as follows:

1) initialize the weights and compute the derivative of the cost function, dL/dw;

2) update the weights by gradient descent (backward propagation): w = w - alpha * dL/dw, where alpha is the learning rate.

1.7 Sampling novel sequences

Sampling from a word-level language model

First, train an RNN language model on a text corpus, as in figure (1) (see the previous section for the training procedure).

Then use the trained language model to sample a new sequence, as in figure (2).

Sampling uses numpy.random.choice, which draws a random element according to a given probability distribution.

The sampling procedure:

Start with a<0> = 0 and x<1> = 0.

At time step 1, use np.random.choice() to draw a word y<1> from the dictionary according to the softmax distribution yhat<1>, and feed its one-hot vector to time step 2 (note that: the softmax has one probability per dictionary word).

At time step 2, x<2> = y<1>. Draw a word y<2> according to yhat<2> with np.random.choice() and feed it to time step 3.

Time step 3 works the same way as time step 2.

Repeat until one of the following happens:

situation 1: np.random.choice() draws <EOS>, the end-of-sentence token;

situation 2: the number of sampled words reaches a preset limit.

The words drawn at the successive time steps, in order, form the sequence sampled from the language model.

note that: the trained language model outputs yhat<t>, the probabilities of all dictionary words, but what is fed to time step t+1 is not yhat<t>. It is the one-hot vector y<t> of the word that was actually sampled.
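The sampling loop can be sketched as below. It is an illustration under assumptions: the parameters are random rather than trained, the function and argument names are made up for this example, and it uses NumPy's Generator.choice (the newer interface to the same idea as np.random.choice).

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def sample_sequence(params, eos_index, max_len, rng):
    """Sample word indices from an RNN language model.

    `params` = (Waa, Wax, Wya, ba, by) with 1-D bias vectors.
    Stops when <EOS> is drawn or after `max_len` words.
    """
    Waa, Wax, Wya, ba, by = params
    vocab_size = Wax.shape[1]
    a = np.zeros(Waa.shape[0])   # a<0> = 0
    x = np.zeros(vocab_size)     # x<1> = 0
    indices = []
    for _ in range(max_len):
        a = np.tanh(Waa @ a + Wax @ x + ba)
        probs = softmax(Wya @ a + by)            # yhat<t>
        idx = int(rng.choice(vocab_size, p=probs))
        indices.append(idx)
        if idx == eos_index:                     # situation 1
            break
        x = np.zeros(vocab_size)                 # feed the sampled word,
        x[idx] = 1.0                             # not the probabilities
    return indices
```

Note that the next input is the one-hot vector of the sampled index, matching the remark above that yhat<t> itself is never fed forward.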

Besides the word-level language model there is the character-level language model, whose dictionary is a set of characters rather than a set of words.

Compared with a word-level model, a character-level model has the following properties:

First, its dictionary can be built from the characters in the training text (punctuation can be included too).

Second, in the sequence (y<1>, y<2>, ..., y<Ty>) each y<t> is a character, not a word.

Third, tokenization never runs into an out-of-dictionary word whose probability cannot be computed, because every word can be spelled out in characters.

Fourth, its main drawback is that sequences become much longer, because text is split into characters instead of words. As a result it is worse than a word-level model at capturing dependencies between the early and late parts of a sentence.

Fifth, it is much more expensive to train in computation and hardware. Word-level models are therefore still the more common choice in industry, and character-level models are mostly used when the text contains many out-of-dictionary words. (There are other use cases; to be collected later.)

The figure below shows a character-level language model.

1.8 Vanishing gradients with RNNs

vanishing gradient

Vanishing gradients make it hard for an RNN to use information from a time step t-n that is far from the current time step t. For example, consider the two sentences on the slide:

the cat ... was ...

the cats ... were ...

The early and late parts of each sentence are linked: cat/cats determines was/were.

Because of vanishing gradients, the error backpropagated from the cost at a late time step can hardly influence the computations at early time steps. The RNN therefore struggles to carry what it saw early in the sequence over to later time steps.

exploding gradient

Exploding gradients make the weights very large and can break training; the typical symptom is numerical overflow during computation. One remedy is gradient clipping: when the gradient vector exceeds a threshold (maximum), rescale it so that its values cannot become too large.
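One way to implement the rescaling is to clip by the L2 norm of the gradient, sketched below. This is an assumed variant chosen for illustration; another common variant simply clips every element to the range [-threshold, threshold].

```python
import numpy as np


def clip_by_norm(grad, max_norm):
    """Rescale `grad` so that its L2 norm is at most `max_norm`."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        # Keep the direction, shrink the length to max_norm.
        return grad * (max_norm / norm)
    return grad
```

Gradients below the threshold pass through unchanged, so clipping only intervenes when a gradient explodes.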

In practice exploding gradients are easier to handle than vanishing gradients. The next section presents a robust remedy for vanishing gradients: the GRU.

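1.9 Gated recurrent unit (GRU)

The GRU modifies the hidden unit of an RNN so that information from early time steps can be used at much later time steps, which mitigates the vanishing gradient problem.

RNN visualization

The figure below shows one RNN unit.

The unit at time step t uses tanh as its activation function. Its activation a<t> is passed on to time step t+1 and is also used to produce the softmax output yhat<t>. In practice this basic RNN mainly uses information from the immediately preceding time steps and cannot use information from distant ones. This is the vanishing gradient effect: as time advances, a<0> is multiplied by Wa again and again (Wa Wa ... Wa a<0>), so by time step t the influence of a<0> on a<t> is very weak and the knowledge held in a<0> is lost.

The GRU was introduced to solve this problem.

GRU

The figure below shows a standard GRU.

Using GRUs in the RNN lets information from early time steps reach late time steps.

Take the example sentence: the cat, ..., was ...

With GRUs, the RNN at the position of "was" can still access the information from the position of "cat", and therefore outputs "was" rather than "were".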

How the GRU works

First, some definitions:

c<t> = a<t>: a<t> is the activation at time step t, and c<t> is the memory cell output by time step t. In a GRU the two are equal.

c_tilde<t> = tanh(Wc[c<t-1>, x<t>] + bc): the candidate value for the memory cell at time step t, used to compute c<t>.

gamma_u = sigmoid(Wu[c<t-1>, x<t>] + bu): the update gate, which decides whether the memory cell is updated at this time step. gamma_u lies between 0 and 1.

c<t> = gamma_u * c_tilde<t> + (1 - gamma_u) * c<t-1>: if gamma_u = 1, the new memory cell drops the previous cell c<t-1> entirely and stores only the candidate from the current time step; if gamma_u = 0, the previous cell is copied unchanged. The gate at each time step thus decides how the memory cell mixes the previous memory and the current candidate.

Using the example sentence:

At "cat", the memory cell stores the "cat" information.

At "which", the computed gamma_u keeps the "cat" information.

...

At "was", the "cat" information is used to produce the output; afterwards gamma_u can release it, and the memory cell no longer holds the "cat" information.

In the GRU above, c_tilde<t> = tanh(Wc[c<t-1>, x<t>] + bc) always depends on c<t-1>. The full GRU adds a second gate, gamma_r, that decides how much c_tilde<t> depends on c<t-1>:

c_tilde<t> = tanh(Wc[gamma_r * c<t-1>, x<t>] + bc)

gamma_r = sigmoid(Wr[c<t-1>, x<t>] + br)
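Put together, one step of the full GRU looks like the sketch below. The function name, argument order and 1-D vector shapes are assumptions made for this illustration; the equations are the ones listed above.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(c_prev, x, Wc, Wu, Wr, bc, bu, br):
    """One GRU time step; returns c<t> (which equals a<t>)."""
    concat = np.concatenate([c_prev, x])       # [c<t-1>, x<t>]
    gamma_u = sigmoid(Wu @ concat + bu)        # update gate
    gamma_r = sigmoid(Wr @ concat + br)        # gate on c<t-1>
    gated = np.concatenate([gamma_r * c_prev, x])
    c_tilde = np.tanh(Wc @ gated + bc)         # candidate memory
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
```

When gamma_u is close to 0 the cell copies c<t-1> almost unchanged, which is how the "cat" information can survive many time steps.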

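Personal understanding: an RNN with GRUs still uses the maximum-likelihood cost function, and its parameters are still found by backward propagation.

Besides the GRU, researchers have proposed many other gated units for capturing longer-range dependencies (that is, for keeping information over longer sequences so that later time steps can use it); see the literature for details. The next section covers the best known of them, the LSTM.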

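1.10 Long short-term memory (LSTM)

How the LSTM differs from the GRU:

0. The LSTM has 3 gates: gamma_u (update), gamma_f (forget) and gamma_o (output). The GRU has only 2 gates: gamma_u and gamma_r.

1. In the LSTM, a<t> is no longer equal to c<t>. An output gate gamma_o is introduced so that a<t> = gamma_o * tanh(c<t>).

2. In the LSTM, the balance between c<t-1> and c_tilde<t> is no longer set by gamma_u alone. gamma_u and gamma_f separately decide how much of c_tilde<t> and of c<t-1> enters c<t>: c<t> = gamma_u * c_tilde<t> + gamma_f * c<t-1>.

3. The GRU is a simpler model than the LSTM, so it is easier to build deeper networks with it. The LSTM is more powerful and more flexible.

The figure below shows the differences between the GRU and the LSTM.

The figure below summarizes the LSTM equations and shows the LSTM unit.

Some researchers also use a variant of the LSTM that differs in how the 3 gates are computed, for example:

gamma_u = sigmoid(Wu[a<t-1>, x<t>, c<t-1>] + bu): the gate computation also takes c<t-1> as input. This form is called a peephole connection.

note that: in the LSTM figure the activation is a<t> = gamma_o * tanh(c<t>), not a<t> = gamma_o * c<t>.

If you have to choose between the GRU and the LSTM in practice, most people currently try the LSTM first, because it has been around longer and is well proven.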

1.11 Bidirectional RNN (BRNN)

Building a bidirectional RNN

In a bidirectional RNN, the prediction at a time step can use information from earlier time steps and also from later time steps. Note that a BRNN needs the complete sequence before it can make predictions: in speech recognition you have to wait until the speaker has finished, and in NLP you need the complete sentence.

The computation is shown in the figure below.

The forward activations are computed as:

time step 1: a-><1> = g(Wax x<1> + ba)

time step 2: a-><2> = g(Wa[a-><1>, x<2>] + ba)

time step 3: a-><3> = g(Wa[a-><2>, x<3>] + ba)

time step 4: a-><4> = g(Wa[a-><3>, x<4>] + ba)

The backward activations are computed as:

time step 4: a<-<4> = g(Wax x<4> + ba)

time step 3: a<-<3> = g(Wa[a<-<4>, x<3>] + ba)

time step 2: a<-<2> = g(Wa[a<-<3>, x<2>] + ba)

time step 1: a<-<1> = g(Wa[a<-<2>, x<1>] + ba)

The output at time step t is yhat<t> = g(Wy[a-><t>, a<-<t>] + by), which combines information from both earlier and later time steps.

Example:

He said, "Teddy Roosevelt ..."

To decide whether "Teddy" is a name, a unidirectional RNN can only use the words before "Teddy". A bidirectional RNN can use the words before it (He said) and also the words after it (Roosevelt).

question: do the forward and backward computations share the same parameters W and b? (In standard BRNNs they do not: each direction has its own set of parameters, although the formulas above write both as Wa and ba.)

Parameter learning in a BRNN

The cost function of a BRNN is the maximum-likelihood cost, and the parameters are found by backward propagation.

For the forward-direction parameters, forward propagation runs left -> right, so backward propagation runs right -> left.

For the backward-direction parameters, forward propagation runs right -> left, so backward propagation runs left -> right.

note that: in NLP problems, a BRNN with LSTM units is a common choice.

1.12 Deep RNNs

Compared with deep convolutional networks, deep RNNs usually have far fewer layers. Every layer of an RNN is also unrolled along the time dimension, so even an RNN with few layers is already a very large network.

The figure below shows how a deep RNN is built.

First, the notation:

a[l]<t>: [l] is the layer index and <t> is the time step; a[l]<t> is the activation of layer l at time step t.

Taking a[2]<3> as an example, the activation is computed as:

a[2]<3> = g(Wa[2] [a[2]<2>, a[1]<3>] + ba[2])

Some researchers put a few ordinary deep network layers on top of each time step t to predict the output y<t>, as shown in the figure below.

The units of a deep RNN can be GRUs or LSTMs, which strengthens its ability to carry information from early time steps to late ones (to achieve longer-range dependence).

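Week 2: Natural language processing and word embeddings

2.1 Word representation

one-hot vector vs. word embedding

Week 1 introduced one way of representing a word, the one-hot vector, shown in the figure below.

To build one-hot vectors we first construct a dictionary and then represent each word as an n-dimensional vector of 0s and 1s (n = length of the dictionary): the entry at the word's position in the dictionary is 1 and all other entries are 0.

A significant drawback of this representation is that it cannot express how related two words are, because the inner product of the one-hot vectors of any two different words is 0.

To address this, researchers proposed word embeddings. A word embedding algorithm learns a featurized representation of each word that captures how related two words are, as shown in the figure below.

In the figure, a word's feature vector has one value per dimension (gender, royal, age, ..., food, ...). Each value says how strongly the word expresses that attribute (e.g. gender): a large magnitude means the attribute is pronounced for this word, and a value near zero means the word has little to do with it.

The figure also shows that apple and orange have a large inner product, so they are closely related. An algorithm can use this to transfer what it learned about phrases containing "orange" to "apple"; in other words, word embeddings can significantly improve generalization.

Note that the dimensions of an embedding learned by an algorithm are hard to interpret, unlike the hand-made example in the figure where every dimension has a meaning. Even so, learned embeddings are a very good representation of words.

The figure below visualizes words by their featurized representation. The feature vectors clearly reflect the relations between words (t-SNE maps the high-dimensional vectors to a low-dimensional space, which makes the relations easier to see).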



2.2 Using word embeddings

Word embeddings help algorithms generalize

Representing words with embeddings gives an algorithm better generalization. The task in the figure below illustrates this (note that: the RNN in the figure should really be a BRNN).

The figure shows an RNN used for name recognition.

1. Suppose words are represented by one-hot vectors:

If "apple farmer" and "orange farmer" both occur in the training data, the trained BRNN can correctly pick out Robert Lin in "Robert Lin is an apple farmer". But if "durian cultivator" never occurs in the training data, the BRNN may fail to recognize Robert Lin in "Robert Lin is a durian cultivator". With one-hot vectors the model has no notion of relatedness between words, so it cannot tell that "durian cultivator" means something similar to "orange farmer".

2. Suppose words are represented by word embeddings instead. Embeddings encode the relations between words, so even though "durian cultivator" never appeared in the training data, the BRNN can tell from its feature vectors that it is close in meaning to "orange farmer" and infer that Robert Lin is a name.

transfer learning and word embeddings

Word embeddings learned on a large amount of text can be reused in other NLP tasks that have only a small amount of data.

In transfer learning from A to B, the approach works well when A has a lot of data and B has little: learn the embeddings on A's large corpus and use them as feature vectors for the words in task B. When B also has a fairly large dataset, it is worth fine-tuning the embeddings on B's data before using them in task B.

the relationship between face encoding and word embedding

A face encoding and a word embedding are both feature vectors that represent an object, but their scope differs. For face encoding, once a Siamese network is trained it can encode any image: feed the image in and the network outputs its encoding. A word embedding algorithm only produces encodings for the words in its dictionary and has nothing for words outside it.

The figure below shows the Siamese network used for face encoding.

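2.3 Properties of word embeddings

analogy reasoning

Word embeddings can be used for analogy reasoning, as shown in the figure below.

The figure lists the word embeddings (featurized representations) of several words. Embeddings capture how similar words are, and they also support analogy reasoning; for example, from man -> woman we can derive king -> queen. The derivation is:

Solve e_man - e_woman ≈ e_king - e_? (where e_word is the embedding of word).

Equivalently: find the word whose embedding e_? maximizes similarity(e_?, e_king - e_man + e_woman).

The similarity between two embeddings is defined as follows.

Similarity measures

Two measures can be used for the similarity of two words:

1) cosine similarity: sim(u, v) = u^T v / (||u|| ||v||); the larger the value, the more similar;

2) squared Euclidean distance: ||u - v||^2; the larger the value, the less similar.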
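The analogy search with cosine similarity can be sketched as follows. The 2-D embeddings (a gender axis and a royal axis) are hand-made toy values for illustration only, not learned vectors, and the function names are assumptions.

```python
import numpy as np


def cosine_similarity(u, v):
    """sim(u, v) = u.v / (||u|| ||v||)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def complete_analogy(a, b, c, embeddings):
    """Return the word d such that a -> b is like c -> d."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    # Exclude the three query words from the candidates.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates,
               key=lambda w: cosine_similarity(embeddings[w], target))


# Toy embeddings: [gender, royal]. Purely illustrative values.
TOY_EMBEDDINGS = {
    "man": np.array([-1.0, 0.0]),
    "woman": np.array([1.0, 0.0]),
    "king": np.array([-1.0, 1.0]),
    "queen": np.array([1.0, 1.0]),
    "apple": np.array([0.0, 0.1]),
}
```

With these toy vectors, e_woman - e_man + e_king lands exactly on e_queen, so the search returns "queen".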

Visualizing word embeddings

The figure below visualizes the embeddings of several words.

Before dimensionality reduction, man -> woman and king -> queen form a parallelogram, which is what makes the analogy work.

Note that reducing the embeddings with t-SNE (from 300D to 2D) destroys this parallelogram structure, because t-SNE is a non-linear mapping.

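2.4 Embedding matrix

Embedding matrix: the matrix formed by the embedding vectors of all words in the dictionary, as shown in the figure below.

1. notation

o_j: the one-hot vector of word j;

e_j: the embedding of word j;

E: the embedding matrix.

2. embedding matrix

If the dictionary has 10,000 words and each embedding is 300-dimensional, the embedding matrix is 300 x 10,000.

Then e_j = E o_j   (1)

note that: in practice the embedding of word j is not obtained with the multiplication in (1), because multiplying by a mostly-zero one-hot vector wastes computation. Keras, for example, has an Embedding layer that directly looks up column j of the embedding matrix, which is e_j.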
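The equivalence between the multiplication in (1) and a direct column lookup is easy to check numerically. The matrix below is random and the index j is an arbitrary choice, both only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10_000, 300

E = rng.standard_normal((embedding_dim, vocab_size))  # embedding matrix
j = 6257                                              # arbitrary word index

o_j = np.zeros(vocab_size)   # one-hot vector of word j
o_j[j] = 1.0

e_by_matmul = E @ o_j        # formula (1): 300 x 10,000 multiply-adds
e_by_lookup = E[:, j]        # direct lookup of column j
```

Both give the same vector, but the lookup touches only 300 numbers instead of 3 million.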

2.5 Learning word embeddings

This section introduces algorithms for learning word embeddings.

The core idea is illustrated by the example in the figure below.

The task is to learn a language model and, along with it, the word embedding matrix.

training data: a corpus of text, split into pairs {X: a given context, y: the next word to predict}.

For example, from "I want a glass of orange juice", X can be "I want a glass of orange" and y can be "juice".

The task is solved with the following model.

1. model structure:

input: the embeddings of the words in the context.

output: the prediction of the word that follows the context.

architecture: context X -> neural network -> softmax -> prediction y.

2. parameters and hyperparameters:

hyperparameter: n (the previous n words are used to predict the next word);

parameters: the word embedding matrix, the neural network weights and the softmax weights.

3. learning the parameters:

step 1: initialize all parameters;

step 2: the cost function is the maximum-likelihood cost; use gradient descent (backward propagation) on it to compute the parameter updates;

step 3: repeat step 2 until the stopping condition is met.

Choice of context when learning a language model and when learning word embeddings

As shown in the figure below:

1) for learning a language model, the context (input) can be the last 4 words;

2) for learning word embeddings, the context (input) can be:

type 1: the 4 words on the left and the 4 words on the right of the word to predict;

type 2: the last 1 word;

type 3: a nearby 1 word.

The next section introduces one word embedding learning algorithm, Word2Vec.

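2.6 Word2Vec

This section explains a word embedding learning algorithm, Word2Vec. Word2Vec comes in 2 versions, skip-gram and CBOW. Both learn embeddings with the same core idea but choose their (input, target) pairs differently:

skip-gram:

input: sample one word from the training text as the input (context);

target: a word within a fixed window around the context, e.g. sample one of the 10 words before or the 10 words after the context as the word to predict;

output: for each word in the dictionary, the probability that it is the predicted word.

CBOW:

input: the words on both sides of the word to predict (target) serve as the context (input);

output: for each word in the dictionary, the probability that it is the predicted word.

This section focuses on the skip-gram algorithm.

The slide below shows how (context, target) pairs are chosen in skip-gram, following the rules above.

note that: skip-gram does not produce a good language model, because its (context, target) pairs are chosen almost at random, but it does produce very good word embeddings.

The figure below shows the skip-gram model.

Structure: context (word embedding e_c) -> softmax -> probability of each word in the dictionary, where p(t|c) = exp(theta_t^T e_c) / (sum over all dictionary words j of exp(theta_j^T e_c)), t is the target and c is the context.

Cost function: the maximum-likelihood cost L(yhat, y) = - sum over i of y_i log yhat_i, where y is the one-hot vector of the target word and yhat is the probability vector output by the model, so that the sum picks out the log-probability of the target word.

Parameters: the word embedding matrix and the softmax parameters, both found by gradient descent on the maximum-likelihood cost.

skip-gram has one downside: p(t|c) is very expensive to compute, because the denominator sums over the whole dictionary. To address this, researchers proposed the hierarchical softmax, described next.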

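The figure below shows the hierarchical softmax classifier, introduced to reduce the cost of computing p(t|c) in skip-gram.

With a hierarchical softmax classifier, p(t|c) is computed as follows.

First, the setup:

Every internal node of the tree is a logistic classifier.

Every internal node stands for a range of the dictionary. The root stands for the whole dictionary; its left branch stands for the first 5,000 words (call it section_l) and its right branch for the last 5,000 words (section_r). The left branch splits again into {left: the first 2,500 words of section_l, right: the last 2,500 words of section_l}, and so on.

Computing p(t|c):

Personal understanding: the context is fed to the root, and at each internal node the logistic classifier gives the probability of going left or right. Each leaf is one dictionary word, and p(t|c) is the product of the branch probabilities along the path from the root to the leaf of the target word. Because the classifiers at the nodes differ, each dictionary word gets its own probability for the same context.

With a hierarchical softmax classifier, the cost of computing p(t|c) drops from linear in the vocabulary size to logarithmic in the vocabulary size.

note that: in practice the tree is usually not balanced (see the right-hand side of the figure). Frequent words sit near the top and rare words sit deep in the tree, which reduces the average cost of computing p(t|c) further.

How the context (input word) is sampled in skip-gram

The input word should not be sampled uniformly at random from the training text, because the draws would then be dominated by very frequent words such as the, of, a, and, while the words we really care about, such as orange, apple, durian, would rarely be drawn. The algorithm would spend most of its effort on common words that carry little meaning and neglect the words of interest.

To address this, researchers have proposed heuristic strategies for sampling the context; see the literature for details.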

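2.7 Negative sampling

Negative sampling is a modification of the skip-gram algorithm. It turns the problem into a small set of binary classification problems per training example, which avoids the expensive computation of p(t|c).

Building the training data

In negative sampling, a training example has the form (context, word, target).

Each group contains 1 + k training examples that share the same context. Exactly one of them has target = 1 (context and word both come from the training text) and the other k have target = 0 (the context comes from the training text and the word is sampled from the dictionary).

notation:

an example with target = 1 is a positive sample;

an example with target = 0 is a negative sample.

note that: choice of k: for smaller data sets k is 5 to 20; for larger data sets k is 2 to 5.

For target = 1, the word is sampled in the same way as the target in skip-gram (where the training data has the form (context, target)).

For target = 0, the word can be sampled from the dictionary according to a distribution p(w_i) over the dictionary words w_i; the formula for p(w_i) is given on the slide.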
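The formula for p(w_i) did not survive in these notes. The sketch below assumes the heuristic commonly used with word2vec, where p(w_i) is proportional to the word frequency raised to the power 3/4; both that exponent and the function names are assumptions of this example rather than something stated above.

```python
import numpy as np


def negative_sampling_distribution(counts, power=0.75):
    """p(w_i) proportional to count(w_i) ** power (assumed heuristic)."""
    weights = np.asarray(counts, dtype=float) ** power
    return weights / weights.sum()


def sample_negatives(counts, k, rng):
    """Draw k dictionary indices to use as negative words."""
    p = negative_sampling_distribution(counts)
    return rng.choice(len(p), size=k, p=p)
```

An exponent between 0 (uniform over the dictionary) and 1 (raw frequency) draws frequent words often, but less often than their raw frequency would suggest.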

The negative sampling algorithm

The core idea, as shown in the figure below:

training data = (context, word, target);

model input: the embedding of the context, written e_c;

model output: for each of the k + 1 (context, word) pairs in a group, the probability that target = 1, written p(y = 1 | c, t);

model: logistic regression, p(y = 1 | c, t) = sigmoid(theta_t^T e_c), where both theta_t and e_c are parameters to be learned;

cost function: the maximum-likelihood (logistic) cost, minimized by gradient descent.

The main difference from skip-gram is the amount of work per iteration. Skip-gram evaluates a softmax with as many outputs as there are dictionary words, whereas negative sampling only updates k + 1 binary classifiers (1 positive sample and k negative samples), so each iteration is much cheaper.

note that: each group of (context, word, target) examples contains 1 positive sample (target = 1) and k negative samples (target = 0).

2.8 GloVe word vectors

Notation in GloVe

In the GloVe algorithm the training data has the form (context, target), where both are words from the training corpus.

X_ij: the number of times target i and context j occur together.

GloVe also has two sets of parameters, theta_i and e_j; their roles are explained with the algorithm in the figure below.

The GloVe algorithm

The objective function of GloVe is built as shown on the slide. Once it is built, the parameters can be found by gradient descent.

Embeddings learned by GloVe are not interpretable

As shown in the figure below:

The embedding vectors produced by GloVe cannot be interpreted by humans (the learned axes are not necessarily orthogonal). Each axis may blend several attributes; in the figure, e_w,1 combines gender and royal. Although the embedding vectors lack interpretability, they still work well for analogy reasoning.

The formula in the figure, (A theta_i)^T (A^-T e_j) = theta_i^T e_j, shows that 1) the parameters found by GloVe need not be orthogonal, since any invertible matrix A gives the same value; 2) the parameters are a numerical solution, not an analytical one.

Recap: the word embedding learning algorithms covered in this part of the course are:

Word2Vec (skip-gram, CBOW);

negative sampling (close to skip-gram, but avoids the expensive computation of p(t|c));

GloVe.

2.9 Sentiment classification

In sentiment classification you often do not have a huge labeled training set; 10,000 to 100,000 examples is common. Word embeddings help you do well even with a small training set.

This section covers 2 ways of doing sentiment classification with word embeddings.

Averaging the word embeddings

As shown in the figure below:

Average the embedding vectors of all words in the review and feed the average to a softmax, which outputs the probabilities of the 5 possible star ratings.

In this algorithm:

the training data is (review, stars);

the parameters are those of the softmax;

the cost function is the maximum-likelihood cost; gradient descent gives the softmax parameters, which are then used to predict test samples.

Even if some words of a test sample never appeared in the training data of the averaging model, the model can still predict well, because the words are represented by their embedding vectors.

The embeddings can also be trained independently of the averaging model: learn them on some other large dataset and then apply them to the training data of the averaging model. This is the main idea of transfer learning (when transferring A to B, A should be the large dataset and B the small one).

downside: the averaging model ignores word order, so it may give a negative review a positive rating. For example, "lacking in good taste, good service and good ambience" is a negative review, but because it contains "good" so many times the averaging model may rate it as positive. The remedy is to use an RNN for sentiment classification.

Sentiment classification with an RNN and word embeddings

The architecture is shown in the figure below.

The softmax still outputs the probabilities of the 5 star ratings, and the rating with the highest probability is the prediction.

Transfer learning also applies to the RNN model:

task A: learn word embeddings from a large dataset;

task B: sentiment classification with only a small training set;

transferring A to B gives the sentiment classifier.

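2.10 Debiasing word embeddings

This section shows a way of removing biases from word embeddings.

As shown in the figure below:

With embeddings learned by an algorithm, the analogy man -> computer_programmer maps woman to homemaker. This is clearly gender-biased. The bias in the embeddings reflects the gender bias present in the training text. Researchers have proposed many methods for removing it; this section gives one example.

The figure below shows one method for reducing gender bias:

Words such as grandmother and grandfather, girl and boy, she and he are gendered by definition.

Words such as babysitter and doctor are not gendered by definition, so in theory their similarity to she and to he should be equal. In practice it is not. The following steps address this (right-hand side of the figure):

step 1: identify the bias direction, e.g. e_he - e_she;

step 2 (neutralize): project every gender-neutral word onto the direction orthogonal to the bias direction (the non-bias direction), removing its gender component;

step 3 (equalize): move each gendered word pair so that both words are at the same distance from neutral words such as doctor, removing the gender bias in analogy reasoning.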
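Step 2 is a plain vector projection and can be sketched as below. The function name and the toy 3-D vectors are hypothetical; real embeddings would come from a trained model.

```python
import numpy as np


def neutralize(e, g):
    """Remove from embedding `e` its component along bias direction `g`."""
    projection = (e @ g) / (g @ g) * g   # part of e that lies along g
    return e - projection


# Hypothetical toy vectors, for illustration only.
e_he = np.array([1.0, 0.2, 0.0])
e_she = np.array([-1.0, 0.2, 0.0])
bias_direction = e_he - e_she            # step 1

e_doctor = np.array([0.4, 0.5, 0.3])
e_doctor_debiased = neutralize(e_doctor, bias_direction)
```

After neutralizing, the word has no component along the bias direction, so its inner products with e_he and e_she are equal.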

question: when eliminating gender bias, how do we decide which words are neutral and which are gendered?

answer: train a linear classifier to separate them. Word pairs that are gendered by definition are relatively few, so they can be hand-picked and used to label the training data for the linear classifier.

Week 3: Sequence models and attention mechanism

3.1 Basic models

This section introduces two applications of the sequence to sequence model.

machine translation

The figure below shows a machine translation model (from French to English):

The first half of the model is the encoding part and the second half is the decoding part.

image captioning

The figure below shows an image captioning model: given an image, the model outputs a caption for it.

The image is encoded by a convolutional neural network and then decoded by an RNN, which outputs the caption.

3.2 Picking the most likely sentence

A machine translation model can be viewed as a conditional language model, as shown in the figure below.

A language model starts from a<0> and is followed by a decoding part. A machine translation model starts with an encoding part and is followed by a decoding part. If the a<0> of the language model is replaced by the output of the encoding part, the two models are identical, which is why a machine translation model is regarded as a conditional language model. The language model discussed earlier produces its output words by random sampling from the output distribution.

In this section we want the machine translation model to output the most likely translation. How is that achieved?

finding the most likely translation

Because the language model above samples its output words at random from the output distribution, translating the same French sentence several times can give translations of varying quality, as the figure shows.

To find the best translation we set the objective: argmax over y<1>, ..., y<Ty> of P(y<1>, ..., y<Ty> | x). There are 2 approaches.

Approaches to argmax P(y<1>, ..., y<Ty> | x)

1) greedy search

Pick the most probable first word, then the most probable second word given the first, and so on, and output the resulting sequence.

note that: the solution found by greedy search is only a local optimum and does not necessarily maximize P(y<1>, ..., y<Ty> | x).

Greedy search therefore may not find the best translation, as the figure below shows:

Of the 2 translations in the figure, the first is clearly better than the second. But because p(Jane is going | x) > p(Jane is visiting | x), greedy search commits to "Jane is going" and misses the best translation. So greedy search is not a good choice.

2) search algorithm

Suppose the dictionary has 10,000 words and the output sentence has 10 words. To meet the objective exactly we would pick the most probable of the 10,000^10 possible output sentences, which is far too many to enumerate. Instead we design a search algorithm that simplifies the search and finds an approximately optimal solution.

The next section introduces such a search algorithm for finding the best translation.

3.3 Beam search

The core idea of beam search:

step 1: take machine translation as the example and set the beam width to B = 3. When the decoding part outputs the first word, keep the 3 words with the highest probability p(y<1> | x). In the figure these are: in, jane, september.

step 2: for each of the 3 words kept in step 1 as y<1>, evaluate p(y<2> | x, y<1>) for the candidate second words and keep its 3 most probable y<2>. Among the resulting 9 (y<1>, y<2>) pairs, keep the 3 with the highest joint probability p(y<1>, y<2> | x) = p(y<1> | x) p(y<2> | x, y<1>), and continue from these 3 pairs to find y<3>.

step 3: repeat step 2 to find the 3 most probable triples (y<1>, y<2>, y<3>).

Continue until the sentences end, then pick the most probable of the 3 translations.

note that: beam search gives better results than greedy search; with B = 1 it reduces to greedy search.
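The steps above can be sketched on a toy model. Everything here is an assumption made for illustration: next_log_probs stands in for the decoder RNN, TOY_TABLE is an invented three-step distribution, and scores are sums of log-probabilities.

```python
import math


def beam_search(next_log_probs, beam_width, max_len, eos):
    """Return (best_sequence, log_probability).

    `next_log_probs(prefix)` must return {token: log p(token | prefix)}.
    """
    beams = [((), 0.0)]   # (prefix, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # Keep only the `beam_width` most probable extensions.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda item: item[1])


# Invented toy model: "a" looks best at first, but "b z" is best overall.
TOY_TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
}


def toy_next_log_probs(prefix):
    probs = TOY_TABLE.get(prefix, {"<EOS>": 1.0})
    return {token: math.log(p) for token, p in probs.items()}
```

With beam_width=1 (greedy search) the toy model commits to "a" and ends with probability 0.30, while beam_width=2 keeps "b" alive and finds "b z" with probability 0.36.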

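The next section describes refinements that help beam search give better results.

3.4 Refinements to beam search

This section describes one refinement to beam search.

First, the shortcomings of plain beam search, as shown in the figure below:

Formula 1 in the figure is the objective of beam search, the product of the probabilities p(y<t> | x, y<1>, ..., y<t-1>). When the translation is long, the product of many probabilities becomes extremely small and may cause numerical underflow.

To fix this we use formula 2, which takes the log of the objective and so turns the product into a sum of log-probabilities. This objective still has a downside:

A shorter translation necessarily gets a higher score than a longer one, because it multiplies (or sums the logs of) fewer probabilities below 1. The objective is therefore biased towards short translations.

To fix this we use formula 3, which normalizes formula 2 by dividing it by Ty, the length of the translation. After normalization the objective is much less biased with respect to length and selects translations more fairly. In addition, Ty can be raised to a power, Ty^alpha with 0 <= alpha <= 1; the value of alpha controls whether the objective is fully normalized (alpha = 1) or not normalized at all (alpha = 0).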
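The effect of the normalization is easy to see numerically. In this sketch the function name and the default alpha = 0.7 are assumptions (an intermediate value between the two extremes described above), and the probabilities are made up.

```python
import math


def normalized_score(token_probs, alpha=0.7):
    """Sum of log-probabilities divided by Ty ** alpha (formula 3)."""
    log_sum = sum(math.log(p) for p in token_probs)
    return log_sum / (len(token_probs) ** alpha)


short = [0.5, 0.5]             # 2 words, each with probability 0.5
long = [0.6, 0.6, 0.6, 0.6]    # 4 words, each with probability 0.6
```

Without normalization (alpha = 0) the short sentence wins simply because it has fewer factors; with full normalization (alpha = 1) the comparison is per word and the long sentence wins.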

With the modified objective, the best translation is found as follows:

step 1: choose the beam width k (try different values such as 1, 3, 10, ...; in research, to get the best possible results, k may be as large as 1000 to 3000);

step 2: run beam search and, for each length Ty = 1, 2, 3, ..., 30, keep the top k translations of that length;

step 3: score these candidates with formula 3 and output the one with the highest score as the final translation.

3.5 Error analysis in beam search

A machine translation model has two components (see the figure below): 1) beam search; 2) the RNN. When the system makes mistakes, error analysis tells you whether beam search or the RNN is to blame.

The procedure, as shown in the figure below:

Take a French to English translation task, and for the same French sentence compare the human translation (y*) with the algorithm's translation (yhat).

In principle y* should be better than yhat. So if the system outputs yhat, some part of it is at fault, and error analysis finds out which. Use the RNN to compute p(y*|x) and p(yhat|x), then:

case 1: p(y*|x) > p(yhat|x)

Beam search failed to find y* even though the RNN gives it the higher probability. Beam search is at fault; for example, the beam width should be increased.

case 2: p(y*|x) <= p(yhat|x)

The RNN rates the worse translation yhat at least as high as y*, so the RNN is at fault and should be improved (diagnose whether it is a bias or a variance problem and act accordingly: add training data, regularize, or change the RNN architecture).

Applying error analysis in practice, as shown in the figure below:

Run the fitted model on the dev set and collect the mistranslated examples. For each one, if p(y*|x) > p(yhat|x) attribute the error to beam search; otherwise attribute it to the RNN.

Count how many errors are due to beam search and how many to the RNN. The component responsible for the larger share (e.g. beam search) is the one to focus on, using the remedies above.

deeplearning筆記5：序列模型

3.6 Bleu 得分

Bleu score可以作為 a single real number evaluation metric ，來評價machine translation algorithm工作的優劣（對于給定的Franch，Bleu score可以根據reference評價由machine translaition algorithm得到的translation sentence的優劣程度（translation sentence的優劣是以reference為參照物，來界定的））。其中，reference是人工翻譯結果。本節，主要介紹一下Bleu score的大體工作原理，具體詳情，參見PPT下方literature：

notation

deeplearning筆記5：序列模型

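Countclip(“the”): the number of times “the” appears in the MT output, clipped at max(n1, n2), where n1 and n2 are the numbers of times it appears in reference 1 and reference 2;

Count(“the”): the number of times “the” appears in the MT output;

Reference: a sentence translated by a human;

MT output: the output of the machine translation algorithm;
Bleu score unit formula

The MT output sentence is compared with the reference sentences; computing the Bleu score tells us how good the MT output sentence is;

The formula for the Bleu score unit is as follows (see the figure below):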

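The figure below gives the formulas for the Bleu score:

left-hand: the Bleu score on uni-grams, computed over the single words of the MT output;

right-hand: the Bleu score on n-grams, computed over runs of n adjacent words of the MT output;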

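An example follows, computing the Bleu score over pairs of adjacent words (bi-grams) in the MT output:

The left-hand side of the figure below lists the runs of 2 consecutive words in the MT output:

For each of these word pairs, compute its clipped count (countclip: its count in the MT output, capped at the largest number of times it appears in any single reference) and its count in the MT output (count). Sum each of the two columns and divide: P2 = 4/6.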
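The clipped-count computation can be sketched in Python. The figure with the example sentences is not reproduced in these notes, so the sentences below are an assumption (they are chosen so that the bi-gram result matches the 4/6 quoted above); the function names are hypothetical as well.

```python
from collections import Counter


def ngrams(tokens, n):
    """All runs of n adjacent tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(mt_output, references, n):
    """Return (sum of clipped counts, sum of counts) for the n-grams of mt_output."""
    out_counts = Counter(ngrams(mt_output.split(), n))
    # Largest number of times each n-gram occurs in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, c in Counter(ngrams(ref.split(), n)).items():
            max_ref[gram] = max(max_ref[gram], c)
    clipped = sum(min(c, max_ref[gram]) for gram, c in out_counts.items())
    total = sum(out_counts.values())
    return clipped, total


# Assumed example sentences (not taken from the text of these notes).
references = ["the cat is on the mat", "there is a cat on the mat"]
mt_output = "the cat the cat on the mat"

print(modified_precision(mt_output, references, 2))  # (4, 6), i.e. P2 = 4/6
```

Returning the numerator and denominator separately keeps the example free of division-by-zero handling for very short outputs.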

Bleu score formula

The formula for the combined Bleu score is given below:
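The figure that held this formula is not reproduced in these notes. As a hedged reconstruction, the standard definition from the Bleu literature (which may differ in detail from the simplified form on the slide) combines the n-gram precisions p_n for n = 1, …, 4 with a brevity penalty BP. The symbols c (length of the MT output) and r (length of the reference) are introduced here for illustration:

```latex
\mathrm{Bleu} = \mathrm{BP}\cdot\exp\!\Big(\tfrac{1}{4}\sum_{n=1}^{4}\log p_n\Big),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r,\\
\exp\!\big(1 - r/c\big) & \text{otherwise.}
\end{cases}
```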

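Question: when the MT output is short, BP looks large; wouldn't that push up the score of a short MT output even further? (The formula above may only be meant as intuition; see the literature for details.) In the standard definition the opposite holds: for an output of length c shorter than the reference length r, BP = exp(1 - r/c) is below 1 and shrinks as c decreases, so short outputs are penalized.

note that: besides evaluating the precision of translation sentences generated by a machine translation algorithm (based on references), the Bleu score can also evaluate other text generation algorithms, such as image captioning.

It is worth mentioning that in practice few people implement the Bleu score from scratch; they usually take an existing open-source implementation and plug it into their own system as the evaluation metric.

question: does the Bleu score need to be trained?

Personal understanding: given data (French, reference), the trained translation algorithm translates the French into English, and the accuracy of the translation is computed against the reference. Seen this way, using the Bleu score only requires feeding data to the already trained algorithm and computing the score; no separate "Bleu score algorithm" has to be trained.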

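3.7-3.8 Attention model intuition

the downside of previous machine translation model

Before explaining the attention model, let us look at the downside of the machine translation model described above, as shown in the figure below:

On short sentences the previous machine translation model achieves a good Bleu score, but the score drops as sentences get longer. The reason is that the previous model can only memorize sentences of limited length: when the sentence is too long, the accuracy of the decoding part degrades markedly over time, so the later outputs, up to yhat<Ty>, are badly hurt. To address this, the attention model is introduced. It lets each output word be generated from a short stretch of the sentence rather than the whole sentence (concretely, each input word gets an attention weight, which highlights the input words that matter for the current output word and plays down the input words far from it). The principle of the attention model is explained in the next part.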

attention model

1）attention model intuition

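As shown in the figure below:

The attention model consists of 2 RNN layers. The lower layer is a BRNN that computes features for the input words. In the attention model, each BRNN time step is associated with 3 things: the forward and backward activations, which together form a<t> = (a-><t>, a<-<t>), and the attention weight alpha<t’,t>. Here t is the index of the input word and t’ the index of the output word.

The upper RNN of the attention model outputs the word y<t’>. The computation of y<t’> depends on alpha<t’,t>, a<t>, s<t’-1> and y<t’-1>, where s<t’-1> is the activation of layer t’-1 in the upper RNN and y<t’-1> is fed in as the input x<t’> of layer t’. The products alpha<t’,t> * a<t> determine which part of the sentence y<t’> depends on (i.e. how much attention is paid to input word t).

In short, the intuition behind the attention model is: each output word is determined by a part of the sentence rather than by the whole sentence.

The computation of the attention model's parameters is covered in the next part.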

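2）attention model in detail

2.1）This part looks at how the inputs of each layer (time step) of the upper RNN are computed. Taking layer 2 as an example, it has the following 3 inputs:

S<1>: the activation of layer 1;

y<1> as input x<2>: the output word vector of layer 1 (word embedding);

c<2> = sum(alpha<2,t> * a<t>), t = 1, 2, …, Tx (t is the index of the input word).

Feeding these 3 inputs into the activation function gives S<2>. Applying the output function to S<2> gives the output probability vector; the word with the highest probability is taken as the output word.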
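The context vector c<2> is just a weighted sum of the BRNN activations. A small sketch in plain Python, assuming the attention weights are already known and using made-up 2-dimensional activations (the function name is hypothetical):

```python
def context_vector(alphas, activations):
    """c = sum over t of alpha<t> * a<t>, computed per dimension."""
    dim = len(activations[0])
    return [
        sum(alpha * a[i] for alpha, a in zip(alphas, activations))
        for i in range(dim)
    ]


# Made-up activations a<1>, a<2>, a<3> for a sentence with Tx = 3.
a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Made-up attention weights alpha<2,1>, alpha<2,2>, alpha<2,3>; they sum to 1.
alpha_2 = [0.5, 0.25, 0.25]

c_2 = context_vector(alpha_2, a)
print(c_2)  # [0.75, 0.5]
```

Because the weights sum to 1, c<2> is a weighted average of the activations, dominated by the input positions with the largest weights.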

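2.2）This part explains how the attention weights are computed, as shown in the figure below:

note that: in the formula in the figure below, the roles of t and t’ are the reverse of those in the preceding parts (t now indexes the output word and t’ the input word).

As the figure shows, the attention weights are computed by a neural network with one hidden layer: alpha<t,t’> is the softmax of e<t,t’> over t’, and e<t,t’> is produced by the neural network, whose inputs are s<t-1> and a<t’>. The weights of this network are learned by backward propagation together with the other parameters of the attention model. The formula for alpha is shown in the figure: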
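A minimal sketch of this computation in plain Python, using the notation of this part (t for the output step, t’ for the input position). The scoring network here, a single tanh hidden layer followed by a dot product, and all weight values are assumptions for illustration; the notes do not specify the network beyond "one hidden layer".

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def energy(s_prev, a_tp, W, v):
    """e<t,t'> from a one-hidden-layer scorer: v . tanh(W [s<t-1>; a<t'>])."""
    x = s_prev + a_tp  # concatenate the two input vectors
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]
    return sum(vi * hi for vi, hi in zip(v, hidden))


def attention_weights(s_prev, activations, W, v):
    """alpha<t,t'> for every input position t', normalized with softmax."""
    return softmax([energy(s_prev, a_tp, W, v) for a_tp in activations])


# Made-up values: 1-dim decoder state, three 2-dim encoder activations.
s_prev = [0.1]
activations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[0.5, -0.3, 0.8], [0.1, 0.9, -0.4]]  # hypothetical hidden-layer weights
v = [1.0, -1.0]                           # hypothetical output weights

weights = attention_weights(s_prev, activations, W, v)
print(weights, sum(weights))
```

The softmax guarantees that the weights for one output step are positive and sum to 1, whatever the scoring network produces.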

Attention example

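3.9 Speech recognition

In traditional speech recognition, feeding an audio clip into the speech recognition algorithm does not output a text transcript directly; it only yields phonemes, which are then assembled into the text transcript.

With the development of deep learning, the transcript can be output directly from the input audio clip.

In speech recognition, the audio clip has to be preprocessed before it is fed to the algorithm, as shown on the left-hand side of the figure: the audio clip is turned into a spectrogram (horizontal axis: time; vertical axis: frequency; colour: the intensity of energy).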

This section introduces 2 ways of implementing speech recognition:

speech recognition with the attention model

In the attention model, each output step emits one character.

The input is, presumably, the spectrogram.

CTC cost for speech recognition

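In this model, 1000 inputs produce 1000 outputs.

In speech recognition the number of inputs is larger than the number of characters, so not every input can correspond to a meaningful output. To solve this, CTC allows outputs to repeat, e.g. ttt, and also allows a blank output, written _, as shown in the figure below:

When the output is ttt_h_eee___[space]_qqq…, repeated characters not separated by a blank are collapsed and the blanks are dropped, which first gives "the"; the space that follows tells us that q begins the next word.

This model structure can likewise be used for speech recognition.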
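The collapsing rule can be sketched in a few lines of Python. This only illustrates how a frame-level output is turned into text, not the CTC cost itself; the function name is hypothetical, `_` stands for the blank and an ordinary space character stands for the [space] token.

```python
from itertools import groupby

BLANK = "_"


def ctc_collapse(frames):
    """Merge runs of identical symbols, then drop the blanks."""
    merged = [symbol for symbol, _ in groupby(frames)]
    return "".join(symbol for symbol in merged if symbol != BLANK)


print(ctc_collapse("ttt_h_eee___ _qqq"))  # the q
print(ctc_collapse("hh_e_ll_ll_oo"))      # hello
```

The second call shows why the blank is needed: without a blank between the two runs of l, they would be merged into a single l.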

3.10 Trigger word detection

what is trigger word detection

When you say the trigger word, the machine wakes up;

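Building a trigger word detection algorithm with an RNN

As shown in the figure below:

The input is the spectrogram features (computed from the audio clip). Before the model detects the trigger word its output is 0; when the trigger word is detected, the output is 1.

The targets in the training data contain many 0s and few 1s. To ease this imbalance, the output can be set to 1 for a stretch of time steps after the trigger word is detected, then set back to 0 after that stretch, until the trigger word is detected again.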
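The labelling scheme can be sketched as follows. This is a minimal illustration, assuming the time steps at which each trigger word ends are known; the function name and the span of 3 steps are hypothetical values chosen for the example.

```python
def make_labels(num_steps, trigger_end_steps, ones_span=3):
    """Target sequence: 1 for ones_span steps after each trigger word, else 0."""
    labels = [0] * num_steps
    for end in trigger_end_steps:
        # Mark the steps right after the trigger word, staying inside the clip.
        for t in range(end + 1, min(end + 1 + ones_span, num_steps)):
            labels[t] = 1
    return labels


# A 12-step clip in which the trigger word ends at steps 2 and 8.
print(make_labels(12, [2, 8]))
# [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
```

A larger `ones_span` gives more positive targets per trigger word, at the cost of a less precise detection time.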
