PyTorch Learning Notes: torchtext and PyTorch, Example 4
0. PyTorch Seq2Seq Project Introduction
After working through the torchtext basics, I found this tutorial, "Understanding and implementing seq2seq models with PyTorch and torchtext". The project consists of six sub-projects:
- ~~Seq2Seq with neural networks~~
- ~~Learning phrase representations with an RNN encoder-decoder for statistical machine translation~~
- ~~NMT by jointly learning to align and translate~~
- ~~Packed padded sequences, masking and inference~~
- Convolutional Seq2Seq
- Transformer
5. Convolutional Seq2Seq
5.1 Preparing the Data
Same routine as before: use torchtext to process the German-English (Multi30k) corpus.
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from torchtext.datasets import TranslationDataset, Multi30k
from torchtext.data import Field, BucketIterator

import spacy
import random
import math
import time

SEED = 1234
random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

SRC = Field(tokenize=tokenize_de,
            init_token='<sos>',
            eos_token='<eos>',
            lower=True,
            batch_first=True)
TRG = Field(tokenize=tokenize_en,
            init_token='<sos>',
            eos_token='<eos>',
            lower=True,
            batch_first=True)

train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'), fields=(SRC, TRG))

SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BATCH_SIZE = 128
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=BATCH_SIZE,
    device=device)
```
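As a quick sanity check (my own sketch, not part of the original tutorial), you can inspect the vocabulary sizes and one batch; because `batch_first=True`, both `batch.src` and `batch.trg` come out as `[batch size, sent len]`:

```python
# Minimal sanity check of the data pipeline built above.
print(f"SRC vocab size: {len(SRC.vocab)}")  # tokens appearing >= 2 times, plus special tokens
print(f"TRG vocab size: {len(TRG.vocab)}")

batch = next(iter(train_iterator))
print(batch.src.shape)  # [batch size, src sent len], since batch_first=True
print(batch.trg.shape)  # [batch size, trg sent len]
```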
5.2 Building the Model
The tutorial includes a diagram of the model but does not explain it much, so I found another walkthrough to help interpret it.
The model still follows the overall encoder-decoder + attention framework: the encoder and decoder use the same convolutional structure, with gated linear units (GLU) as the non-linearity; the attention part uses multi-hop attention, i.e. an attention operation is performed at every convolutional layer of the decoder and its result is fed into the next layer.
- Convolutional block structure: the encoder and decoder each consist of $l$ convolutional layers; the encoder output is $z^l$ and the decoder output is $h^l$. Because a convolutional network is hierarchical, stacking layers captures relationships between words that are far apart. One "convolution + non-linearity" computation is treated as a unit, a Convolutional Block, and this unit is shared within a convolutional layer.
- A convolutional block consists of the convolution, the non-linearity, a residual connection, and the output.
- Multi-step attention: the principle is the same as conventional attention. The attention weights are determined jointly by the decoder's current output $h_i$ and all of the encoder's outputs; these weights are used to take a weighted sum of the encoder outputs, giving a vector $c_i$ that summarizes the input sentence, and $c_i$ is added to $h_i$ to form the new $h_i$:
  $$d_i^l = W_d^l h_i^l + b_d^l + g_i$$
  $$a_{ij}^l = \frac{\exp(d_i^l \cdot z_j^u)}{\sum_{t=1}^{m}\exp(d_i^l \cdot z_t^u)}$$
  $$c_i^l = \sum_{j=1}^{m} a_{ij}^l (z_j^u + e_j)$$
  Attention is computed at every convolutional layer and its result is fed into the next layer; this is the multi-hop attention mechanism. The benefit is that when computing the next attention step, the model can take into account the words it has already attended to. A small sketch of the GLU and of this attention computation follows after this list.
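To make the GLU and the attention weighting concrete, here is a small stand-alone sketch (my own illustration, not code from the tutorial; all tensor sizes are made up for the demo):

```python
import torch
import torch.nn.functional as F

# --- GLU: split the 2*hid_dim channels into A and B, output A * sigmoid(B) ---
hid_dim = 4
conved = torch.randn(1, 2 * hid_dim, 7)      # [batch, 2*hid dim, sent len]
a, b = conved.chunk(2, dim=1)                # two halves along the channel dimension
manual_glu = a * torch.sigmoid(b)            # gated linear unit, computed by hand
assert torch.allclose(manual_glu, F.glu(conved, dim=1))

# --- Multi-step attention for one decoder layer (shapes only, random data) ---
emb_dim, src_len, trg_len = 6, 5, 3
d = torch.randn(1, trg_len, emb_dim)         # decoder state d_i^l
z = torch.randn(1, src_len, emb_dim)         # encoder conved output z_j^u
e = torch.randn(1, src_len, emb_dim)         # source token + position embedding e_j

energy = torch.matmul(d, z.permute(0, 2, 1)) # [batch, trg len, src len]
attention = F.softmax(energy, dim=2)         # a_{ij}^l, each row sums to 1
c = torch.matmul(attention, z + e)           # c_i^l = sum_j a_{ij}^l (z_j^u + e_j)
print(attention.shape, c.shape)              # [1, 3, 5] and [1, 3, 6]
```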
5.2.1 Encoder
```python
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, kernel_size, dropout, device):
        super(Encoder, self).__init__()

        assert kernel_size % 2 == 1, "Kernel size must be odd!"

        self.input_dim = input_dim
        self.emb_dim = emb_dim
        self.hid_dim = hid_dim
        self.kernel_size = kernel_size
        self.device = device

        self.scale = torch.sqrt(torch.FloatTensor([0.5])).to(device)

        # token embedding
        self.tok_embedding = nn.Embedding(input_dim, emb_dim)
        # positional embedding (positions 0..99)
        self.pos_embedding = nn.Embedding(100, emb_dim)

        self.emb2hid = nn.Linear(emb_dim, hid_dim)
        self.hid2emb = nn.Linear(hid_dim, emb_dim)

        self.convs = nn.ModuleList([nn.Conv1d(in_channels=hid_dim,
                                              out_channels=2 * hid_dim,
                                              kernel_size=kernel_size,
                                              padding=(kernel_size - 1) // 2)
                                    for _ in range(n_layers)])

        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        # src = [batch size, src sent len]

        # build the position tensor: same batch size and length as src
        pos = torch.arange(0, src.shape[1]).unsqueeze(0).repeat(src.shape[0], 1).to(self.device)
        # pos = [batch size, src sent len]

        # embed both the tokens and the positions
        tok_embedded = self.tok_embedding(src)
        pos_embedded = self.pos_embedding(pos)
        # tok_embedded = pos_embedded = [batch size, src sent len, emb dim]

        # combine token and positional embeddings
        embedded = self.dropout(tok_embedded + pos_embedded)

        # project the embeddings from emb_dim to hid_dim
        conv_input = self.emb2hid(embedded)
        # conv_input = [batch size, src sent len, hid dim]

        # permute so the channel dimension comes second, as Conv1d expects
        conv_input = conv_input.permute(0, 2, 1)
        # conv_input = [batch size, hid dim, src sent len]

        for i, conv in enumerate(self.convs):
            conved = conv(self.dropout(conv_input))
            # conved = [batch size, 2*hid dim, src sent len]

            conved = F.glu(conved, dim=1)
            # conved = [batch size, hid dim, src sent len]

            # residual connection
            conved = (conved + conv_input) * self.scale
            # conved = [batch size, hid dim, src sent len]

            conv_input = conved

        # permute back and project the last dimension from hid_dim to emb_dim
        conved = self.hid2emb(conved.permute(0, 2, 1))
        # conved = [batch size, src sent len, emb dim]

        combined = (conved + embedded) * self.scale

        return conved, combined
```
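A quick shape check for the encoder (my own sketch; the hyperparameter values are placeholders, not necessarily the ones the tutorial settles on):

```python
# Hypothetical hyperparameters, just to exercise the Encoder's forward pass.
INPUT_DIM = len(SRC.vocab)
EMB_DIM = 256
HID_DIM = 512
ENC_LAYERS = 10
ENC_KERNEL_SIZE = 3
ENC_DROPOUT = 0.25

enc = Encoder(INPUT_DIM, EMB_DIM, HID_DIM, ENC_LAYERS, ENC_KERNEL_SIZE, ENC_DROPOUT, device).to(device)

batch = next(iter(train_iterator))
with torch.no_grad():
    conved, combined = enc(batch.src)   # z^u and z^u + e from the discussion above

print(conved.shape)    # [batch size, src sent len, emb dim]
print(combined.shape)  # [batch size, src sent len, emb dim]
```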
5.2.2 Decoder
The Decoder includes the attention mechanism. Looking at the code, apart from the added attn_hid2emb and attn_emb2hid layers, the rest is similar to the Encoder.
```python
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, kernel_size, dropout, pad_idx, device):
        super(Decoder, self).__init__()

        self.output_dim = output_dim
        self.emb_dim = emb_dim
        self.hid_dim = hid_dim
        self.kernel_size = kernel_size
        self.pad_idx = pad_idx
        self.device = device

        self.scale = torch.sqrt(torch.FloatTensor([0.5])).to(device)

        self.tok_embedding = nn.Embedding(output_dim, emb_dim)
        self.pos_embedding = nn.Embedding(100, emb_dim)

        self.emb2hid = nn.Linear(emb_dim, hid_dim)
        self.hid2emb = nn.Linear(hid_dim, emb_dim)

        self.attn_hid2emb = nn.Linear(hid_dim, emb_dim)
        self.attn_emb2hid = nn.Linear(emb_dim, hid_dim)

        self.out = nn.Linear(emb_dim, output_dim)

        # no padding here: the decoder pads manually (and only on the left) in forward()
        self.convs = nn.ModuleList([nn.Conv1d(hid_dim, 2 * hid_dim, kernel_size)
                                    for _ in range(n_layers)])

        self.dropout = nn.Dropout(dropout)

    def calculate_attention(self, embedded, conved, encoder_conved, encoder_combined):
        # project the decoder's conved output back to emb_dim
        conved_emb = self.attn_hid2emb(conved.permute(0, 2, 1))
        # conved_emb = [batch size, trg sent len, emb dim]

        combined = (embedded + conved_emb) * self.scale
        # combined = [batch size, trg sent len, emb dim]

        energy = torch.matmul(combined, encoder_conved.permute(0, 2, 1))
        # energy = [batch size, trg sent len, src sent len]

        attention = F.softmax(energy, dim=2)
        # attention = [batch size, trg sent len, src sent len]

        attended_encoding = torch.matmul(attention, (encoder_conved + encoder_combined))
        # attended_encoding = [batch size, trg sent len, emb dim]

        # project back from emb_dim to hid_dim before the residual connection
        attended_encoding = self.attn_emb2hid(attended_encoding)
        # attended_encoding = [batch size, trg sent len, hid dim]

        attended_combined = (conved + attended_encoding.permute(0, 2, 1)) * self.scale
        # attended_combined = [batch size, hid dim, trg sent len]

        return attention, attended_combined

    def forward(self, trg, encoder_conved, encoder_combined):
        # trg = [batch size, trg sent len]
        # encoder_conved = encoder_combined = [batch size, src sent len, emb dim]

        pos = torch.arange(0, trg.shape[1]).unsqueeze(0).repeat(trg.shape[0], 1).to(self.device)
        # pos = [batch size, trg sent len]

        tok_embedded = self.tok_embedding(trg)
        pos_embedded = self.pos_embedding(pos)
        # tok_embedded = pos_embedded = [batch size, trg sent len, emb dim]

        embedded = self.dropout(tok_embedded + pos_embedded)
        # embedded = [batch size, trg sent len, emb dim]

        conv_input = self.emb2hid(embedded)
        # conv_input = [batch size, trg sent len, hid dim]

        conv_input = conv_input.permute(0, 2, 1)
        # conv_input = [batch size, hid dim, trg sent len]

        for i, conv in enumerate(self.convs):
            conv_input = self.dropout(conv_input)

            # pad on the left so the decoder can't "cheat" by seeing future tokens
            padding = torch.zeros(conv_input.shape[0],
                                  conv_input.shape[1],
                                  self.kernel_size - 1).fill_(self.pad_idx).to(self.device)
            padded_conv_input = torch.cat((padding, conv_input), dim=2)

            conved = conv(padded_conv_input)
            # conved = [batch size, 2*hid dim, trg sent len]

            conved = F.glu(conved, dim=1)
            # conved = [batch size, hid dim, trg sent len]

            attention, conved = self.calculate_attention(embedded, conved,
                                                         encoder_conved, encoder_combined)
            # attention = [batch size, trg sent len, src sent len]
            # conved = [batch size, hid dim, trg sent len]

            # residual connection
            conved = (conved + conv_input) * self.scale

            conv_input = conved

        conved = self.hid2emb(conved.permute(0, 2, 1))
        # conved = [batch size, trg sent len, emb dim]

        output = self.out(self.dropout(conved))
        # output = [batch size, trg sent len, output dim]

        return output, attention
```
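A tiny sketch (my own, not from the tutorial) of why the left padding of `kernel_size - 1` positions keeps the decoder causal: with a kernel of size 3 and two padded positions on the left, the output at position t only depends on inputs up to position t.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

kernel_size = 3
conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=kernel_size)

x = torch.randn(1, 1, 5)                  # [batch, channels, trg sent len]
pad = torch.zeros(1, 1, kernel_size - 1)  # left padding only
out = conv(torch.cat((pad, x), dim=2))    # output has the same length as x

# Perturbing a "future" position does not change earlier outputs.
x2 = x.clone()
x2[:, :, 4] = 100.0                       # change the last time step
out2 = conv(torch.cat((pad, x2), dim=2))
print(torch.allclose(out[:, :, :4], out2[:, :, :4]))  # True: positions 0..3 unchanged
```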
5.2.3 Putting It Together: Seq2Seq
```python
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()

        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg):
        # src = [batch size, src sent len]
        # trg = [batch size, trg sent len]

        # calculate z^u (encoder_conved) and e (encoder_combined)
        # encoder_conved is the output from the final encoder conv. block
        # encoder_combined is encoder_conved plus (elementwise) the src token and positional embeddings
        encoder_conved, encoder_combined = self.encoder(src)
        # encoder_conved = [batch size, src sent len, emb dim]
        # encoder_combined = [batch size, src sent len, emb dim]

        # calculate predictions of the next words
        # output is a batch of predictions for each word in the trg sentence
        # attention is a batch of attention scores over the src sentence for each word in the trg sentence
        output, attention = self.decoder(trg, encoder_conved, encoder_combined)
        # output = [batch size, trg sent len, output dim]
        # attention = [batch size, trg sent len, src sent len]

        return output, attention
```
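To wire everything together, a minimal instantiation and shape-check sketch (my own; the hyperparameter values are placeholders, not necessarily the ones the tutorial uses for training):

```python
# Placeholder hyperparameters for a quick end-to-end shape check.
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 10
KERNEL_SIZE = 3
DROPOUT = 0.25
PAD_IDX = TRG.vocab.stoi['<pad>']   # '<pad>' is torchtext's default pad token

enc = Encoder(INPUT_DIM, EMB_DIM, HID_DIM, N_LAYERS, KERNEL_SIZE, DROPOUT, device)
dec = Decoder(OUTPUT_DIM, EMB_DIM, HID_DIM, N_LAYERS, KERNEL_SIZE, DROPOUT, PAD_IDX, device)
model = Seq2Seq(enc, dec, device).to(device)

print(f'The model has {sum(p.numel() for p in model.parameters() if p.requires_grad):,} trainable parameters')

batch = next(iter(train_iterator))
with torch.no_grad():
    # feed trg without its last token; the model predicts the next token at each position
    output, attention = model(batch.src, batch.trg[:, :-1])
print(output.shape)     # [batch size, trg sent len - 1, output dim]
print(attention.shape)  # [batch size, trg sent len - 1, src sent len]
```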