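Text Classification Based on Deep Learning
word2vec
The basic idea of the word2vec model is to predict words that occur in a shared context. For each input text, it selects a context window and a center word, then predicts the probability of the other words in the window from that center word. As a result, a word2vec model can easily learn vector representations of new words from newly added corpus data.
The main idea of word2vec is to have a word and its context predict each other. The two corresponding algorithms are:
- Skip-gram (SG): predicts the context
- Continuous Bag of Words (CBOW): predicts the target word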
![](https://img.laitimes.com/img/9ZDMuAjOiMmIsIjOiQnIsIyN2UmNwcTZykzMyMzNzQzM1MzMzAzMxMjMzQzMxMzNzAzMwMjMzAzMyMjZyUmNzYTZycjNkZTO2UmN0YzM3MjNlJzN2YmNjZjM2QmM3YDZ2kjNmJjZyE2MzcDM3QzN0cDO28CXiN2MmFTYlJzMxMmM2YmNlZTYwkjNmdjN5ITMzUWM3IGM0UTO2IzNl9CXt92YuQnblRnbvNmclNXdiVHa0l2Zu8WbhN2Lc9CX6MHc0RHaiojIsJye.jpg)
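Intuitively, Skip-gram predicts the context from a given input word, whereas CBOW predicts the input word from a given context.
word2vec consists of two parts: the first builds the model, and the second obtains word vectors from it. The overall modeling process resembles an auto-encoder: a neural network is built from the training data, and once it is trained, the weight matrix of its hidden layer is taken as the "word vectors".
In addition, word2vec proposes two more efficient training methods:
- Hierarchical softmax
- Negative sampling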
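Skip-gram principle and network structure
The Skip-gram procedure
Suppose we have the sentence "the dog barked at the mailman".
- First choose one word as the input word, for example "dog"
- Define the parameter skip_window, the number of words taken from one side (left or right) of the current input word. With skip_window=2, up to 2 words on the left and 2 words on the right enter the window, so the window holds at most 2 × 2 = 4 context words. Because "dog" has only one word to its left, the words in the window (including the input word) are ['the', 'dog', 'barked', 'at']. A second parameter, num_skips, is the number of distinct words chosen from the window as output words. With skip_window=2 and num_skips=2 we obtain two training pairs of the form (input word, output word), namely ('dog', 'barked') and ('dog', 'the')
- Train the neural network on these pairs so that it outputs a probability distribution giving, for every word in the vocabulary, how likely it is to be an output word of the input word. The previous step, with skip_window=2 and num_skips=2, produced two training pairs. If we first train on ('dog', 'barked'), the model learns from this sample and tells us, for each word in the vocabulary, how likely it is to be the output word when 'dog' is the input word
In other words, the output probabilities express how likely each vocabulary word is to occur together with the input word. For example, if the word "Soviet" is fed into the network, related words such as "Union" and "Russia" receive far higher output probabilities than unrelated words such as "watermelon" and "kangaroo", because "Union" and "Russia" are much more likely to appear within the window of "Soviet" in text.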
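Take the sentence "The quick brown fox jumps over the lazy dog" with a window size of 2 (window_size=2), meaning that the input word is paired only with the two words before it and the two words after it. In the accompanying figure, blue boxes mark the input word and green boxes mark the words inside the window.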
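The pairing described above can be sketched in a few lines. This is a minimal illustration rather than word2vec's own code: the helper name `skipgram_pairs` is hypothetical, and it emits every word in the window instead of sampling num_skips of them.

```python
def skipgram_pairs(tokens, window_size=2):
    """Return (input word, output word) pairs for every position in tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clip the window at the sentence boundaries.
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(sentence, window_size=2)
fox_pairs = [p for p in pairs if p[0] == "fox"]
print(pairs[:2])   # pairs produced by the first word
print(fox_pairs)   # pairs produced by "fox"
```

Words near the sentence boundary produce fewer pairs because part of their window falls outside the sentence.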
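Both the input word and the output word are one-hot encoded. A one-hot vector is zero in almost every dimension (exactly one position is 1), so it is extremely sparse, and multiplying it by the weight matrix directly would waste a great deal of computation. For efficiency, the model simply selects the row of the matrix whose index is the position of the 1.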
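A small sketch can confirm that multiplying a one-hot vector by the weight matrix gives the same result as reading out one row. The 5-word vocabulary, the 3-dimensional vectors and the helper names below are made up for illustration.

```python
def one_hot(index, size):
    """One-hot vector with a single 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_mat(vec, matrix):
    """Multiply a row vector by a matrix stored as a list of rows."""
    cols = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(cols)]


# Hypothetical input-to-hidden weights: 5 words, 3 dimensions per word.
W = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
]
word_index = 3

dense_result = vec_mat(one_hot(word_index, len(W)), W)  # full multiplication
lookup_result = W[word_index]                           # row lookup only
print(dense_result == lookup_result)
```

The lookup touches 3 numbers, while the full product touches all 15, which is why the model uses the lookup.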
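Skip-gram training
As described above, the word2vec model is a very large neural network (its weight matrices are huge). For example, with a vocabulary of 10000 words and 300-dimensional word vectors, the input-to-hidden weight matrix and the hidden-to-output weight matrix each contain 10000 x 300 = 3 million weights. Gradient descent on a network of this size is slow. Worse, a large amount of training data is needed to tune these weights and avoid overfitting.
Solutions:
- Treat common word pairs or phrases as single "words"
- Subsample high-frequency words to reduce the number of training samples
- Optimize with **"negative sampling", so that each training sample updates only a small part of the model weights**, which lowers the computational load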
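Word pairs and "phrases"
Some word combinations (or phrases) mean something entirely different from their parts. For example, "Boston Globe" is the name of a newspaper, a meaning that the single words "Boston" and "Globe" cannot express. So whenever "Boston Globe" appears in a text, it should be treated as one word with its own word vector rather than being split. Other examples are "New York" and "United States".
Subsampling high-frequency words
Using the earlier example "The quick brown fox jumps over the lazy dog", a common high-frequency word such as "the" causes two problems:
- A training pair such as ("fox", "the") tells us little about the meaning of "fox", because "the" appears in the context of almost every word
- Because common words such as "the" occur so often in text, we get a huge number of ("the", ...) training pairs, far more than are needed to learn the word vector of "the"
word2vec handles high-frequency words by subsampling. The basic idea is that every word encountered in the raw training text has some probability of being deleted from the text, and this probability depends on the frequency of the word.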
For a word $w$, let $Z(w)$ be the frequency of $w$ in the whole corpus, expressed as a fraction of all word occurrences. The probability $P(w)$ of keeping $w$ is defined as:
$$P(w) = \left( \sqrt{\frac{Z(w)}{0.001}} + 1 \right) \times \frac{0.001}{Z(w)}$$
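To get a feel for the formula, the sketch below evaluates it for a few frequencies. The function name `keep_probability` and the sample frequencies are chosen for illustration only. The constant 0.001 is the value of the `sample` parameter.

```python
import math


def keep_probability(z, sample=0.001):
    """P(w) from the formula above; z is the fraction of the corpus made up by w."""
    return (math.sqrt(z / sample) + 1) * (sample / z)


for z in (0.0026, 0.00746, 0.05, 1.0):
    # Values above 1 simply mean the word is always kept.
    p = min(1.0, keep_probability(z))
    print("Z(w) = {:<8} keep with probability {:.3f}".format(z, p))
```

Under this formula, a word that makes up about 0.26% of the corpus or less is always kept, and one at about 0.75% is kept roughly half the time. The more frequent a word is, the more often it is dropped.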
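Negative sampling
Training a neural network means feeding in training samples and repeatedly adjusting the neuron weights so that predictions of the target become more accurate. Every time the network is trained on one sample, its weights are adjusted once. Because the size of the vocabulary determines the size of the Skip-gram weight matrices, all of these weights have to be adjusted over hundreds of millions of training samples, which consumes a great deal of computation and makes training very slow in practice.
Negative sampling addresses this problem. It speeds up training and also improves the quality of the resulting word vectors. Instead of updating all weights for every training sample, negative sampling lets each sample update only a small part of them, which reduces the computation in gradient descent. For example, when we train the network on the sample (input word: "fox", output word: "quick"), both "fox" and "quick" are one-hot encoded. With a vocabulary of 10000 words, the expected output of the output-layer neuron for "quick" is 1 and the expected output of the other 9999 neurons is 0. The words corresponding to these 9999 neurons with expected output 0 are called "negative" words.
With negative sampling, we randomly select a small number of negative words (say 5) and update only their weights, and we also update the weights of the "positive" word ("quick" in the example above). In the paper, the authors note that 5-20 negative words work well for small datasets, while 2-5 are enough for large datasets. word2vec selects negative words using a unigram distribution. The probability that a word is chosen as a negative sample depends on its frequency: the more often a word occurs, the more likely it is to be chosen. The probability of each word being selected as a negative word is: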
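$$P(w) = \frac{Z(w)^{3/4}}{\sum_i Z(w_i)^{3/4}}$$
Here $Z(w)$ is the frequency of word $w$, and the exponent $3/4$ was chosen empirically.
In the code implementation of negative sampling, the unigram table is an array of 100 million elements filled with the indices of the vocabulary words. The array contains repeats, so some words appear many times. The number of times a word's index appears in the array equals its negative-sampling probability $\times 10^8$. Given this table, drawing a negative sample only requires generating a random integer from $0$ up to (but not including) $10^8$ and taking the word stored at that position. The larger a word's negative-sampling probability, the more often it appears in the table and the more likely it is to be selected.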
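The table lookup can be sketched at toy scale. The version below is an illustration only, not the original implementation: the table has 900 slots instead of $10^8$, the word counts are made up, and `build_unigram_table` is a hypothetical helper.

```python
import random


def build_unigram_table(counts, table_size=900, power=0.75):
    """Fill a table in which each word appears in proportion to count ** power."""
    words = sorted(counts)
    weights = [counts[w] ** power for w in words]
    total = sum(weights)
    table = []
    for word, weight in zip(words, weights):
        table.extend([word] * round(weight / total * table_size))
    return table


counts = {"the": 16, "fox": 1}      # made-up word counts
table = build_unigram_table(counts)

rng = random.Random(0)
negatives = [table[rng.randrange(len(table))] for _ in range(5)]
print(table.count("the"), table.count("fox"))
print(negatives)
```

With these counts, "the" accounts for 16/17 (about 94%) of the raw occurrences but only 8/9 (about 89%) of the table, so the 3/4 power shifts some probability from frequent words to rare ones.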
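Hierarchical Softmax
The definition of softmax was discussed in the previous article. The structure of hierarchical softmax is discussed below.
Huffman tree
- Input: $n$ nodes with weights $(w_1, w_2, ..., w_n)$
- Output: the corresponding Huffman tree
The Huffman tree is built as follows:
- Treat $(w_1, w_2, ..., w_n)$ as a forest of $n$ trees, each consisting of a single node
- Select the two trees in the forest whose roots have the smallest weights and merge them into a new tree, with the two trees as its left and right subtrees. The weight of the new root is the sum of the root weights of the two subtrees
- Remove those two trees from the forest and add the new tree to the forest
- Repeat steps 2 and 3 until only one tree remains in the forest
A concrete example illustrates the construction. Suppose there are 6 nodes (a, b, c, d, e, f) with weights (16, 4, 8, 6, 20, 3). First the two smallest, b and f, are merged into a new tree whose root has weight 7. The forest now contains 5 trees with root weights 16, 8, 6, 20 and 7. Next the two smallest roots, 6 and 7, are merged into a new subtree with root weight 13. Continuing in the same way, 8 and 13 merge into 21, 16 and 20 merge into 36, and finally 21 and 36 merge into the root of the Huffman tree, with weight 57.
Next, the Huffman tree is encoded. Leaves with high weights lie close to the root and leaves with low weights lie far from it, so high-weight nodes receive shorter codes and low-weight nodes receive longer ones. This minimizes the weighted path length of the tree and agrees with the information-theoretic principle that frequent words should have shorter codes. For every node of a Huffman tree except the root, we can adopt the convention that a left child is encoded as 0 and a right child as 1. For the tree above, with the lower-weight subtree placed on the left, the code of c is 00.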
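The construction and encoding can be reproduced with a priority queue. This is a minimal sketch assuming that the lighter of the two merged trees becomes the left child (code 0), and `huffman_codes` is a hypothetical helper name.

```python
import heapq
from itertools import count


def huffman_codes(weights):
    """Build a Huffman tree from {symbol: weight} and return {symbol: bit string}."""
    tiebreak = count()  # unique counter so the heap never has to compare trees
    heap = [(w, next(tiebreak), symbol) for symbol, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w_left, _, left = heapq.heappop(heap)     # smallest root weight
        w_right, _, right = heapq.heappop(heap)   # second smallest
        heapq.heappush(heap, (w_left + w_right, next(tiebreak), (left, right)))

    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):       # inner node: (left subtree, right subtree)
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                             # leaf: a symbol
            codes[tree] = prefix

    walk(heap[0][2], "")
    return codes


codes = huffman_codes({"a": 16, "b": 4, "c": 8, "d": 6, "e": 20, "f": 3})
for symbol in sorted(codes, key=lambda s: (len(codes[s]), codes[s])):
    print(symbol, codes[symbol])
```

The heaviest leaves (a, e, c) end up with 2-bit codes, while the lightest (b, f) need 4 bits.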
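The hierarchical softmax procedure
To avoid computing softmax probabilities over all words, word2vec uses a Huffman tree in place of the mapping from the hidden layer to the output softmax layer.
Building the Huffman tree:
- Build the Huffman tree from the labels and their frequencies (the more frequent a label, the shorter its path in the tree)
- Each leaf of the Huffman tree represents one label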
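One way to see how the tree replaces the softmax layer: each inner node makes a binary left/right decision with a sigmoid, and the probability of a label is the product of the decisions along its path. The sketch below makes several assumptions that are not in the text above: the codes come from the six-node example, bit 1 uses the sigmoid output and bit 0 its complement, and the node vectors and hidden vector are random.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def leaf_probability(code, hidden, node_vectors):
    """Multiply the binary decision probabilities along the path given by code."""
    prob = 1.0
    for depth, bit in enumerate(code):
        theta = node_vectors[code[:depth]]   # inner node identified by its path prefix
        p_one = sigmoid(sum(t * h for t, h in zip(theta, hidden)))
        prob *= p_one if bit == "1" else 1.0 - p_one
    return prob


# Codes of the six-node example (lighter subtree on the left).
codes = {"c": "00", "a": "10", "e": "11", "d": "010", "f": "0110", "b": "0111"}

rng = random.Random(0)
inner_nodes = sorted({code[:d] for code in codes.values() for d in range(len(code))})
node_vectors = {node: [rng.uniform(-1, 1) for _ in range(4)] for node in inner_nodes}
hidden = [rng.uniform(-1, 1) for _ in range(4)]

probs = {label: leaf_probability(code, hidden, node_vectors)
         for label, code in codes.items()}
print(len(inner_nodes), round(sum(probs.values()), 6))
```

The leaf probabilities sum to 1 without any normalization over the vocabulary, and evaluating one label needs only as many sigmoids as its code has bits.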
Word2Vec parameters
from gensim.models.word2vec import Word2Vec
model = Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5,
                 max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
                 sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
                 trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH)
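- sentences: can be a list; for large corpora, BrownCorpus, Text8Corpus or LineSentence is recommended
- size: dimensionality of the feature vectors, default 100 (renamed vector_size in gensim 4.0)
- alpha: initial learning rate, default 0.025; it decays linearly to min_alpha during training
- window: window size, the maximum distance between the current word and the predicted word within a sentence
- min_count: truncates the vocabulary; words that occur fewer than min_count times are discarded, default 5
- max_vocab_size: RAM limit while building the vocabulary; None means no limit
- sample: threshold for randomly downsampling high-frequency words, default 1e-3; the useful range is (0, 1e-5)
- seed: seed for the random number generator; it affects the initialization of the word vectors
- workers: number of worker threads used for training
- min_alpha: minimum learning rate
- sg: training algorithm; 0 (default) selects CBOW, 1 selects skip-gram
- hs: if 1, hierarchical softmax is used; if 0 (default) and negative is non-zero, negative sampling is used
- negative: if > 0, negative sampling is used and this value sets the number of noise words (usually 5-20)
- cbow_mean: if 0, the sum of the context word vectors is used; if 1 (default), their mean; only applies when CBOW is used
- hashfxn: hash function used to initialize the weights; defaults to Python's built-in hash function
- iter: number of iterations over the corpus, default 5 (renamed epochs in gensim 4.0)
- trim_rule: vocabulary trimming rule that specifies which words to keep and which to discard; can be None (min_count is then used)
- sorted_vocab: if 1 (default), words are sorted by descending frequency before word indices are assigned
- batch_words: number of words in each batch passed to the worker threads, default 10000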
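Choosing and comparing parameters:
- Skip-gram trains slowly but works well for rare words; CBOW trains fast. Skip-gram is the usual choice
- Training method: hierarchical softmax favors rare words; negative sampling favors frequent words and low-dimensional vectors
- Subsampling frequent words can improve both accuracy and speed (values from 1e-3 to 1e-5)
- Window size: usually around 10 for Skip-gram and around 5 for CBOW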
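Training word vectors with word2vec
from gensim.models.word2vec import Word2Vec
sentences = train_df['text'].apply(lambda x: [str(i) for i in x.split(' ')])
model = Word2Vec(sentences, size=300)  # gensim 3.x API; the argument is vector_size in gensim >= 4.0
model.save("test_01.model")  # save the model for later reuse
print('Vocabulary:')
print(model.wv.index2word)  # index_to_key in gensim >= 4.0
# Find the words most similar to a given word
print('\nWords similar to 57:')
print(model.wv.most_similar('57'))
# Pick the word that does not belong with the others
print('\nThe odd one out in 4464 486 6352 5619 2465 4802 1452 is:')
print(model.wv.doesnt_match('4464 486 6352 5619 2465 4802 1452'.split()))
# Similarity between two words
print('\nSimilarity between 1124 and 2000:')
print(model.wv.similarity('1124', '2000'))
# Similarity between two sets of words
print('\nSimilarity between the two texts:')
print(model.wv.n_similarity('4464 486 6352 5619 2465 4802 1452 3137 5778'.split(),
                            '3646 3055 3055 2490 4659 6065 3370 5814 2465'.split()))
Load the model:
import gensim
model = gensim.models.Word2Vec.load('test_01.model')
Add more training data:
model = gensim.models.Word2Vec.load('test_01.model')
# more_sentences: new tokenized sentences in the same format as sentences above
model.build_vocab(more_sentences, update=True)  # add previously unseen words to the vocabulary
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.epochs)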
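TextCNN
TextCNN uses a CNN (convolutional neural network) to extract text features. Convolution kernels of different sizes extract n-gram features, max-pooling keeps the largest value of each resulting feature map, and the pooled values are concatenated into one vector that represents the text. Based on the settings of the original TextCNN paper, we use 100 kernels for each of the sizes 2, 3 and 4, so the final text vector has 100*3=300 dimensions.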
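The shape arithmetic can be checked with a small pure-Python sketch. It is an illustration under simplifying assumptions (random kernels and embeddings, no bias or activation, hypothetical helper names), not the TensorFlow implementation given later.

```python
import random


def conv_max_feature(embedded, kernel):
    """Slide one kernel (k x d weights) over the text, then max-pool the responses."""
    k = len(kernel)
    responses = []
    for start in range(len(embedded) - k + 1):   # n - k + 1 window positions
        window = embedded[start:start + k]
        responses.append(sum(w * x
                             for kernel_row, word_vec in zip(kernel, window)
                             for w, x in zip(kernel_row, word_vec)))
    return max(responses)                        # one number per kernel


def text_vector(embedded, kernel_sizes=(2, 3, 4), kernels_per_size=100, seed=0):
    """Concatenate the pooled feature of every kernel into one text vector."""
    rng = random.Random(seed)
    dim = len(embedded[0])
    features = []
    for k in kernel_sizes:
        for _ in range(kernels_per_size):
            kernel = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(k)]
            features.append(conv_max_feature(embedded, kernel))
    return features


rng = random.Random(1)
embedded_text = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(10)]  # 10 words, 8 dims
vec = text_vector(embedded_text)
print(len(vec))
```

Because each kernel is reduced to a single pooled value, the size of the text vector depends only on the number of kernels and not on the length of the text.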
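TextRNN
TextRNN uses an RNN (recurrent neural network) to extract text features. Text is inherently a sequence, and an LSTM is naturally suited to modeling sequence data. TextRNN feeds the word vector of each word in the sentence, in order, into a two-layer bidirectional LSTM, and concatenates the hidden states at the last valid position of each direction into one vector that represents the text.
Using HAN for text classification
Hierarchical Attention Network for Document Classification (HAN) is based on hierarchical attention: it encodes at the word level and at the sentence level, uses attention to obtain the document representation, and then classifies with softmax. The word encoder produces the sentence representations; it can be replaced by the TextCNN or TextRNN from the previous section, or by BERT from the next section.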
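The attention step used at both levels can be sketched as a softmax-weighted sum. This is a simplified illustration: the hidden states and context vectors below are made-up numbers, the helper names are hypothetical, and the learned projection that HAN applies before scoring is omitted.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, context):
    """Score each state against a context vector and return the weighted sum."""
    scores = [sum(h_i * c_i for h_i, c_i in zip(h, context)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [sum(w * h[j] for w, h in zip(weights, hidden_states)) for j in range(dim)]
    return pooled, weights


# Word level: hidden states of the words in one sentence -> sentence vector.
word_states = [[0.1, 0.3], [0.9, -0.2], [0.4, 0.4]]
word_context = [1.0, 0.0]
sentence_vec, word_weights = attention_pool(word_states, word_context)

# Sentence level: sentence vectors of one document -> document vector.
sentence_states = [sentence_vec, [0.2, 0.7]]
sentence_context = [0.0, 1.0]
document_vec, sentence_weights = attention_pool(sentence_states, sentence_context)
print(word_weights, sentence_weights)
```

The same pooling function is applied twice, first over words and then over sentences, which is what makes the network hierarchical.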
TextCNN implementation
Reference code: TextCNN based on TensorFlow
Import the required libraries:
import pandas as pd
import numpy as np
import tensorflow as tf
import os
import time
import datetime
Load the data:
train_df = pd.read_csv('./data/train_set.csv', sep='\t', nrows=15000)
train_df['content'] = train_df['text'].apply(lambda x: [str(i) for i in x.split(' ')])
dfx = train_df['content']
dfy = train_df['label']
from tensorflow.contrib import learn  # tf.contrib exists only in TensorFlow 1.x
max_document_length = 500
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length, tokenizer_fn=list)  # build the vocabulary
x = np.array(list(vocab_processor.fit_transform(dfx)))  # represent each text by its vocabulary indices
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
lb.fit(dfy)
y = lb.transform(dfy)  # binarize the labels: each row looks like [0,1,0,0...0], where the position of the 1 gives the class
np.random.seed(10)
shuffle_indices = np.random.permutation(np.arange(len(y)))
x_shuffled = x[shuffle_indices]
y_shuffled = y[shuffle_indices]
dev_sample_index = -1 * int(0.1 * float(len(y)))
x_train, x_dev = x_shuffled[:dev_sample_index], x_shuffled[dev_sample_index:]  # split into training and validation data
y_train, y_dev = y_shuffled[:dev_sample_index], y_shuffled[dev_sample_index:]
del train_df, dfx, dfy, x, y, x_shuffled, y_shuffled
Build TextCNN:
# Define the TextCNN class
class TextCNN(object):
    """
    A CNN for text classification.
    Uses an embedding layer, followed by a convolutional, max-pooling and softmax layer.
    """
    def __init__(
            self, sequence_length, num_classes, vocab_size,
            embedding_size, filter_sizes, num_filters, l2_reg_lambda=0.0):
        # sequence_length: length of each (padded) text
        # num_classes: number of classes
        # vocab_size: size of the vocabulary
        # embedding_size: embedding dimension, i.e. each word becomes a vector of this length
        # filter_sizes: sizes of the convolution filters, e.g. [3, 4, 5]
        # num_filters: number of filters per filter size
        # l2_reg_lambda: weight of the L2 regularization term
        # Placeholders for input, output and dropout (they define the structure of the tf graph)
        self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")
        self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
        self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")
        # The first dimension of each placeholder shape is batch_size; None lets it take any value
        l2_loss = tf.constant(0.0)
        # Embedding layer: maps each word index to a dense vector of length embedding_size
        with tf.device('/cpu:0'), tf.name_scope("embedding"):
            self.W = tf.Variable(
                tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
                name="W")
            self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x)
            self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)
        # tf.device('/cpu:0') runs the embedding operation on the CPU
        # By default TensorFlow would place it on the GPU (if one is present), but the embedding op currently fails on the GPU
        # tf.name_scope("embedding") adds the operations to a name scope called "embedding"
        # The scope groups them under one top-level node, which gives a clean hierarchy when the network is visualized in TensorBoard
        # W is the embedding matrix learned during training, initialized from a random uniform distribution
        # tf.nn.embedding_lookup performs the actual embedding; the result is a 3-D tensor of shape [None, sequence_length, embedding_size]
        # embedded_chars_expanded has shape [None, sequence_length, embedding_size, 1]
        # Build a convolution layer followed by max-pooling for each filter size
        pooled_outputs = []  # pooled results
        for i, filter_size in enumerate(filter_sizes):  # iterate over the filter sizes
            with tf.name_scope("conv-maxpool-%s" % filter_size):
                # Convolution layer
                filter_shape = [filter_size, embedding_size, 1, num_filters]
                W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
                b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
                conv = tf.nn.conv2d(
                    self.embedded_chars_expanded,
                    W,
                    strides=[1, 1, 1, 1],
                    padding="VALID",
                    name="conv")  # tf.nn.conv2d is TensorFlow's 2-D convolution function
                # Add the bias to the convolution result and apply the ReLU activation
                h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
                # Max-pooling
                pooled = tf.nn.max_pool(
                    h,
                    ksize=[1, sequence_length - filter_size + 1, 1, 1],
                    strides=[1, 1, 1, 1],
                    padding='VALID',
                    name="pool")
                pooled_outputs.append(pooled)
        # Concatenate all pooled results
        num_filters_total = num_filters * len(filter_sizes)
        self.h_pool = tf.concat(pooled_outputs, 3)
        self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total])
        # h_pool_flat has shape [batch_size, num_filters_total]; -1 in tf.reshape lets TensorFlow infer that dimension
        # Dropout layer
        # Dropout randomly "disables" a fraction of the neurons (they take no part in the computation or the weight update), which helps prevent overfitting
        # The fraction of neurons kept is given by dropout_keep_prob, which is set to 1 (dropout disabled) for evaluation
        with tf.name_scope("dropout"):
            self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)
        # Scores and predictions
        with tf.name_scope("output"):
            W = tf.get_variable(
                "W",
                shape=[num_filters_total, num_classes],
                initializer=tf.contrib.layers.xavier_initializer())  # W holds the trained weights
            b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")  # b is the bias
            l2_loss += tf.nn.l2_loss(W)
            l2_loss += tf.nn.l2_loss(b)
            self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")  # computes X*W + b
            self.predictions = tf.argmax(self.scores, 1, name="predictions")  # the predicted class is the index of the largest score
        # Mean cross-entropy loss
        with tf.name_scope("loss"):
            losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores, labels=self.input_y)
            self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss
        # Accuracy
        with tf.name_scope("accuracy"):
            correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
            self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")
Generate batches:
# Split data into batches
# A batch holds batch_size samples; training on one batch and updating the weights is one step
# One full pass over data is one epoch, which takes num_batches_per_epoch batches
# num_epochs is the number of epochs to run
def batch_iter(data, batch_size, num_epochs, shuffle=True):
    data = np.array(data)
    data_size = len(data)
    num_batches_per_epoch = int((len(data) - 1) / batch_size) + 1
    for epoch in range(num_epochs):
        # Shuffle the data at the start of every epoch
        if shuffle:
            shuffle_indices = np.random.permutation(np.arange(data_size))
            shuffled_data = data[shuffle_indices]
        else:
            shuffled_data = data
        for batch_num in range(num_batches_per_epoch):
            start_index = batch_num * batch_size
            end_index = min((batch_num + 1) * batch_size, data_size)
            yield shuffled_data[start_index:end_index]  # yield makes this function a generator that can be consumed with next()
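The batch arithmetic can be checked on a toy list. The sketch below is a stripped-down, standard-library version of the same slicing logic (one epoch, no shuffling), and `iter_batches` is a hypothetical name.

```python
def iter_batches(data, batch_size):
    """Yield consecutive slices of data; the last batch may be smaller."""
    num_batches = (len(data) - 1) // batch_size + 1
    for batch_num in range(num_batches):
        start = batch_num * batch_size
        end = min(start + batch_size, len(data))
        yield data[start:end]


batches = list(iter_batches(list(range(10)), batch_size=4))
sizes = [len(batch) for batch in batches]
print(sizes)
```

With 10 samples and a batch size of 4 there are 3 batches per epoch, and the final batch holds the 2 leftover samples.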
Model parameters:
# Model parameters
# flags are TensorFlow's mechanism for defining and reading parameters
tf.flags.DEFINE_integer("embedding_dim", 64, "Dimensionality of character embedding (default: 64)")
tf.flags.DEFINE_string("filter_sizes", "3,4,5", "Comma-separated filter sizes (default: '3,4,5')")
tf.flags.DEFINE_integer("num_filters", 64, "Number of filters per filter size (default: 64)")
tf.flags.DEFINE_float("dropout_keep_prob", 0.5, "Dropout keep probability (default: 0.5)")
tf.flags.DEFINE_float("l2_reg_lambda", 0.0, "L2 regularization lambda (default: 0.0)")
# Training parameters
tf.flags.DEFINE_integer("batch_size", 64, "Batch Size (default: 64)")
tf.flags.DEFINE_integer("num_epochs", 20, "Number of training epochs (default: 20)")
tf.flags.DEFINE_integer("evaluate_every", 50, "Evaluate model on dev set after this many steps (default: 50)")  # evaluate on the validation set every this many steps
tf.flags.DEFINE_integer("checkpoint_every", 50, "Save model after this many steps (default: 50)")  # save the model every this many steps
tf.flags.DEFINE_integer("num_checkpoints", 5, "Number of checkpoints to store (default: 5)")  # maximum number of checkpoints to keep
# System parameters
tf.flags.DEFINE_boolean("allow_soft_placement", True, "Allow device soft device placement")
tf.flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices")
FLAGS = tf.flags.FLAGS
Define the training function:
# Define the training function
def train(x_train, y_train, vocab_processor, x_dev, y_dev):
    # Create the graph explicitly so that its resources can be released after training
    with tf.Graph().as_default():
        session_conf = tf.ConfigProto(
            allow_soft_placement=FLAGS.allow_soft_placement,
            log_device_placement=FLAGS.log_device_placement)
        sess = tf.Session(config=session_conf)
        with sess.as_default():
            cnn = TextCNN(
                sequence_length=x_train.shape[1],
                num_classes=y_train.shape[1],
                vocab_size=len(vocab_processor.vocabulary_),
                embedding_size=FLAGS.embedding_dim,
                filter_sizes=list(map(int, FLAGS.filter_sizes.split(","))),
                num_filters=FLAGS.num_filters,
                l2_reg_lambda=FLAGS.l2_reg_lambda)
            # Minimize the loss with the Adam optimizer
            global_step = tf.Variable(0, name="global_step", trainable=False)
            optimizer = tf.train.AdamOptimizer(1e-3)
            grads_and_vars = optimizer.compute_gradients(cnn.loss)
            train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)
            # Summaries of the gradients
            grad_summaries = []
            for g, v in grads_and_vars:
                if g is not None:
                    # tf.summary.histogram() records the value distribution of a tensor of any shape
                    grad_hist_summary = tf.summary.histogram("{}/grad/hist".format(v.name), g)
                    # tf.summary.scalar() records a scalar statistic
                    sparsity_summary = tf.summary.scalar("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g))
                    grad_summaries.append(grad_hist_summary)
                    grad_summaries.append(sparsity_summary)
            grad_summaries_merged = tf.summary.merge(grad_summaries)
            # Use the current timestamp to name the output directory
            timestamp = str(int(time.time()))
            out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))
            print("Writing to {}\n".format(out_dir))
            # Summaries of the loss and the accuracy
            loss_summary = tf.summary.scalar("loss", cnn.loss)
            acc_summary = tf.summary.scalar("accuracy", cnn.accuracy)
            # Train summaries
            train_summary_op = tf.summary.merge([loss_summary, acc_summary, grad_summaries_merged])
            train_summary_dir = os.path.join(out_dir, "summaries", "train")
            train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)
            # Dev summaries
            dev_summary_op = tf.summary.merge([loss_summary, acc_summary])
            dev_summary_dir = os.path.join(out_dir, "summaries", "dev")
            dev_summary_writer = tf.summary.FileWriter(dev_summary_dir, sess.graph)
            # Create the checkpoint directory
            checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints"))
            checkpoint_prefix = os.path.join(checkpoint_dir, "model")
            if not os.path.exists(checkpoint_dir):
                os.makedirs(checkpoint_dir)
            saver = tf.train.Saver(tf.global_variables(), max_to_keep=FLAGS.num_checkpoints)
            vocab_processor.save(os.path.join(out_dir, "vocab"))
            # Initialize all variables
            sess.run(tf.global_variables_initializer())
            # Training on a single step
            def train_step(x_batch, y_batch):
                """
                A single training step
                """
                feed_dict = {
                    cnn.input_x: x_batch,
                    cnn.input_y: y_batch,
                    cnn.dropout_keep_prob: FLAGS.dropout_keep_prob
                }  # values fed into the placeholders
                _, step, summaries, loss, accuracy = sess.run(
                    [train_op, global_step, train_summary_op, cnn.loss, cnn.accuracy],
                    feed_dict)
                time_str = datetime.datetime.now().isoformat()
                print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
                train_summary_writer.add_summary(summaries, step)
            # Validation on a single step
            def dev_step(x_batch, y_batch, writer=None):
                """
                Evaluates model on a dev set
                """
                feed_dict = {
                    cnn.input_x: x_batch,
                    cnn.input_y: y_batch,
                    cnn.dropout_keep_prob: 1.0
                }
                step, summaries, loss, accuracy = sess.run(
                    [global_step, dev_summary_op, cnn.loss, cnn.accuracy],
                    feed_dict)
                time_str = datetime.datetime.now().isoformat()
                print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
                if writer:
                    writer.add_summary(summaries, step)
            # Generate batches
            batches = batch_iter(
                list(zip(x_train, y_train)), FLAGS.batch_size, FLAGS.num_epochs)
            for batch in batches:
                x_batch, y_batch = zip(*batch)
                train_step(x_batch, y_batch)
                current_step = tf.train.global_step(sess, global_step)
                if current_step % FLAGS.evaluate_every == 0:
                    print("\nEvaluation:")
                    dev_step(x_dev, y_dev, writer=dev_summary_writer)
                    print("")
                if current_step % FLAGS.checkpoint_every == 0:
                    path = saver.save(sess, checkpoint_prefix, global_step=current_step)
                    print("Saved model checkpoint to {}\n".format(path))
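Run the training:
train(x_train, y_train, vocab_processor, x_dev, y_dev)
Running it prints, for every step, a timestamp, the step number, the loss and the accuracy.
The model weights, the network structure and related information are stored under the output path shown in the log. To predict on new data with the trained model, simply load it from that path.
Define the test data (note that the test data and the training data must be indexed with the same vocabulary):
# train_df was deleted after the split above, so read the file again
train_df = pd.read_csv('./data/train_set.csv', sep='\t', nrows=15000)
train_df['content'] = train_df['text'].apply(lambda x: [str(i) for i in x.split(' ')])
dfx = train_df['content']
# Rows 12000-14999 were also part of the training/validation data, so the accuracy measured on them is optimistic
y_test = train_df['label'][12000:15000]
# Reuse the vocab_processor fitted above so that the indices match those seen during training
x_test = np.array(list(vocab_processor.transform(dfx)))[12000:15000]
Load the trained model and test it:
# Load the previously trained model and test it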
graph = tf.Graph()
with graph.as_default():
    session_conf = tf.ConfigProto(
        allow_soft_placement=FLAGS.allow_soft_placement,
        log_device_placement=FLAGS.log_device_placement)
    sess = tf.Session(config=session_conf)
    with sess.as_default():
        # Replace these paths with the checkpoint directory printed by your own training run
        saver = tf.train.import_meta_graph(r'C:\Users\modiker\NLP学习\runs\1596162730\checkpoints\model-4200.meta')
        saver.restore(sess, tf.train.latest_checkpoint(r'C:\Users\modiker\NLP学习\runs\1596162730\checkpoints'))
        # Look up the input and output tensors of the restored graph by name
        input_x = graph.get_operation_by_name("input_x").outputs[0]
        dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0]
        predictions = graph.get_operation_by_name("output/predictions").outputs[0]
        batches = batch_iter(list(x_test), FLAGS.batch_size, 1, shuffle=False)
        all_predictions = []
        for x_test_batch in batches:
            batch_predictions = sess.run(predictions, {input_x: x_test_batch, dropout_keep_prob: 1.0})
            all_predictions = np.concatenate([all_predictions, batch_predictions])
# Print accuracy if y_test is defined
if y_test is not None:
    correct_predictions = float(sum(all_predictions == y_test))
    print("Total number of test examples: {}".format(len(y_test)))
    print("Accuracy: {:g}".format(correct_predictions / float(len(y_test))))
Running this prints the number of test examples and the accuracy.
TextRNN implementation
Reference code: TextRNN based on TensorFlow
The construction is essentially the same as for TextCNN.
Load the data:
train_df = pd.read_csv('./data/train_set.csv', sep='\t', nrows=17000)
train_df['content'] = train_df['text'].apply(lambda x: [str(i) for i in x.split(' ')])
dfx = train_df['content']
dfy = train_df['label']
from tensorflow.contrib import learn
max_document_length = 500
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length,tokenizer_fn=list)
x = np.array(list(vocab_processor.fit_transform(dfx)))[:15000]
from sklearn.preprocessing import LabelBinarizer
lb=LabelBinarizer()
lb.fit(dfy)
y=lb.transform(dfy)[:15000]
# Hold out the remaining 2000 rows as test data (reusing the vocabulary fitted above)
test_x = np.array(list(vocab_processor.transform(dfx)))[15000:]
test_y = dfy[15000:]
np.random.seed(10)
shuffle_indices = np.random.permutation(np.arange(len(y)))
x_shuffled = x[shuffle_indices]
y_shuffled = y[shuffle_indices]
dev_sample_index = -1 * int(0.1 * float(len(y)))
x_train, x_dev = x_shuffled[:dev_sample_index], x_shuffled[dev_sample_index:]
y_train, y_dev = y_shuffled[:dev_sample_index], y_shuffled[dev_sample_index:]
del train_df, dfx, dfy, x, y, x_shuffled, y_shuffled
Build TextRNN:
# Build the TextRNN network structure (the reference code names the class TextRCNN)
class TextRCNN:
    def __init__(self, sequence_length, num_classes, vocab_size, word_embedding_size, context_embedding_size,
                 cell_type, hidden_size, l2_reg_lambda=0.0):
        # Placeholders for input, output and dropout
        self.input_text = tf.placeholder(tf.int32, shape=[None, sequence_length], name='input_text')
        self.input_y = tf.placeholder(tf.float32, shape=[None, num_classes], name='input_y')
        self.dropout_keep_prob = tf.placeholder(tf.float32, name='dropout_keep_prob')
        l2_loss = tf.constant(0.0)
        text_length = self._length(self.input_text)
        # Embeddings
        with tf.device('/cpu:0'), tf.name_scope("embedding"):
            self.W_text = tf.Variable(tf.random_uniform([vocab_size, word_embedding_size], -1.0, 1.0), name="W_text")
            self.embedded_chars = tf.nn.embedding_lookup(self.W_text, self.input_text)
        # Bidirectional(Left&Right) Recurrent Structure
        with tf.name_scope("bi-rnn"):
            fw_cell = self._get_cell(context_embedding_size, cell_type)
            fw_cell = tf.nn.rnn_cell.DropoutWrapper(fw_cell, output_keep_prob=self.dropout_keep_prob)
            bw_cell = self._get_cell(context_embedding_size, cell_type)
            bw_cell = tf.nn.rnn_cell.DropoutWrapper(bw_cell, output_keep_prob=self.dropout_keep_prob)
            (self.output_fw, self.output_bw), states = tf.nn.bidirectional_dynamic_rnn(cell_fw=fw_cell,
                                                                                       cell_bw=bw_cell,
                                                                                       inputs=self.embedded_chars,
                                                                                       sequence_length=text_length,
                                                                                       dtype=tf.float32)
        with tf.name_scope("context"):
            shape = [tf.shape(self.output_fw)[0], 1, tf.shape(self.output_fw)[2]]
            self.c_left = tf.concat([tf.zeros(shape), self.output_fw[:, :-1]], axis=1, name="context_left")
            self.c_right = tf.concat([self.output_bw[:, 1:], tf.zeros(shape)], axis=1, name="context_right")
        with tf.name_scope("word-representation"):
            self.x = tf.concat([self.c_left, self.embedded_chars, self.c_right], axis=2, name="x")
            embedding_size = 2*context_embedding_size + word_embedding_size
        with tf.name_scope("text-representation"):
            W2 = tf.Variable(tf.random_uniform([embedding_size, hidden_size], -1.0, 1.0), name="W2")
            b2 = tf.Variable(tf.constant(0.1, shape=[hidden_size]), name="b2")
            self.y2 = tf.tanh(tf.einsum('aij,jk->aik', self.x, W2) + b2)
        with tf.name_scope("max-pooling"):
            self.y3 = tf.reduce_max(self.y2, axis=1)
        with tf.name_scope("output"):
            W4 = tf.get_variable("W4", shape=[hidden_size, num_classes], initializer=tf.contrib.layers.xavier_initializer())
            b4 = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b4")
            l2_loss += tf.nn.l2_loss(W4)
            l2_loss += tf.nn.l2_loss(b4)
            self.logits = tf.nn.xw_plus_b(self.y3, W4, b4, name="logits")
            self.predictions = tf.argmax(self.logits, 1, name="predictions")
        # Calculate mean cross-entropy loss
        with tf.name_scope("loss"):
            losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.logits, labels=self.input_y)
            self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss
        # Accuracy
        with tf.name_scope("accuracy"):
            correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, axis=1))
            self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, tf.float32), name="accuracy")
    @staticmethod
    def _get_cell(hidden_size, cell_type):
        if cell_type == "vanilla":
            return tf.nn.rnn_cell.BasicRNNCell(hidden_size)
        elif cell_type == "lstm":
            return tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
        elif cell_type == "gru":
            return tf.nn.rnn_cell.GRUCell(hidden_size)
        else:
            # Fail immediately instead of returning None, which would only break later inside DropoutWrapper
            raise ValueError("'" + cell_type + "' is a wrong cell type !!!")
    # Length of the sequence data
    @staticmethod
    def _length(seq):
        relevant = tf.sign(tf.abs(seq))
        length = tf.reduce_sum(relevant, axis=1)
        length = tf.cast(length, tf.int32)
        return length
    # Extract the output of last cell of each sequence
    # Ex) The movie is good -> length = 4
    #     output = [ [1.314, -3.32, ..., 0.98]
    #                [0.287, -0.50, ..., 1.55]
    #                [2.194, -2.12, ..., 0.63]
    #                [1.938, -1.88, ..., 1.31]
    #                [  0.0,   0.0, ...,  0.0]
    #                ...
    #                [  0.0,   0.0, ...,  0.0] ]
    # The output we need is 4th output of cell, so extract it.
    @staticmethod
    def last_relevant(seq, length):
        batch_size = tf.shape(seq)[0]
        max_length = int(seq.get_shape()[1])
        input_size = int(seq.get_shape()[2])
        index = tf.range(0, batch_size) * max_length + (length - 1)
        flat = tf.reshape(seq, [-1, input_size])
        return tf.gather(flat, index)
Generate batches:
Same as for TextCNN
Model parameters:
# Model Hyperparameters
tf.flags.DEFINE_string("cell_type", "vanilla", "Type of RNN cell. Choose 'vanilla' or 'lstm' or 'gru' (Default: vanilla)")
tf.flags.DEFINE_string("word2vec", None, "Word2vec file with pre-trained embeddings")
tf.flags.DEFINE_integer("word_embedding_dim", 128, "Dimensionality of word embedding (Default: 128)")
tf.flags.DEFINE_integer("context_embedding_dim", 128, "Dimensionality of context embedding(= RNN state size) (Default: 128)")
tf.flags.DEFINE_integer("hidden_size", 64, "Size of hidden layer (Default: 64)")
tf.flags.DEFINE_float("dropout_keep_prob", 0.7, "Dropout keep probability (Default: 0.7)")
tf.flags.DEFINE_float("l2_reg_lambda", 0.5, "L2 regularization lambda (Default: 0.5)")
# Training parameters
tf.flags.DEFINE_integer("batch_size", 64, "Batch Size (Default: 64)")
tf.flags.DEFINE_integer("num_epochs", 5, "Number of training epochs (Default: 5)")
tf.flags.DEFINE_integer("display_every", 10, "Number of iterations to display training info.")
tf.flags.DEFINE_integer("evaluate_every", 100, "Evaluate model on dev set after this many steps")
tf.flags.DEFINE_integer("checkpoint_every", 100, "Save model after this many steps")
tf.flags.DEFINE_integer("num_checkpoints", 5, "Number of checkpoints to store")
tf.flags.DEFINE_float("learning_rate", 1e-3, "Which learning rate to start with. (Default: 1e-3)")
# Misc Parameters
tf.flags.DEFINE_boolean("allow_soft_placement", True, "Allow device soft device placement")
tf.flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices")
FLAGS = tf.flags.FLAGS
# The two private attributes below exist only in older TensorFlow 1.x releases;
# where tf.flags wraps absl.flags, FLAGS.flag_values_dict() is the public way to list the values
FLAGS._parse_flags()
print("\nParameters:")
for attr, value in sorted(FLAGS.__flags.items()):
    print("{} = {}".format(attr.upper(), value))
print("")
Define the training function:
def train(x_train, y_train, vocab_processor, x_dev, y_dev):
    with tf.Graph().as_default():
        session_conf = tf.ConfigProto(
            allow_soft_placement=FLAGS.allow_soft_placement,
            log_device_placement=FLAGS.log_device_placement)
        sess = tf.Session(config=session_conf)
        with sess.as_default():
            rcnn = TextRCNN(
                sequence_length=x_train.shape[1],
                num_classes=y_train.shape[1],
                vocab_size=len(vocab_processor.vocabulary_),
                word_embedding_size=FLAGS.word_embedding_dim,
                context_embedding_size=FLAGS.context_embedding_dim,
                cell_type=FLAGS.cell_type,
                hidden_size=FLAGS.hidden_size,
                l2_reg_lambda=FLAGS.l2_reg_lambda
            )
            # Define Training procedure
            global_step = tf.Variable(0, name="global_step", trainable=False)
            train_op = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize(rcnn.loss, global_step=global_step)
            # Output directory for models and summaries
            timestamp = str(int(time.time()))
            out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))
            print("Writing to {}\n".format(out_dir))
            # Summaries for loss and accuracy
            loss_summary = tf.summary.scalar("loss", rcnn.loss)
            acc_summary = tf.summary.scalar("accuracy", rcnn.accuracy)
            # Train Summaries
            train_summary_op = tf.summary.merge([loss_summary, acc_summary])
            train_summary_dir = os.path.join(out_dir, "summaries", "train")
            train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)
            # Dev summaries
            dev_summary_op = tf.summary.merge([loss_summary, acc_summary])
            dev_summary_dir = os.path.join(out_dir, "summaries", "dev")
            dev_summary_writer = tf.summary.FileWriter(dev_summary_dir, sess.graph)
            # Checkpoint directory. Tensorflow assumes this directory already exists so we need to create it
            checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints"))
            checkpoint_prefix = os.path.join(checkpoint_dir, "model")
            if not os.path.exists(checkpoint_dir):
                os.makedirs(checkpoint_dir)
            saver = tf.train.Saver(tf.global_variables(), max_to_keep=FLAGS.num_checkpoints)
            # Write vocabulary
            vocab_processor.save(os.path.join(out_dir, "text_vocab"))
            # Initialize all variables
            sess.run(tf.global_variables_initializer())
            # Pre-trained word2vec
            if FLAGS.word2vec:
                # initial matrix with random uniform
                initW = np.random.uniform(-0.25, 0.25, (len(vocab_processor.vocabulary_), FLAGS.word_embedding_dim))
                # load any vectors from the word2vec
                print("Load word2vec file {0}".format(FLAGS.word2vec))
                with open(FLAGS.word2vec, "rb") as f:
                    header = f.readline()
                    vocab_size, layer1_size = map(int, header.split())
                    binary_len = np.dtype('float32').itemsize * layer1_size
                    for line in range(vocab_size):
                        word = []
                        while True:
                            ch = f.read(1).decode('latin-1')
                            if ch == ' ':
                                word = ''.join(word)
                                break
                            if ch != '\n':
                                word.append(ch)
                        idx = vocab_processor.vocabulary_.get(word)
                        if idx != 0:
                            initW[idx] = np.fromstring(f.read(binary_len), dtype='float32')
                        else:
                            f.read(binary_len)
                sess.run(rcnn.W_text.assign(initW))
                print("Success to load pre-trained word2vec model!\n")
            # Generate batches
            batches = batch_iter(
                list(zip(x_train, y_train)), FLAGS.batch_size, FLAGS.num_epochs)
            # Training loop. For each batch...
            for batch in batches:
                x_batch, y_batch = zip(*batch)
                # Train
                feed_dict = {
                    rcnn.input_text: x_batch,
                    rcnn.input_y: y_batch,
                    rcnn.dropout_keep_prob: FLAGS.dropout_keep_prob
                }
                _, step, summaries, loss, accuracy = sess.run(
                    [train_op, global_step, train_summary_op, rcnn.loss, rcnn.accuracy], feed_dict)
                train_summary_writer.add_summary(summaries, step)
                # Training log display
                if step % FLAGS.display_every == 0:
                    time_str = datetime.datetime.now().isoformat()
                    print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
                # Evaluation
                if step % FLAGS.evaluate_every == 0:
                    print("\nEvaluation:")
                    feed_dict_dev = {
                        rcnn.input_text: x_dev,
                        rcnn.input_y: y_dev,
                        rcnn.dropout_keep_prob: 1.0
                    }
                    summaries_dev, loss, accuracy = sess.run(
                        [dev_summary_op, rcnn.loss, rcnn.accuracy], feed_dict_dev)
                    dev_summary_writer.add_summary(summaries_dev, step)
                    time_str = datetime.datetime.now().isoformat()
                    print("{}: step {}, loss {:g}, acc {:g}\n".format(time_str, step, loss, accuracy))
                # Model checkpoint
                if step % FLAGS.checkpoint_every == 0:
                    path = saver.save(sess, checkpoint_prefix, global_step=step)
                    print("Saved model checkpoint to {}\n".format(path))
Run the training:
train(x_train, y_train, vocab_processor, x_dev, y_dev)
References:
[NLP] 秒懂词向量Word2vec的本质