Paper: Translation and Interpretation of 《Generating Sequences With Recurrent Neural Networks》 (Part 2)

3 Text Prediction

Text data is discrete, and is typically presented to neural networks using ‘one-hot’ input vectors. That is, if there are K text classes in total, and class k is fed in at time t, then x_t is a length-K vector whose entries are all zero except for the kth, which is one. Pr(x_{t+1} | y_t) is therefore a multinomial distribution, which can be naturally parameterised by a softmax function at the output layer:

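The softmax equation referred to above is not reproduced in this extract. Written in the notation of the surrounding text (and assuming that ŷ_t denotes the output activations before the softmax is applied, which is not stated explicitly here), the standard parameterisation reads:

    Pr(x_{t+1} = k | y_t) = y_t^k = exp(ŷ_t^k) / Σ_{k'=1..K} exp(ŷ_t^{k'})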

The only thing that remains to be decided is which set of classes to use. In most cases, text prediction (usually referred to as language modelling) is performed at the word level. K is therefore the number of words in the dictionary. This can be problematic for realistic tasks, where the number of words (including variant conjugations, proper names, etc.) often exceeds 100,000. As well as requiring many parameters to model, having so many classes demands a huge amount of training data to adequately cover the possible contexts for the words. In the case of softmax models, a further difficulty is the high computational cost of evaluating all the exponentials during training (although several methods have been devised to make training large softmax layers more efficient, including tree-based models [25, 23], low-rank approximations [27] and stochastic derivatives [26]). Furthermore, word-level models are not applicable to text data containing non-word strings, such as multi-digit numbers or web addresses.


Character-level language modelling with neural networks has recently been considered [30, 24], and found to give slightly worse performance than equivalent word-level models. Nonetheless, predicting one character at a time is more interesting from the perspective of sequence generation, because it allows the network to invent novel words and strings. In general, the experiments in this paper aim to predict at the finest granularity found in the data, so as to maximise the generative flexibility of the network.


3.1 Penn Treebank Experiments

The first set of text prediction experiments focused on the Penn Treebank portion of the Wall Street Journal corpus [22]. This was a preliminary study whose main purpose was to gauge the predictive power of the network, rather than to generate interesting sequences.


Although a relatively small text corpus (a little over a million words in total), the Penn Treebank data is widely used as a language modelling benchmark. The training set contains 930,000 words, the validation set contains 74,000 words and the test set contains 82,000 words. The vocabulary is limited to 10,000 words, with all other words mapped to a special ‘unknown word’ token. The end-of-sentence token was included in the input sequences, and was counted in the sequence loss. The start-of-sentence marker was ignored, because its role is already fulfilled by the null vectors that begin the sequences (cf. Section 2).
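
As a rough illustration of the vocabulary restriction described above (this is not the paper's preprocessing code; the function and token names are assumptions), mapping out-of-vocabulary words to an ‘unknown word’ token could be sketched in Python as:

    from collections import Counter

    def build_vocab(train_words, vocab_size=10000, unk='<unk>'):
        # Keep the most frequent words and reserve one id for the unknown token.
        counts = Counter(train_words)
        word_to_id = {unk: 0}
        for w, _ in counts.most_common(vocab_size - 1):
            word_to_id[w] = len(word_to_id)
        return word_to_id

    def encode(words, word_to_id, unk='<unk>'):
        # Words outside the 10,000-word vocabulary map to the '<unk>' id.
        return [word_to_id.get(w, word_to_id[unk]) for w in words]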

The experiments compared the performance of word and character-level LSTM predictors on the Penn corpus. In both cases, the network architecture was a single hidden layer with 1000 LSTM units. For the character-level network the input and output layers were size 49, giving approximately 4.3M weights in total, while the word-level network had 10,000 inputs and outputs and around 54M weights. The comparison is therefore somewhat unfair, as the word-level network had many more parameters. However, as the dataset is small, both networks were easily able to overfit the training data, and it is not clear whether the character-level network would have benefited from more weights. All networks were trained with stochastic gradient descent, using a learning rate of 0.0001 and a momentum of 0.99. The LSTM derivatives were clipped in the range [−1, 1] (cf. Section 2.1).

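A minimal PyTorch sketch of the character-level setup just described (a single hidden layer of 1000 LSTM units, 49 input/output classes, SGD with learning rate 0.0001 and momentum 0.99). The paper clips the LSTM derivatives inside the cell; that is approximated here by clamping parameter gradients element-wise to [−1, 1], and all class and variable names are illustrative:

    import torch
    import torch.nn as nn

    class CharLSTM(nn.Module):
        def __init__(self, n_classes=49, hidden=1000):
            super().__init__()
            self.lstm = nn.LSTM(n_classes, hidden, batch_first=True)
            self.out = nn.Linear(hidden, n_classes)

        def forward(self, x, state=None):
            h, state = self.lstm(x, state)   # x: one-hot, shape (batch, time, n_classes)
            return self.out(h), state        # logits parameterising Pr(x_{t+1} | y_t)

    model = CharLSTM()
    optimiser = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.99)
    criterion = nn.CrossEntropyLoss()

    def train_step(inputs, targets):
        logits, _ = model(inputs)
        loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        optimiser.zero_grad()
        loss.backward()
        for p in model.parameters():         # approximate derivative clipping
            p.grad.clamp_(-1.0, 1.0)
        optimiser.step()
        return loss.item()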

Neural networks are usually evaluated on test data with fixed weights. For prediction problems, however, where the inputs are the targets, it is legitimate to allow the network to adapt its weights as it is being evaluated (so long as it only sees the test data once). Mikolov refers to this as dynamic evaluation. Dynamic evaluation allows for a fairer comparison with compression algorithms, for which there is no division between training and test sets, as all data is only predicted once.

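A sketch of dynamic evaluation as described above, reusing the hypothetical CharLSTM interface from the previous sketch: each test chunk is scored with the current weights and the weights are then updated on that same chunk, so the test data is still only seen once:

    import math

    def dynamic_evaluation(model, optimiser, criterion, test_chunks):
        # Score each chunk, then adapt the weights on it before moving on.
        total_loss, total_steps = 0.0, 0
        state = None
        for inputs, targets in test_chunks:
            logits, state = model(inputs, state)
            loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
            total_loss += loss.item() * targets.numel()
            total_steps += targets.numel()
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
            state = tuple(s.detach() for s in state)   # no gradients across chunks
        return total_loss / total_steps / math.log(2)  # average loss in bits per character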

Table 1: Penn Treebank Test Set Results. ‘BPC’ is bits-per-character. ‘Error’ is next-step classification error rate, for either characters or words.


Since both networks overfit the training data, we also experiment with two types of regularisation: weight noise [18] with a std. deviation of 0.075 applied to the network weights at the start of each training sequence, and adaptive weight noise [8], where the variance of the noise is learned along with the weights using a Minimum Description Length (or equivalently, variational inference) loss function. When weight noise was used, the network was initialised with the final weights of the unregularised network. Similarly, when adaptive weight noise was used, the weights were initialised with those of the network trained with weight noise. We have found that retraining with iteratively increased regularisation is considerably faster than training from random weights with regularisation. Adaptive weight noise was found to be prohibitively slow for the word-level network, so it was regularised with fixed-variance weight noise only. One advantage of adaptive weight noise is that early stopping is not needed (the network can safely be stopped at the point of minimum total ‘description length’ on the training data). However, to keep the comparison fair, the same training, validation and test sets were used for all experiments.

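A rough sketch of the fixed-variance weight noise regulariser mentioned above (not the paper's code; all names are illustrative and the adaptive variant with a learned variance is not shown): at the start of each training sequence, Gaussian noise with standard deviation 0.075 is added to a saved copy of the weights, gradients are computed with the noisy weights, and the update is then applied to the clean weights:

    import torch

    def noisy_sequence_step(model, optimiser, criterion, inputs, targets, std=0.075):
        # Perturb a saved copy of the weights, backpropagate through the noisy
        # network, restore the clean weights, then apply the gradient update.
        clean = [p.detach().clone() for p in model.parameters()]
        with torch.no_grad():
            for p in model.parameters():
                p.add_(torch.randn_like(p) * std)
        logits, _ = model(inputs)
        loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        optimiser.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, c in zip(model.parameters(), clean):
                p.copy_(c)              # gradients are applied to the clean weights
        optimiser.step()
        return loss.item()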

The results are presented with two equivalent metrics: bits-per-character (BPC), which is the average value of −log_2 Pr(x_{t+1} | y_t) over the whole test set; and perplexity, which is two to the power of the average number of bits per word (the average word length on the test set is about 5.6 characters, so perplexity ≈ 2^(5.6 BPC)). Perplexity is the usual performance measure for language modelling.

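As a quick worked example of this conversion (the BPC value here is hypothetical, not taken from Table 1):

    # Perplexity is 2 to the power of bits-per-word; with ~5.6 characters
    # per word, bits-per-word = 5.6 * BPC.
    avg_chars_per_word = 5.6
    bpc = 1.30                                 # hypothetical bits-per-character
    perplexity = 2 ** (avg_chars_per_word * bpc)
    print(round(perplexity, 1))                # ≈ 155.4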

Table 1 shows that the word-level RNN performed better than the character-level network, but the gap appeared to close when regularisation was used. Overall the results compare favourably with those collected in Tomas Mikolov’s thesis [23]. For example, he records a perplexity of 141 for a 5-gram with Kneser-Ney smoothing, 141.8 for a word-level feedforward neural network, 131.1 for the state-of-the-art compression algorithm PAQ8 and 123.2 for a dynamically evaluated word-level RNN. However, by combining multiple RNNs, a 5-gram and a cache model in an ensemble, he was able to achieve a perplexity of 89.4. Interestingly, the benefit of dynamic evaluation was far more pronounced here than in Mikolov’s thesis (he records a perplexity improvement from 124.7 to 123.2 with word-level RNNs). This suggests that LSTM is better at rapidly adapting to new data than ordinary RNNs.


3.2 Wikipedia Experiments

In 2006 Marcus Hutter, Jim Bowery and Matt Mahoney organised the following challenge, commonly known as Hutter prize [17]: to compress the first 100 million bytes of the complete English Wikipedia data (as it was at a certain time on March 3rd 2006) to as small a file as possible. The file had to include not only the compressed data, but also the code implementing the compression algorithm. Its size can therefore be considered a measure of the minimum description length [13] of the data using a two part coding scheme.


Wikipedia data is interesting from a sequence generation perspective because it contains not only a huge range of dictionary words, but also many character sequences that would not be included in text corpora traditionally used for language modelling: for example, foreign words (including letters from non-Latin alphabets such as Arabic and Chinese), indented XML tags used to define meta-data, website addresses, and markup used to indicate page formatting such as headings, bullet points etc. An extract from the Hutter prize dataset is shown in Figs. 3 and 4.


The first 96M bytes in the data were evenly split into sequences of 100 bytes and used to train the network, with the remaining 4M used for validation. The data contains a total of 205 one-byte unicode symbols. The total number of characters is much higher, since many characters (especially those from non-Latin languages) are defined as multi-symbol sequences. In keeping with the principle of modelling the smallest meaningful units in the data, the network predicted a single byte at a time, and therefore had input and output layers of size 205.
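
A rough sketch of this data preparation (the file name 'enwik8' and all variable names are assumptions, not given in the text): the first 96M bytes become 100-byte training sequences and the remaining 4M are held out for validation, with the distinct byte values mapped to contiguous class ids:

    # Split the raw Wikipedia dump into fixed-length byte sequences (sketch).
    SEQ_LEN = 100
    TRAIN_BYTES = 96_000_000

    with open('enwik8', 'rb') as f:          # assumed path to the Hutter Prize data
        data = f.read(TRAIN_BYTES + 4_000_000)

    train, valid = data[:TRAIN_BYTES], data[TRAIN_BYTES:]
    train_seqs = [train[i:i + SEQ_LEN] for i in range(0, len(train), SEQ_LEN)]

    # The data contains 205 distinct byte values; map them to contiguous ids.
    symbols = sorted(set(data))
    byte_to_id = {b: i for i, b in enumerate(symbols)}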

Wikipedia contains long-range regularities, such as the topic of an article, which can span many thousands of words. To make it possible for the network to capture these, its internal state (that is, the output activations h_t of the hidden layers, and the activations c_t of the LSTM cells within the layers) were only reset every 100 sequences. Furthermore, the order of the sequences was not shuffled during training, as it usually is for neural networks. The network was therefore able to access information from up to 10K characters in the past when making predictions. The error terms were only backpropagated to the start of each 100-byte sequence, meaning that the gradient calculation was approximate. This form of truncated backpropagation has been considered before for RNN language modelling [23], and found to speed up training (by reducing the sequence length and hence increasing the frequency of stochastic weight updates) without affecting the network’s ability to learn long-range dependencies.

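A sketch of the training-loop structure this implies, again assuming the hypothetical model, optimiser and criterion interfaces from the earlier sketches (the Wikipedia network itself is deeper, with seven layers of 700 cells): the LSTM state persists across consecutive 100-byte sequences, gradients are truncated at each sequence boundary, and the state is reset every 100 sequences:

    def train_epoch(model, optimiser, criterion, sequences):
        # Truncated backpropagation: gradients stop at each 100-byte sequence
        # boundary, but the LSTM state carries over and is reset every 100 sequences.
        state = None
        for i, (inputs, targets) in enumerate(sequences):
            if i % 100 == 0:
                state = None                              # reset h_t and c_t
            logits, state = model(inputs, state)
            loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
            optimiser.zero_grad()
            loss.backward()
            for p in model.parameters():
                p.grad.clamp_(-1.0, 1.0)
            optimiser.step()
            state = tuple(s.detach() for s in state)      # truncate the gradient here
        return model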

A much larger network was used for this data than for the Penn data (reflecting the greater size and complexity of the training set), with seven hidden layers of 700 LSTM cells, giving approximately 21.3M weights. The network was trained with stochastic gradient descent, using a learning rate of 0.0001 and a momentum of 0.9. It took four training epochs to converge. The LSTM derivatives were clipped in the range [−1, 1].


As with the Penn data, we tested the network on the validation data with and without dynamic evaluation (where the weights are updated as the data is predicted). As can be seen from Table 2, performance was much better with dynamic evaluation. This is probably because of the long-range coherence of Wikipedia data; for example, certain words are much more frequent in some articles than others, and being able to adapt to this during evaluation is advantageous. It may seem surprising that the dynamic results on the validation set were substantially better than on the training set. However, this is easily explained by two factors: firstly, the network underfit the training data, and secondly, some portions of the data are much more difficult than others (for example, plain text is harder to predict than XML tags).


To put the results in context, the current winner of the Hutter Prize (a variant of the PAQ-8 compression algorithm [20]) achieves 1.28 BPC on the same data (including the code required to implement the algorithm), mainstream compressors such as zip generally get more than 2, and a character-level RNN applied to a text-only version of the data (i.e. with all the XML, markup tags etc. removed) achieved 1.54 on held-out data, which improved to 1.47 when the RNN was combined with a maximum entropy model [24].


A four-page sample generated by the prediction network is shown in Figs. 5 to 8. The sample shows that the network has learned a lot of structure from the data, at a wide range of different scales. Most obviously, it has learned a large vocabulary of dictionary words, along with a subword model that enables it to invent feasible-looking words and names: for example “Lochroom River”, “Mughal Ralvaldens”, “submandration”, “swalloped”. It has also learned basic punctuation, with commas, full stops and paragraph breaks occurring at roughly the right rhythm in the text blocks.

Being able to correctly open and close quotation marks and parentheses is a clear indicator of a language model’s memory, because the closure cannot be predicted from the intervening text, and hence cannot be modelled with short-range context [30]. The sample shows that the network is able to balance not only parentheses and quotes, but also formatting marks such as the equals signs used to denote headings, and even nested XML tags and indentation.


The network generates non-Latin characters such as Cyrillic, Chinese and Arabic, and seems to have learned a rudimentary model for languages other than English (e.g. it generates “es:Geotnia slago” for the Spanish ‘version’ of an article, and “nl:Rodenbaueri” for the Dutch one). It also generates convincing-looking internet addresses (none of which appear to be real).

The network generates distinct, large-scale regions, such as XML headers, bullet-point lists and article text. Comparison with Figs. 3 and 4 suggests that these regions are a fairly accurate reflection of the constitution of the real data (although the generated versions tend to be somewhat shorter and more jumbled together). This is significant because each region may span hundreds or even thousands of timesteps. The fact that the network is able to remain coherent over such large intervals (even putting the regions in an approximately correct order, such as having headers at the start of articles and bullet-pointed ‘see also’ lists at the end) is testament to its long-range memory.


As with all text generated by language models, the sample does not make sense beyond the level of short phrases. The realism could perhaps be improved with a larger network and/or more data. However, it seems futile to expect meaningful language from a machine that has never been exposed to the sensory world to which language refers.

Lastly, the network’s adaptation to recent sequences during training (which allows it to benefit from dynamic evaluation) can be clearly observed in the extract. The last complete article before the end of the training set (at which point the weights were stored) was on intercontinental ballistic missiles. The influence of this article on the network’s language model can be seen from the profusion of missile-related terms. Other recent topics include ‘Individual Anarchism’, the Italian writer Italo Calvino and the International Organization for Standardization (ISO), all of which make themselves felt in the network’s vocabulary.
