
Paper: Translation and Commentary on "Generating Sequences With Recurrent Neural Networks" (Part 1)

Generating Sequences With Recurrent Neural Networks


Original paper: Generating Sequences With Recurrent Neural Networks

Author:

Alex Graves, Department of Computer Science, University of Toronto ([email protected])

Abstract

This paper shows how Long Short-term Memory recurrent neural networks can be used to generate complex sequences with long-range structure, simply by predicting one data point at a time. The approach is demonstrated for text (where the data are discrete) and online handwriting (where the data are real-valued). It is then extended to handwriting synthesis by allowing the network to condition its predictions on a text sequence. The resulting system is able to generate highly realistic cursive handwriting in a wide variety of styles.

1 Introduction

Recurrent neural networks (RNNs) are a rich class of dynamic models that have been used to generate sequences in domains as diverse as music [6, 4], text [30] and motion capture data [29]. RNNs can be trained for sequence generation by processing real data sequences one step at a time and predicting what comes next. Assuming the predictions are probabilistic, novel sequences can be generated from a trained network by iteratively sampling from the network's output distribution, then feeding in the sample as input at the next step: in other words, by making the network treat its inventions as if they were real, much like a person dreaming. Although the network itself is deterministic, the stochasticity injected by picking samples induces a distribution over sequences. This distribution is conditional, since the internal state of the network, and hence its predictive distribution, depends on the previous inputs.
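This generation loop is simple to state in code. The sketch below is ours, not the paper's: it assumes a hypothetical `model` wrapper with an `initial_state()` method and a `step(x, state)` method that returns a probability vector over the next symbol (the discrete case) together with the updated state.

```python
import numpy as np

def generate(model, length, vocab_size, seed=0):
    """Generate a sequence by iteratively sampling from the network's
    output distribution and feeding each sample back in as the next input.
    `model` is a hypothetical wrapper: step(x, state) -> (probs, state)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(vocab_size)                  # the first input is always a null vector
    state = model.initial_state()
    sequence = []
    for _ in range(length):
        probs, state = model.step(x, state)   # Pr(x_{t+1} | y_t)
        symbol = rng.choice(vocab_size, p=probs)  # stochasticity comes from sampling
        x = np.zeros(vocab_size)
        x[symbol] = 1.0                       # treat the sample as if it were real input
        sequence.append(symbol)
    return sequence
```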

RNNs are 'fuzzy' in the sense that they do not use exact templates from the training data to make predictions, but rather, like other neural networks, use their internal representation to perform a high-dimensional interpolation between training examples. This distinguishes them from n-gram models and compression algorithms such as Prediction by Partial Matching [5], whose predictive distributions are determined by counting exact matches between the recent history and the training set. The result, which is immediately apparent from the samples in this paper, is that RNNs (unlike template-based algorithms) synthesise and reconstitute the training data in a complex way, and rarely generate the same thing twice. Furthermore, fuzzy predictions do not suffer from the curse of dimensionality, and are therefore much better at modelling real-valued or multivariate data than exact matches.

In principle a large enough RNN should be sufficient to generate sequences of arbitrary complexity. In practice however, standard RNNs are unable to store information about past inputs for very long [15]. As well as diminishing their ability to model long-range structure, this ‘amnesia’ makes them prone to instability when generating sequences. The problem (common to all conditional generative models) is that if the network’s predictions are only based on the last few inputs, and these inputs were themselves predicted by the network, it has little opportunity to recover from past mistakes. Having a longer memory has a stabilising effect, because even if the network cannot make sense of its recent history, it can look further back in the past to formulate its predictions. The problem of instability is especially acute with real-valued data, where it is easy for the predictions to stray from the manifold on which the training data lies. One remedy that has been proposed for conditional models is to inject noise into the predictions before feeding them back into the model [31], thereby increasing the model’s robustness to surprising inputs. However we believe that a better memory is a more profound and effective solution.


Long Short-term Memory (LSTM) [16] is an RNN architecture designed to be better at storing and accessing information than standard RNNs. LSTM has recently given state-of-the-art results in a variety of sequence processing tasks, including speech and handwriting recognition [10, 12]. The main goal of this paper is to demonstrate that LSTM can use its memory to generate complex, realistic sequences containing long-range structure.

Figure 1: Deep recurrent neural network prediction architecture. The circles represent network layers, the solid lines represent weighted connections and the dashed lines represent predictions.


Section 2 defines a 'deep' RNN composed of stacked LSTM layers, and explains how it can be trained for next-step prediction and hence sequence generation. Section 3 applies the prediction network to text from the Penn Treebank and Hutter Prize Wikipedia datasets. The network's performance is competitive with state-of-the-art language models, and it works almost as well when predicting one character at a time as when predicting one word at a time. The highlight of the section is a generated sample of Wikipedia text, which showcases the network's ability to model long-range dependencies. Section 4 demonstrates how the prediction network can be applied to real-valued data through the use of a mixture density output layer, and provides experimental results on the IAM Online Handwriting Database. It also presents generated handwriting samples proving the network's ability to learn letters and short words direct from pen traces, and to model global features of handwriting style. Section 5 introduces an extension to the prediction network that allows it to condition its outputs on a short annotation sequence whose alignment with the predictions is unknown. This makes it suitable for handwriting synthesis, where a human user inputs a text and the algorithm generates a handwritten version of it. The synthesis network is trained on the IAM database, then used to generate cursive handwriting samples, some of which cannot be distinguished from real data by the naked eye. A method for biasing the samples towards higher probability (and greater legibility) is described, along with a technique for 'priming' the samples on real data and thereby mimicking a particular writer's style. Finally, concluding remarks and directions for future work are given in Section 6.

2 Prediction Network

Fig. 1 illustrates the basic recurrent neural network prediction architecture used in this paper. An input vector sequence x = (x_1, . . . , x_T) is passed through weighted connections to a stack of N recurrently connected hidden layers to compute first the hidden vector sequences h^n = (h^n_1, . . . , h^n_T) and then the output vector sequence y = (y_1, . . . , y_T). Each output vector y_t is used to parameterise a predictive distribution Pr(x_{t+1} | y_t) over the possible next inputs x_{t+1}. The first element x_1 of every input sequence is always a null vector whose entries are all zero; the network therefore emits a prediction for x_2, the first real input, with no prior information. The network is 'deep' in both space and time, in the sense that every piece of information passing either vertically or horizontally through the computation graph will be acted on by multiple successive weight matrices and nonlinearities.

Note the 'skip connections' from the inputs to all hidden layers, and from all hidden layers to the outputs. These make it easier to train deep networks, by reducing the number of processing steps between the bottom of the network and the top, and thereby mitigating the 'vanishing gradient' problem [1]. In the special case that N = 1 the architecture reduces to an ordinary, single layer next step prediction RNN.
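As an illustration of this wiring, here is a minimal PyTorch sketch of the Fig. 1 architecture; the framework choice and class name are ours, since the paper predates such libraries. Each layer above the first receives the input sequence as well as the previous layer's hidden sequence, and every hidden layer contributes to the output, which is equivalent to summing per-layer output matrices W_{h^n y}.

```python
import torch
import torch.nn as nn

class DeepPredictionRNN(nn.Module):
    """Sketch of Fig. 1: N stacked recurrent layers with skip connections
    from the input to every hidden layer and from every hidden layer to
    the output. nn.LSTM is used here as the recurrent hidden layer."""
    def __init__(self, input_size, hidden_size, output_size, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        for n in range(num_layers):
            # layers n >= 2 see both x_t and h^{n-1}_t (input skip connections)
            in_size = input_size if n == 0 else input_size + hidden_size
            self.layers.append(nn.LSTM(in_size, hidden_size, batch_first=True))
        # concatenating all hidden layers and applying one linear map is
        # equivalent to summing W_{h^n y} h^n_t over layers (output skip connections)
        self.output = nn.Linear(num_layers * hidden_size, output_size)

    def forward(self, x):                       # x: (batch, T, input_size)
        hidden_seqs = []
        h = None
        for n, lstm in enumerate(self.layers):
            inp = x if n == 0 else torch.cat([x, h], dim=-1)
            h, _ = lstm(inp)                    # h: (batch, T, hidden_size)
            hidden_seqs.append(h)
        return self.output(torch.cat(hidden_seqs, dim=-1))  # one output vector per step
```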

The hidden layer activations are computed by iterating the following equations from t = 1 to T and from n = 2 to N:

h^1_t = H(W_{i h^1} x_t + W_{h^1 h^1} h^1_{t-1} + b^1_h)

h^n_t = H(W_{i h^n} x_t + W_{h^{n-1} h^n} h^{n-1}_t + W_{h^n h^n} h^n_{t-1} + b^n_h)

where the W terms denote weight matrices (e.g. W_{i h^n} is the weight matrix connecting the inputs to the nth hidden layer, W_{h^1 h^1} is the recurrent connection at the first hidden layer, and so on), the b terms denote bias vectors (e.g. b_y is the output bias vector) and H is the hidden layer function.

Given the hidden sequences, the output sequence is computed as follows:

ŷ_t = b_y + Σ_{n=1}^{N} W_{h^n y} h^n_t

y_t = Y(ŷ_t)

where Y is the output layer function. The complete network therefore defines a function, parameterised by the weight matrices, from input histories x_{1:t} to output vectors y_t.

The output vectors y_t are used to parameterise the predictive distribution Pr(x_{t+1} | y_t) for the next input. The form of Pr(x_{t+1} | y_t) must be chosen carefully to match the input data. In particular, finding a good predictive distribution for high-dimensional, real-valued data (usually referred to as density modelling) can be very challenging.

The probability given by the network to the input sequence x is

Pr(x) = Π_{t=1}^{T} Pr(x_{t+1} | y_t)

and the sequence loss L(x) used to train the network is the negative logarithm of Pr(x):

L(x) = − Σ_{t=1}^{T} log Pr(x_{t+1} | y_t)

The partial derivatives of the loss with respect to the network weights can be efficiently calculated with backpropagation through time [33] applied to the computation graph shown in Fig. 1, and the network can then be trained with gradient descent.
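For discrete data the predictive distribution Pr(x_{t+1} | y_t) is typically a softmax over the alphabet, and the loss above reduces to a summed cross-entropy. A minimal PyTorch training step along these lines (our sketch, not the paper's code) is:

```python
import torch.nn.functional as F

def train_step(model, optimiser, x, targets):
    """One gradient-descent step on L(x) = -sum_t log Pr(x_{t+1} | y_t).
    `model` maps inputs (batch, T, input_size) to logits (batch, T, alphabet);
    `targets` holds the indices of x_2 ... x_{T+1}, with shape (batch, T)."""
    logits = model(x)                       # y_t parameterises Pr(x_{t+1} | y_t)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), reduction='sum')
    optimiser.zero_grad()
    loss.backward()                         # backpropagation through time
    optimiser.step()
    return loss.item()
```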


2.1 Long Short-Term Memory

Figure 2: Long Short-term Memory Cell

In most RNNs the hidden layer function H is an elementwise application of a sigmoid function. However we have found that the Long Short-Term Memory (LSTM) architecture [16], which uses purpose-built memory cells to store information, is better at finding and exploiting long range dependencies in the data. Fig. 2 illustrates a single LSTM memory cell. For the version of LSTM used in this paper [7], H is implemented by the following composite function:

i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1})

f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1})

c_t = f_t c_{t-1} + i_t tanh(W_{xc} x_t + W_{hc} h_{t-1})

o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t)

h_t = o_t tanh(c_t)

where σ is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the hidden vector h. The weight matrix subscripts have the obvious meaning, for example W_{hi} is the hidden-input gate matrix, W_{xo} is the input-output gate matrix etc. The weight matrices from the cell to gate vectors (e.g. W_{ci}) are diagonal, so element m in each gate vector only receives input from element m of the cell vector. The bias terms (which are added to i, f, c and o) have been omitted for clarity.
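A direct NumPy transcription of the composite function above may make the gating explicit. The dictionary-of-weights layout and the shapes are our own illustration; the diagonal cell-to-gate matrices are represented as vectors and applied elementwise, and bias terms are left out here just as in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    """One step of the LSTM memory cell described above.
    W['xi'], W['hi'], ... are full matrices; W['ci'], W['cf'], W['co'] are
    vectors standing in for the diagonal cell-to-gate (peephole) matrices."""
    i = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + W['ci'] * c_prev)   # input gate
    f = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + W['cf'] * c_prev)   # forget gate
    c = f * c_prev + i * np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev)     # cell state
    o = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + W['co'] * c)        # output gate
    h = o * np.tanh(c)                                                 # hidden vector
    return h, c
```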


The original LSTM algorithm used a custom designed approximate gradient calculation that allowed the weights to be updated after every timestep [16]. However the full gradient can instead be calculated with backpropagation through time [11], the method used in this paper. One difficulty when training LSTM with the full gradient is that the derivatives sometimes become excessively large, leading to numerical problems. To prevent this, all the experiments in this paper clipped the derivative of the loss with respect to the network inputs to the LSTM layers (before the sigmoid and tanh functions are applied) to lie within a predefined range.
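One way to realise this kind of clipping in a modern autograd framework (an implementation assumption on our part, not the paper's code) is a backward hook that clips the gradient of the loss with respect to each LSTM layer's input activations. The bound of 10 below is purely illustrative, since the paper only states that the range is predefined.

```python
import torch

def clip_input_gradient(activations, bound=10.0):
    """Clamp dLoss/d(activations) to [-bound, bound] during the backward
    pass, where `activations` are the inputs to an LSTM layer (before the
    sigmoid and tanh nonlinearities are applied)."""
    if activations.requires_grad:
        activations.register_hook(lambda grad: grad.clamp(-bound, bound))
    return activations
```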
