
Paper: Translation and Interpretation of "Spatial Transformer Networks" (Part 2)

3.1 Localisation Network  

The localisation network takes the input feature map U ∈ ℝ^{H×W×C}, with width W, height H and C channels, and outputs θ, the parameters of the transformation T_θ to be applied to the feature map: θ = f_loc(U). The size of θ can vary depending on the transformation type that is parameterised, e.g. for an affine transformation θ is 6-dimensional as in (1). The localisation network function f_loc() can take any form, such as a fully-connected network or a convolutional network, but should include a final regression layer to produce the transformation parameters θ.
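As an illustration of this description, here is a minimal localisation-network sketch, assuming a PyTorch implementation (the framework, the layer sizes, and the AffineLocNet name are my own choices, not the paper's). Initialising the final regression layer to the identity transform mirrors the initialisation reported for the paper's experiments in Section 4.2.

```python
import torch
import torch.nn as nn

class AffineLocNet(nn.Module):
    """Minimal localisation network f_loc: regresses the 6 affine parameters theta.

    A sketch only -- the paper allows any architecture here, as long as the
    final layer is a regression onto theta.
    """
    def __init__(self, in_channels: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.regressor = nn.Linear(10, 6)  # final regression layer producing theta
        # Initialise the regression layer to output the identity transform,
        # so the module starts as a no-op warp (as in the paper's experiments).
        nn.init.zeros_(self.regressor.weight)
        self.regressor.bias.data = torch.tensor([1., 0., 0., 0., 1., 0.])

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        h = self.features(u).flatten(1)          # (N, 10)
        return self.regressor(h).view(-1, 2, 3)  # theta as a 2x3 affine matrix per sample
```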

3.2 Parameterised Sampling Grid  

To perform a warping of the input feature map, each output pixel is computed by applying a sampling kernel centered at a particular location in the input feature map (this is described fully in the next section). By pixel we refer to an element of a generic feature map, not necessarily an image. In general, the output pixels are defined to lie on a regular grid G = {G_i} of pixels G_i = (x_i^t, y_i^t), forming an output feature map V ∈ ℝ^{H′×W′×C}, where H′ and W′ are the height and width of the grid, and C is the number of channels, which is the same in the input and output.

For the affine case, the pointwise transformation is

\[
\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix}
= \mathcal{T}_\theta(G_i)
= A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
= \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix}
\begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \qquad (1)
\]

where (x_i^t, y_i^t) are the target coordinates of the regular grid in the output feature map, (x_i^s, y_i^s) are the source coordinates in the input feature map that define the sample points, and A_θ is the affine transformation matrix. We use height- and width-normalised coordinates, such that −1 ≤ x_i^t, y_i^t ≤ 1 when within the spatial bounds of the output, and −1 ≤ x_i^s, y_i^s ≤ 1 when within the spatial bounds of the input. The source/target transformation and sampling is equivalent to the standard texture mapping and coordinates used in graphics [8].
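As a concrete sketch of the grid generator for the affine case (my own illustration in PyTorch; the function name and shape conventions are assumptions, and torch.nn.functional.affine_grid provides equivalent built-in functionality):

```python
import torch

def affine_sampling_grid(theta: torch.Tensor, H_out: int, W_out: int) -> torch.Tensor:
    """Map a regular output grid G through A_theta to get source coordinates.

    theta: (N, 2, 3) affine matrices. Returns (N, H_out, W_out, 2) source (x_s, y_s)
    coordinates in [-1, 1] (values outside that range fall outside the input).
    Equivalent in spirit to torch.nn.functional.affine_grid.
    """
    ys = torch.linspace(-1.0, 1.0, H_out)
    xs = torch.linspace(-1.0, 1.0, W_out)
    yt, xt = torch.meshgrid(ys, xs, indexing="ij")             # target coords (y_t, x_t)
    ones = torch.ones_like(xt)
    grid = torch.stack([xt, yt, ones], dim=-1).reshape(-1, 3)  # (H_out*W_out, 3) homogeneous
    src = torch.einsum("nij,pj->npi", theta, grid)             # A_theta * (x_t, y_t, 1)^T
    return src.reshape(-1, H_out, W_out, 2)                    # (x_s, y_s) per output pixel
```

The manual version is only meant to make the correspondence with the equation explicit; in practice the built-in affine_grid would normally be used.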

The class of transformations T_θ may be more constrained, such as that used for attention,

\[
A_\theta = \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix} \qquad (2)
\]

allowing cropping, translation, and isotropic scaling by varying s, t_x, and t_y. The transformation T_θ can also be more general, such as a plane projective transformation with 8 parameters, a piecewise affine transformation, or a thin plate spline. Indeed, the transformation can have any parameterised form, provided that it is differentiable with respect to the parameters – this crucially allows gradients to be backpropagated from the sample points T_θ(G_i) to the localisation network output θ. If the transformation is parameterised in a structured, low-dimensional way, this reduces the complexity of the task assigned to the localisation network. For instance, a generic class of structured and differentiable transformations, which is a superset of attention, affine, projective, and thin plate spline transformations, is T_θ = M_θ B, where B is a target grid representation (e.g. in (1), B is the regular grid G in homogeneous coordinates), and M_θ is a matrix parameterised by θ. In this case it is possible not only to learn how to predict θ for a sample, but also to learn B for the task at hand.
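A small illustration of this constrained parameterisation (my own sketch, in the same assumed PyTorch setting as above): the localisation network only has to regress three numbers, which are then expanded into the 2×3 matrix of (2).

```python
import torch

def attention_theta(s: torch.Tensor, tx: torch.Tensor, ty: torch.Tensor) -> torch.Tensor:
    """Build the constrained attention transform of Eq. (2) from 3 parameters.

    s, tx, ty: (N,) tensors. Returns (N, 2, 3) matrices [[s, 0, tx], [0, s, ty]].
    """
    zero = torch.zeros_like(s)
    row1 = torch.stack([s, zero, tx], dim=1)
    row2 = torch.stack([zero, s, ty], dim=1)
    return torch.stack([row1, row2], dim=1)
```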

3.3 Differentiable Image Sampling  

To perform a spatial transformation of the input feature map, a sampler must take the set of sampling points T_θ(G), along with the input feature map U, and produce the sampled output feature map V. Each (x_i^s, y_i^s) coordinate in T_θ(G) defines the spatial location in the input where a sampling kernel is applied to get the value at a particular pixel in the output V. This can be written as

\[
V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \, k(x_i^s - m;\, \Phi_x) \, k(y_i^s - n;\, \Phi_y)
\quad \forall i \in [1 \ldots H'W'] \;\; \forall c \in [1 \ldots C] \qquad (3)
\]

where Φ_x and Φ_y are the parameters of a generic sampling kernel k() which defines the image interpolation (e.g. bilinear), U_{nm}^c is the value at location (n, m) in channel c of the input, and V_i^c is the output value for pixel i at location (x_i^t, y_i^t) in channel c. Note that the sampling is done identically for each channel of the input, so every channel is transformed in an identical way (this preserves spatial consistency between channels). In theory, any sampling kernel can be used, as long as (sub-)gradients can be defined with respect to x_i^s and y_i^s. For example, using the integer sampling kernel reduces (3) to

\[
V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \, \delta(\lfloor x_i^s + 0.5 \rfloor - m) \, \delta(\lfloor y_i^s + 0.5 \rfloor - n) \qquad (4)
\]

where ⌊x + 0.5⌋ rounds x to the nearest integer and δ() is the Kronecker delta function. This sampling kernel equates to just copying the value at the pixel nearest to (x_i^s, y_i^s) to the output location (x_i^t, y_i^t). Alternatively, a bilinear sampling kernel can be used, giving

\[
V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \, \max(0, 1 - |x_i^s - m|) \, \max(0, 1 - |y_i^s - n|) \qquad (5)
\]

To allow backpropagation of the loss through this sampling mechanism we can define the gradients with respect to U and G. For bilinear sampling (5) the partial derivatives are

\[
\frac{\partial V_i^c}{\partial U_{nm}^c} = \sum_{n}^{H} \sum_{m}^{W} \max(0, 1 - |x_i^s - m|) \, \max(0, 1 - |y_i^s - n|) \qquad (6)
\]

\[
\frac{\partial V_i^c}{\partial x_i^s} = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \, \max(0, 1 - |y_i^s - n|)
\begin{cases} 0 & \text{if } |m - x_i^s| \ge 1 \\ 1 & \text{if } m \ge x_i^s \\ -1 & \text{if } m < x_i^s \end{cases} \qquad (7)
\]

and similarly to (7) for ∂V_i^c/∂y_i^s.

This gives us a (sub-)differentiable sampling mechanism, allowing loss gradients to flow back not only to the input feature map (6), but also to the sampling grid coordinates (7), and therefore back to the transformation parameters θ and the localisation network, since ∂x_i^s/∂θ and ∂y_i^s/∂θ can be easily derived from (1), for example. Due to discontinuities in the sampling functions, sub-gradients must be used. This sampling mechanism can be implemented very efficiently on GPU, by ignoring the sum over all input locations and instead just looking at the kernel support region for each output pixel.
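To make the bilinear case concrete, here is a hedged sketch of a sampler implementing (5), again assuming PyTorch (bilinear_sample and the shape conventions are mine; in practice torch.nn.functional.grid_sample is the standard, optimised implementation, and autograd supplies the sub-gradients of (6) and (7) automatically):

```python
import torch

def bilinear_sample(U: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
    """Differentiable bilinear sampling, a sketch of Eq. (5).

    U:    (N, C, H, W) input feature map.
    grid: (N, H_out, W_out, 2) source (x_s, y_s) coordinates in [-1, 1],
          e.g. produced by affine_sampling_grid above.
    Returns V: (N, C, H_out, W_out).
    """
    N, C, H, W = U.shape
    # Map normalised coordinates to pixel indices.
    x = (grid[..., 0] + 1) * (W - 1) / 2          # (N, H_out, W_out)
    y = (grid[..., 1] + 1) * (H - 1) / 2
    x0, y0 = x.floor().long(), y.floor().long()   # top-left neighbour
    V = torch.zeros(N, C, *x.shape[1:], dtype=U.dtype, device=U.device)
    for dy in (0, 1):                             # the kernel support is a 2x2 pixel patch
        for dx in (0, 1):
            xi = (x0 + dx).clamp(0, W - 1)
            yi = (y0 + dy).clamp(0, H - 1)
            # weight = max(0, 1 - |x_s - m|) * max(0, 1 - |y_s - n|)
            w = (1 - (x - xi.float()).abs()).clamp(min=0) * \
                (1 - (y - yi.float()).abs()).clamp(min=0)
            idx = (yi * W + xi).view(N, 1, -1).expand(-1, C, -1)   # flat spatial index
            V = V + (U.view(N, C, -1).gather(2, idx) * w.view(N, 1, -1)).view_as(V)
    return V
```

Only the four pixels in the kernel support contribute to each output value, which is exactly the GPU-friendly simplification mentioned above.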

3.4 Spatial Transformer Networks  

The combination of the localisation network, grid generator, and sampler forms a spatial transformer (Fig. 2). This is a self-contained module which can be dropped into a CNN architecture at any point, and in any number, giving rise to spatial transformer networks. This module is computationally very fast and does not impair the training speed, causing very little time overhead when used naively, and even speedups in attentive models due to subsequent downsampling that can be applied to the output of the transformer.

Placing spatial transformers within a CNN allows the network to learn how to actively transform the feature maps to help minimise the overall cost function of the network during training. The knowledge of how to transform each training sample is compressed and cached in the weights of the localisation network (and also the weights of the layers previous to a spatial transformer) during training. For some tasks, it may also be useful to feed the output of the localisation network, θ, forward to the rest of the network, as it explicitly encodes the transformation, and hence the pose, of a region or object.
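Putting the three pieces together, a minimal self-contained module might look like the following sketch (my own assembly, reusing the hypothetical AffineLocNet from above and PyTorch's built-in affine_grid/grid_sample as the grid generator and bilinear sampler):

```python
from typing import Optional, Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Localisation net + grid generator + sampler, for the affine case.

    A sketch only: forward(u) returns the warped feature map V. The grid
    generation and bilinear sampling correspond to Sections 3.2 and 3.3.
    """
    def __init__(self, in_channels: int, out_size: Optional[Tuple[int, int]] = None):
        super().__init__()
        self.loc_net = AffineLocNet(in_channels)  # hypothetical net from the earlier sketch
        self.out_size = out_size                  # (H', W'); None keeps the input size

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        theta = self.loc_net(u)                                    # (N, 2, 3)
        n, c, h, w = u.shape
        h_out, w_out = self.out_size if self.out_size else (h, w)
        grid = F.affine_grid(theta, [n, c, h_out, w_out], align_corners=False)
        return F.grid_sample(u, grid, mode="bilinear", align_corners=False)
```

Because the module only adds a small localisation network and one sampling pass, dropping it in front of (or inside) an existing CNN leaves the rest of the architecture unchanged.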

Table 1: Left: The percentage errors for different models on different distorted MNIST datasets. The different distorted MNIST datasets we test are TC: translated and cluttered, R: rotated, RTS: rotated, translated, and scaled, P: projective distortion, E: elastic distortion. All the models used for each experiment have the same number of parameters, and the same base structure for all experiments. Right: Some example test images where a spatial transformer network correctly classifies the digit but a CNN fails. (a) The inputs to the networks. (b) The transformations predicted by the spatial transformers, visualised by the grid T_θ(G). (c) The outputs of the spatial transformers. E and RTS examples use thin plate spline spatial transformers (ST-CNN TPS), while R examples use affine spatial transformers (ST-CNN Aff) with the angles of the affine transformations given. For videos showing animations of these experiments and more, see https://goo.gl/qdEhUu.

It is also possible to use spatial transformers to downsample or oversample a feature map, as one can define the output dimensions H′ and W′ to be different to the input dimensions H and W. However, with sampling kernels with a fixed, small spatial support (such as the bilinear kernel), downsampling with a spatial transformer can cause aliasing effects.

Finally, it is possible to have multiple spatial transformers in a CNN. Placing multiple spatial transformers at increasing depths of a network allows transformations of increasingly abstract representations, and also gives the localisation networks potentially more informative representations to base the predicted transformation parameters on. One can also use multiple spatial transformers in parallel – this can be useful if there are multiple objects or parts of interest in a feature map that should be focussed on individually. A limitation of this architecture in a purely feed-forward network is that the number of parallel spatial transformers limits the number of objects that the network can model.
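As a small usage sketch (hypothetical, building on the SpatialTransformer class above), downsampling and parallel transformers just amount to choosing the output size and instantiating several modules:

```python
import torch

u = torch.randn(8, 32, 56, 56)                         # a batch of feature maps
downsampler = SpatialTransformer(32, out_size=(28, 28))
v = downsampler(u)                                     # (8, 32, 28, 28): H', W' differ from H, W

# Two transformers in parallel, e.g. one per object part of interest;
# each produces its own warp of the same feature map.
parts = [SpatialTransformer(32) for _ in range(2)]
crops = [st(u) for st in parts]                        # each (8, 32, 56, 56)
```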

4 Experiments  

In this section we explore the use of spatial transformer networks on a number of supervised learning tasks. In Sect. 4.1 we begin with experiments on distorted versions of the MNIST handwriting dataset, showing the ability of spatial transformers to improve classification performance through actively transforming the input images. In Sect. 4.2 we test spatial transformer networks on a challenging real-world dataset, Street View House Numbers [25], for number recognition, showing state-of-the-art results using multiple spatial transformers embedded in the convolutional stack of a CNN. Finally, in Sect. 4.3, we investigate the use of multiple parallel spatial transformers for fine-grained classification, showing state-of-the-art performance on the CUB-200-2011 birds dataset [38] by discovering object parts and learning to attend to them. Further experiments on MNIST addition and co-localisation can be found in Appendix A.

4.1 Distorted MNIST  

In this section we use the MNIST handwriting dataset as a testbed for exploring the range of transformations to which a network can learn invariance by using a spatial transformer.

We begin with experiments where we train different neural network models to classify MNIST data that has been distorted in various ways: rotation (R); rotation, scale and translation (RTS); projective transformation (P); and elastic warping (E) – note that elastic warping is destructive and cannot be inverted in some cases. The full details of the distortions used to generate this data are given in Appendix A. We train baseline fully-connected (FCN) and convolutional (CNN) neural networks, as well as networks with spatial transformers acting on the input before the classification network (ST-FCN and ST-CNN). The spatial transformer networks all use bilinear sampling, but variants use different transformation functions: an affine transformation (Aff), projective transformation (Proj), and a 16-point thin plate spline transformation (TPS) [2]. The CNN models include two max-pooling layers. All networks have approximately the same number of parameters, are trained with identical optimisation schemes (backpropagation, SGD, scheduled learning rate decrease, with a multinomial cross entropy loss), and all have three weight layers in the classification network.

Table 2: Left: The sequence error for SVHN multi-digit recognition on crops of 64 × 64 pixels (64px), and inflated crops of 128 × 128 (128px) which include more background. *The best reported result from [1] uses model averaging and Monte Carlo averaging, whereas the results from the other models are from a single forward pass of a single model. Right: (a) The schematic of the ST-CNN Multi model. The transformation applied by each spatial transformer (ST) acts on the convolutional feature map produced by the previous layer. (b) The result of multiplying out the affine transformations predicted by the four spatial transformers in ST-CNN Multi, visualised on the input image.

The results of these experiments are shown in Table 1 (left). Looking at any particular type of distortion of the data, it is clear that a spatial transformer enabled network outperforms its counterpart base network. For the case of rotation, translation, and scale distortion (RTS), the ST-CNN achieves 0.5% and 0.6% error depending on the class of transform used for T_θ, whereas a CNN, with two max-pooling layers to provide spatial invariance, achieves 0.8% error. This is in fact the same error that the ST-FCN achieves, which is without a single convolution or max-pooling layer in its network, showing that using a spatial transformer is an alternative way to achieve spatial invariance. ST-CNN models consistently perform better than ST-FCN models due to max-pooling layers in the ST-CNN providing even more spatial invariance, and convolutional layers better modelling local structure. We also test our models in a noisy environment, on 60 × 60 images with translated MNIST digits and background clutter (see Fig. 1 third row for an example): an FCN gets 13.2% error, a CNN gets 3.5% error, while an ST-FCN gets 2.0% error and an ST-CNN gets 1.7% error.

Looking at the results between different classes of transformation, the thin plate spline transformation (TPS) is the most powerful, being able to reduce error on elastically deformed digits by reshaping the input into a prototype instance of the digit, reducing the complexity of the task for the classification network, and it does not overfit on simpler data, e.g. R. Interestingly, the transformation of inputs for all ST models leads to a "standard" upright posed digit – this is the mean pose found in the training data. In Table 1 (right), we show the transformations performed for some test cases where a CNN is unable to correctly classify the digit, but a spatial transformer network can. Further test examples are visualised in an animation here.

4.2 Street View House Numbers  

We now test our spatial transformer networks on a challenging real-world dataset, Street View House Numbers (SVHN) [25]. This dataset contains around 200k real world images of house numbers, with the task to recognise the sequence of numbers in each image. There are between 1 and 5 digits in each image, with a large variability in scale and spatial arrangement.

We follow the experimental setup as in [1, 13], where the data is preprocessed by taking 64 × 64 crops around each digit sequence. We also use an additional, more loosely cropped 128 × 128 dataset as in [1]. We train a baseline character sequence CNN model with 11 hidden layers leading to five independent softmax classifiers, each one predicting the digit at a particular position in the sequence. This is the character sequence model used in [19], where each classifier includes a null-character output to model variable length sequences. This model matches the results obtained in [13].

We extend this baseline CNN to include a spatial transformer immediately following the input (ST-CNN Single), where the localisation network is a four-layer CNN. We also define another extension where, before each of the first four convolutional layers of the baseline CNN, we insert a spatial transformer (ST-CNN Multi), where the localisation networks are all two-layer fully connected networks with 32 units per layer. In the ST-CNN Multi model, the spatial transformer before the first convolutional layer acts on the input image as in the previous experiments, however the subsequent spatial transformers deeper in the network act on the convolutional feature maps, predicting a transformation from them and transforming these feature maps (this is visualised in Table 2 (right) (a)). This allows deeper spatial transformers to predict a transformation based on richer features rather than the raw image. All networks are trained from scratch with SGD and dropout [17], with randomly initialised weights, except for the regression layers of the spatial transformers, which are initialised to predict the identity transform. Affine transformations and bilinear sampling kernels are used for all spatial transformer networks in these experiments.
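A rough structural sketch of this interleaving (entirely my own reconstruction of the description above, with placeholder channel widths and the earlier hypothetical SpatialTransformer module standing in for the paper's two-layer fully connected localisation networks; it is not the paper's exact SVHN architecture):

```python
import torch.nn as nn

class STConvBlock(nn.Module):
    """One stage of an ST-CNN Multi style stack: a spatial transformer acting on
    the feature map produced so far, followed by a convolutional block.
    Layer sizes here are illustrative only."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.st = SpatialTransformer(in_ch)   # hypothetical module from the earlier sketch
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.conv(self.st(x))

# Four such stages in front of the rest of the digit-sequence classifier.
stem = nn.Sequential(STConvBlock(3, 32), STConvBlock(32, 48),
                     STConvBlock(48, 64), STConvBlock(64, 128))
```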

Table 3: Left: The accuracy on the CUB-200-2011 bird classification dataset. Spatial transformer networks with two spatial transformers (2×ST-CNN) and four spatial transformers (4×ST-CNN) in parallel achieve higher accuracy. 448px resolution images can be used with the ST-CNN without an increase in computational cost, due to downsampling to 224px after the transformers. Right: The transformations predicted by the spatial transformers of 2×ST-CNN (top row) and 4×ST-CNN (bottom row) on the input image. Notably for the 2×ST-CNN, one of the transformers (shown in red) learns to detect heads, while the other (shown in green) detects the body, and similarly for the 4×ST-CNN.

The results of this experiment are shown in Table 2 (left) – the spatial transformer models obtain state-of-the-art results, reaching 3.6% error on 64 × 64 images compared to the previous state-of-the-art of 3.9% error. Interestingly, on 128 × 128 images, while other methods degrade in performance, an ST-CNN achieves 3.9% error, whereas the previous state of the art at 4.5% error is a recurrent attention model that uses an ensemble of models with Monte Carlo averaging – in contrast the ST-CNN models require only a single forward pass of a single model. This accuracy is achieved due to the fact that the spatial transformers crop and rescale the parts of the feature maps that correspond to the digit, focussing resolution and network capacity only on these areas (see Table 2 (right) (b) for some examples). In terms of computation speed, the ST-CNN Multi model is only 6% slower (forward and backward pass) than the CNN.

4.3 Fine-Grained Classification  

In this section, we use a spatial transformer network with multiple transformers in parallel to perform fine-grained bird classification. We evaluate our models on the CUB-200-2011 birds dataset [38], containing 6k training images and 5.8k test images, covering 200 species of birds. The birds appear at a range of scales and orientations, are not tightly cropped, and require detailed texture and shape analysis to distinguish. In our experiments, we only use image class labels for training.

We consider a strong baseline CNN model – an Inception architecture with batch normalisation [18], pre-trained on ImageNet [26] and fine-tuned on CUB – which by itself achieves the state-of-the-art accuracy of 82.3% (the previous best result is 81.0% [30]). We then train a spatial transformer network, ST-CNN, which contains 2 or 4 parallel spatial transformers, parameterised for attention and acting on the input image. Discriminative image parts, captured by the transformers, are passed to the part description sub-nets (each of which is also initialised by Inception). The resulting part representations are concatenated and classified with a single softmax layer. The whole architecture is trained on image class labels end-to-end with backpropagation (full details in Appendix A).
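A compact sketch of this parallel arrangement (my own reconstruction under the assumptions above; part_net_fn stands in for the Inception-initialised part description sub-nets, which are not reproduced here):

```python
import torch
import torch.nn as nn

class PartBasedSTClassifier(nn.Module):
    """K parallel spatial transformers on the input image, one part sub-net per
    crop, concatenated part features, and a single softmax classifier.
    (The paper constrains the transformers to the attention transform of Eq. (2);
    the full affine module from the earlier sketch is reused here for brevity.)"""
    def __init__(self, part_net_fn, num_parts: int, feat_dim: int, num_classes: int):
        super().__init__()
        self.transformers = nn.ModuleList(
            [SpatialTransformer(3, out_size=(224, 224)) for _ in range(num_parts)])
        self.part_nets = nn.ModuleList([part_net_fn() for _ in range(num_parts)])
        self.classifier = nn.Linear(num_parts * feat_dim, num_classes)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # e.g. image is (N, 3, 448, 448); each transformer outputs a 224x224 crop.
        feats = [net(st(image)) for st, net in zip(self.transformers, self.part_nets)]
        return self.classifier(torch.cat(feats, dim=1))   # logits; softmax in the loss
```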

The results are shown in Table 3 (left). The ST-CNN achieves an accuracy of 84.1%, outperforming the baseline by 1.8%. It should be noted that there is a small (22/5794) overlap between the ImageNet training set and the CUB-200-2011 test set – removing these images from the test set results in 84.0% accuracy with the same ST-CNN. In the visualisations of the transforms predicted by the 2×ST-CNN (Table 3 (right)) one can see that interesting behaviour has been learnt: one spatial transformer (red) has learnt to become a head detector, while the other (green) fixates on the central part of the body of a bird. The resulting output from the spatial transformers for the classification network is a somewhat pose-normalised representation of a bird. While previous work such as [3] explicitly defines parts of the bird, training separate detectors for these parts with supplied keypoint training data, the ST-CNN is able to discover and learn part detectors in a data-driven manner without any additional supervision. In addition, the use of spatial transformers allows us to use 448px resolution input images without any impact on performance, as the output of the transformed 448px images is downsampled to 224px before being processed.

5 Conclusion

In this paper we introduced a new self-contained module for neural networks – the spatial transformer. This module can be dropped into a network and perform explicit spatial transformations of features, opening up new ways for neural networks to model data, and is learnt in an end-to-end fashion, without making any changes to the loss function. While CNNs provide an incredibly strong baseline, we see gains in accuracy using spatial transformers across multiple tasks, resulting in state-of-the-art performance. Furthermore, the regressed transformation parameters from the spatial transformer are available as an output and could be used for subsequent tasks. While we only explore feed-forward networks in this work, early experiments show spatial transformers to be powerful in recurrent models, and useful for tasks requiring the disentangling of object reference frames, as well as easily extendable to 3D transformations (see Appendix A.3).
