Abstract
Convolutional Neural Networks define an exceptionally powerful class of models, but are still limited by their inability to be spatially invariant to the input data in a computationally and parameter efficient manner. In this work we introduce a new learnable module, the Spatial Transformer, which explicitly allows the spatial manipulation of data within the network. This differentiable module can be inserted into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps, conditional on the feature map itself, without any extra training supervision or modification to the optimisation process. We show that the use of spatial transformers results in models which learn invariance to translation, scale, rotation and more generic warping, resulting in state-of-the-art performance on several benchmarks, and for a number of classes of transformations.
1 Introduction
Over recent years, the landscape of computer vision has been drastically altered and pushed forward through the adoption of a fast, scalable, end-to-end learning framework, the Convolutional Neural Network (CNN) [21]. Though not a recent invention, we now see a cornucopia of CNN-based models achieving state-of-the-art results in classification [19, 28, 35], localisation [31, 37], semantic segmentation [24], and action recognition [12, 32] tasks, amongst others.
A desirable property of a system which is able to reason about images is to disentangle object pose and part deformation from texture and shape. The introduction of local max-pooling layers in CNNs has helped to satisfy this property by allowing a network to be somewhat spatially invariant to the position of features. However, due to the typically small spatial support for max-pooling (e.g. 2 × 2 pixels) this spatial invariance is only realised over a deep hierarchy of max-pooling and convolutions, and the intermediate feature maps (convolutional layer activations) in a CNN are not actually invariant to large transformations of the input data [6, 22]. This limitation of CNNs is due to having only a limited, pre-defined pooling mechanism for dealing with variations in the spatial arrangement of data.
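The limited reach of a single pooling layer can be illustrated with a small sketch. It assumes non-overlapping 2 × 2 max-pooling on an 8 × 8 image containing one active pixel; the helper names `max_pool_2x2` and `single_pixel_image` are hypothetical and chosen only for this illustration.

```python
def max_pool_2x2(img):
    """Non-overlapping 2x2 max-pooling over a list-of-lists image with even sides."""
    h, w = len(img), len(img[0])
    return [
        [max(img[i][j], img[i][j + 1], img[i + 1][j], img[i + 1][j + 1])
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]


def single_pixel_image(row, col, size=8):
    """A size x size image of zeros with one active pixel at (row, col)."""
    return [[1.0 if (i, j) == (row, col) else 0.0 for j in range(size)]
            for i in range(size)]


base = max_pool_2x2(single_pixel_image(2, 2))
shift_one = max_pool_2x2(single_pixel_image(2, 3))   # stays inside the same 2x2 window
shift_four = max_pool_2x2(single_pixel_image(2, 6))  # lands in a different window

print(base == shift_one)   # True: the one-pixel shift is absorbed
print(base == shift_four)  # False: the larger shift changes the pooled map
```

Even the one-pixel case depends on how the shift aligns with the pooling windows, which is consistent with the observation above that invariance to large transformations only emerges over a deep hierarchy of such layers.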
In this work we introduce a Spatial Transformer module that can be included in a standard neural network architecture to provide spatial transformation capabilities. The action of the spatial transformer is conditioned on individual data samples, with the appropriate behaviour learnt during training for the task in question (without extra supervision). Unlike pooling layers, where the receptive fields are fixed and local, the spatial transformer module is a dynamic mechanism that can actively spatially transform an image (or a feature map) by producing an appropriate transformation for each input sample. The transformation is then performed on the entire feature map (non-locally) and can include scaling, cropping, rotations, as well as non-rigid deformations. This allows networks which include spatial transformers to not only select regions of an image that are most relevant (attention), but also to transform those regions to a canonical, expected pose to simplify recognition in the following layers. Notably, spatial transformers can be trained with standard back-propagation, allowing for end-to-end training of the models they are injected into.
Figure 1: The result of using a spatial transformer as the first layer of a fully-connected network trained for distorted MNIST digit classification. (a) The input to the spatial transformer network is an image of an MNIST digit that is distorted with random translation, scale, rotation, and clutter. (b) The localisation network of the spatial transformer predicts a transformation to apply to the input image. (c) The output of the spatial transformer, after applying the transformation. (d) The classification prediction produced by the subsequent fully-connected network on the output of the spatial transformer. The spatial transformer network (a CNN including a spatial transformer module) is trained end-to-end with only class labels – no knowledge of the groundtruth transformations is given to the system.
Spatial transformers can be incorporated into CNNs to benefit multifarious tasks, for example: (i) image classification: suppose a CNN is trained to perform multi-way classification of images according to whether they contain a particular digit – where the position and size of the digit may vary significantly with each sample (and are uncorrelated with the class); a spatial transformer that crops out and scale-normalizes the appropriate region can simplify the subsequent classification task, and lead to superior classification performance, see Fig. 1; (ii) co-localisation: given a set of images containing different instances of the same (but unknown) class, a spatial transformer can be used to localise them in each image; (iii) spatial attention: a spatial transformer can be used for tasks requiring an attention mechanism, such as in [14, 39], but is more flexible and can be trained purely with backpropagation without reinforcement learning. A key benefit of using attention is that transformed (and so attended), lower resolution inputs can be used in favour of higher resolution raw inputs, resulting in increased computational efficiency.
The rest of the paper is organised as follows: Sect. 2 discusses some work related to our own, Sect. 3 introduces the formulation and implementation of the spatial transformer, and Sect. 4 gives the results of experiments. Additional experiments and implementation details are given in Appendix A.
2 Related Work
In this section we discuss the prior work related to the paper, covering the central ideas of modelling transformations with neural networks [15, 16, 36], learning and analysing transformation-invariant representations [4, 6, 10, 20, 22, 33], as well as attention and detection mechanisms for feature selection [1, 7, 11, 14, 27, 29].
Early work by Hinton [15] looked at assigning canonical frames of reference to object parts, a theme which recurred in [16] where 2D affine transformations were modeled to create a generative model composed of transformed parts. The targets of the generative training scheme are the transformed input images, with the transformations between input images and targets given as an additional input to the network. The result is a generative model which can learn to generate transformed images of objects by composing parts. The notion of a composition of transformed parts is taken further by Tieleman [36], where learnt parts are explicitly affine-transformed, with the transform predicted by the network. Such generative capsule models are able to learn discriminative features for classification from transformation supervision.
The invariance and equivariance of CNN representations to input image transformations are studied in [22] by estimating the linear relationships between representations of the original and transformed images. Cohen & Welling [6] analyse this behaviour in relation to symmetry groups, which is also exploited in the architecture proposed by Gens & Domingos [10], resulting in feature maps that are more invariant to symmetry groups. Other attempts to design transformation invariant representations are scattering networks [4], and CNNs that construct filter banks of transformed filters [20, 33]. Stollenga et al. [34] use a policy based on a network’s activations to gate the responses of the network’s filters for a subsequent forward pass of the same image and so can allow attention to specific features. In this work, we aim to achieve invariant representations by manipulating the data rather than the feature extractors, something that was done for clustering in [9].
Figure 2: The architecture of a spatial transformer module. The input feature map U is passed to a localisation network which regresses the transformation parameters θ. The regular spatial grid G over V is transformed to the sampling grid Tθ(G), which is applied to U as described in Sect. 3.3, producing the warped output feature map V . The combination of the localisation network and sampling mechanism defines a spatial transformer.
Neural networks with selective attention manipulate the data by taking crops, and so are able to learn translation invariance. Works such as [1, 29] are trained with reinforcement learning to avoid the need for a differentiable attention mechanism, while [14] uses a differentiable attention mechanism by utilising Gaussian kernels in a generative model. The work by Girshick et al. [11] uses a region proposal algorithm as a form of attention, and [7] shows that it is possible to regress salient regions with a CNN. The framework we present in this paper can be seen as a generalisation of differentiable attention to any spatial transformation.
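To illustrate how a crop fits into this more general view, one can write a crop-style attention window as a constrained affine map from output grid coordinates to input coordinates. This is a sketch only: the symbols s, t_x, t_y and the superscripts s (source) and t (target) are notation assumed here, not defined in this section.

```latex
\begin{pmatrix} x^{s} \\ y^{s} \end{pmatrix}
=
\begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix}
\begin{pmatrix} x^{t} \\ y^{t} \\ 1 \end{pmatrix}
```

Under this assumed parametrisation, with coordinates normalised so that the output grid spans a fixed range, a scale s < 1 makes the output sample a sub-window of the input centred at (t_x, t_y), which is a crop with isotropic zoom. Freeing all six matrix entries would additionally admit rotation, shear and anisotropic scaling.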
3 Spatial Transformers
In this section we describe the formulation of a spatial transformer. This is a differentiable module which applies a spatial transformation to a feature map during a single forward pass, where the transformation is conditioned on the particular input, producing a single output feature map. For multi-channel inputs, the same warping is applied to each channel. For simplicity, in this section we consider single transforms and single outputs per transformer; however, we can generalise to multiple transformations, as shown in experiments.
The spatial transformer mechanism is split into three parts, shown in Fig. 2. In order of computation, first a localisation network (Sect. 3.1) takes the input feature map, and through a number of hidden layers outputs the parameters of the spatial transformation that should be applied to the feature map – this gives a transformation conditional on the input. Then, the predicted transformation parameters are used to create a sampling grid, which is a set of points where the input map should be sampled to produce the transformed output. This is done by the grid generator, described in Sect. 3.2. Finally, the feature map and the sampling grid are taken as inputs to the sampler, producing the output map sampled from the input at the grid points (Sect. 3.3).
The combination of these three components forms a spatial transformer; each is described in more detail in the following sections.
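The three-stage flow described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the transformation is assumed to be affine, coordinates are assumed to be normalised to [-1, 1], sampling is assumed to be bilinear, and `localisation_net` is a hand-written centroid rule standing in for what would really be a learnt network. All function names are hypothetical, and only a single channel is handled.

```python
import math


def localisation_net(U):
    """Hypothetical stand-in for the learnt localisation network.

    Returns 2x3 affine parameters theta that shift the sampling grid onto
    the intensity centroid of U (normalised coordinates in [-1, 1]).
    """
    H, W = len(U), len(U[0])
    total = sum(sum(row) for row in U) or 1.0  # guard against an all-zero input
    cx = sum(U[i][j] * (2.0 * j / (W - 1) - 1.0)
             for i in range(H) for j in range(W)) / total
    cy = sum(U[i][j] * (2.0 * i / (H - 1) - 1.0)
             for i in range(H) for j in range(W)) / total
    return [[1.0, 0.0, cx], [0.0, 1.0, cy]]


def grid_generator(theta, out_h, out_w):
    """Map every point of the regular output grid through the affine theta."""
    grid = []
    for i in range(out_h):
        y_t = 2.0 * i / (out_h - 1) - 1.0
        row = []
        for j in range(out_w):
            x_t = 2.0 * j / (out_w - 1) - 1.0
            x_s = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
            y_s = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
            row.append((x_s, y_s))
        grid.append(row)
    return grid


def sampler(U, grid):
    """Bilinearly sample U at the grid points; points outside U read as zero."""
    H, W = len(U), len(U[0])

    def pixel(i, j):
        return U[i][j] if 0 <= i < H and 0 <= j < W else 0.0

    V = []
    for grid_row in grid:
        out_row = []
        for x_s, y_s in grid_row:
            # Convert normalised coordinates back to pixel coordinates.
            px = (x_s + 1.0) * (W - 1) / 2.0
            py = (y_s + 1.0) * (H - 1) / 2.0
            x0, y0 = math.floor(px), math.floor(py)
            ax, ay = px - x0, py - y0
            top = (1 - ax) * pixel(y0, x0) + ax * pixel(y0, x0 + 1)
            bottom = (1 - ax) * pixel(y0 + 1, x0) + ax * pixel(y0 + 1, x0 + 1)
            out_row.append((1 - ay) * top + ay * bottom)
        V.append(out_row)
    return V


def spatial_transformer(U):
    """Localisation network -> grid generator -> sampler."""
    theta = localisation_net(U)
    grid = grid_generator(theta, len(U), len(U[0]))
    return sampler(U, grid)


# A 5x5 input with one off-centre active pixel at row 1, column 3.
U = [[0.0] * 5 for _ in range(5)]
U[1][3] = 1.0
V = spatial_transformer(U)
print(V[2][2])  # 1.0: the active pixel has been moved to the centre of the output
```

Because the parameters are computed from the input itself, a different input yields a different sampling grid, which is the sense in which the transformation is conditional on the input. For a multi-channel feature map, the same grid would be passed to the sampler once per channel.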