Preface
Paper: Very Deep Convolutional Networks for Large-Scale Image Recognition
1. INTRODUCTION
- In this paper the authors address the question of ConvNet depth: they fix the other parameters of the architecture and steadily deepen the network by adding convolutional layers, which is feasible because every convolutional layer uses very small filters.
In this paper, we address another important aspect of ConvNet architecture design – its depth. To this end, we fix other parameters of the architecture, and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3 × 3) convolution filters in all layers.
- The final models perform very well: they not only reach state-of-the-art accuracy on the classification task, but also do very well on the localisation task.
As a result, we come up with significantly more accurate ConvNet architectures, which not only achieve the state-of-the-art accuracy on ILSVRC classification and localisation tasks, but are also applicable to other image recognition datasets, where they achieve excellent performance even when used as a part of a relatively simple pipelines
2. CONVNET CONFIGURATIONS
- During training, the network input is fixed to 224 × 224 three-channel (RGB) images.
During training, the input to our ConvNets is a fixed-size 224 × 224 RGB image
- The authors use very small 3 × 3 convolution filters, one of the model's defining features.
we use filters with a very small receptive field: 3 × 3
- They also use 1 × 1 convolution filters, which can be viewed as a linear transformation of the input channels (see the sketch after the quote below).
In one of the configurations we also utilise 1 × 1 convolution filters, which can be seen as a linear transformation of the input channels
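A minimal PyTorch sketch (my own illustration, not code from the paper) of why a 1 × 1 convolution is a linear transformation of the channels: it applies the same linear map to the channel vector at every spatial position.

```python
import torch
import torch.nn as nn

conv1x1 = nn.Conv2d(in_channels=4, out_channels=4, kernel_size=1, bias=False)
x = torch.randn(1, 4, 8, 8)  # (batch, channels, height, width)

# An equivalent per-pixel linear layer sharing the conv's weights.
linear = nn.Linear(4, 4, bias=False)
linear.weight.data = conv1x1.weight.data.view(4, 4)

y_conv = conv1x1(x)
# Move channels last, apply the linear map pixel-wise, move channels back.
y_lin = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

print(torch.allclose(y_conv, y_lin, atol=1e-6))  # True
```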
- For convolutional layers with 3 × 3 filters, the padding is set to 1 pixel.
the padding is 1 pixel for 3 × 3 conv. layers
- Max-pooling uses a 2 × 2 window with a stride of 2.
Max-pooling is performed over a 2 × 2 pixel window, with stride 2
- The stack of convolutional layers is followed by three fully-connected layers: the first two have 4096 units each, and the third has 1000 units for the 1000-way ILSVRC classification.
A stack of convolutional layers (which has a different depth in different architectures) is followed by three Fully-Connected (FC) layers: the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class).
- Every hidden layer in the network uses the ReLU activation (a sketch assembling the whole configuration follows the quote below).
All hidden layers are equipped with the rectification (ReLU) non-linearity.
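Putting the bullets above together, here is a minimal PyTorch sketch of configuration A (11 weight layers). It is my own assembly rather than the authors' code, and the class name `VGG_A` is mine; the channel widths (64, 128, 256, 512, ...) follow Table 1 of the paper, and the dropout on the first two FC layers comes from Section 4 below.

```python
import torch
import torch.nn as nn

def conv3x3(cin, cout):
    # 3x3 filters with padding 1 preserve the spatial resolution.
    return nn.Sequential(nn.Conv2d(cin, cout, kernel_size=3, padding=1),
                         nn.ReLU(inplace=True))

class VGG_A(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        pool = lambda: nn.MaxPool2d(kernel_size=2, stride=2)
        self.features = nn.Sequential(
            conv3x3(3, 64), pool(),
            conv3x3(64, 128), pool(),
            conv3x3(128, 256), conv3x3(256, 256), pool(),
            conv3x3(256, 512), conv3x3(512, 512), pool(),
            conv3x3(512, 512), conv3x3(512, 512), pool(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):          # x: (N, 3, 224, 224)
        x = self.features(x)       # five 2x2 pools: 224 -> 7
        return self.classifier(torch.flatten(x, 1))

print(VGG_A()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```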
3. CONFIGURATIONS
- The paper designs five models (A–E) whose settings are essentially identical except for depth, which increases from model A with 11 weight layers (8 convolutional + 3 fully-connected) to model E with 19 weight layers (16 convolutional + 3 fully-connected).
- The models' distinctive feature is the very small convolution filters, and it is exactly this shrinking of the filters that lets the network grow deeper: a stack of two 3 × 3 convolutional layers has the same effective receptive field as a single 5 × 5 layer, and a stack of three 3 × 3 layers matches a single 7 × 7 layer (a quick numeric check follows below).
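A quick check of the receptive-field arithmetic (my own helper, not from the paper): each stride-1 layer grows the receptive field by kernel_size − 1.

```python
def stacked_receptive_field(num_layers, kernel_size=3):
    # Stride-1 conv layers each add (kernel_size - 1) to the receptive field.
    r = 1
    for _ in range(num_layers):
        r += kernel_size - 1
    return r

print(stacked_receptive_field(2))  # 5 -> same as one 5x5 layer
print(stacked_receptive_field(3))  # 7 -> same as one 7x7 layer
```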
- Why use three 3 × 3 convolutional layers instead of a single 7 × 7 one? First, splitting one non-linearity into three strengthens the network's expressive power. Second, it reduces the number of parameters: the single 7 × 7 layer would need 49C² weights versus 27C² for the stack, i.e. about 81% more (verified in the snippet after the quote below).
First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative. Second, we decrease the number of parameters: assuming that both the input and the output of a three-layer 3 × 3 convolution stack has C channels, the stack is parametrised by 3(3²C²) = 27C² weights; at the same time, a single 7 × 7 conv. layer would require 7²C² = 49C² parameters, i.e. 81% more.
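The parameter counts in the quote are easy to verify in PyTorch (the choice C = 64 is mine, for illustration; biases are ignored as in the paper's accounting):

```python
import torch.nn as nn

C = 64
stack = nn.Sequential(*[nn.Conv2d(C, C, 3, padding=1, bias=False) for _ in range(3)])
single = nn.Conv2d(C, C, 7, padding=3, bias=False)

n_stack = sum(p.numel() for p in stack.parameters())    # 27 * C*C = 110592
n_single = sum(p.numel() for p in single.parameters())  # 49 * C*C = 200704
print(n_stack, n_single, f"{n_single / n_stack - 1:.0%}")  # ... 81%
```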
- The decomposition can also be seen as a form of regularisation, since the network is forced to express one large filter through three smaller ones.
This can be seen as imposing a regularisation on the 7 × 7 conv. filters, forcing them to have a decomposition through the 3 × 3 filters (with non-linearity injected in between).
- Networks with small filters had in fact been tried before, but they were not very deep; it was only in 2014 that Goodfellow et al., with an 11-layer network, showed that increasing depth really does improve performance.
Goodfellow et al. (2014) applied deep ConvNets (11 weight layers) to the task of street number recognition, and showed that the increased depth led to better performance
4. CLASSIFICATION FRAMEWORK
- During training the authors use a batch size of 256 and a momentum coefficient of 0.9.
The batch size was set to 256, momentum to 0.9.
- Training is regularised with weight decay (an L2 penalty) and with dropout, at a ratio of 0.5, on the first two fully-connected layers.
The training was regularised by weight decay (the L2 penalty multiplier set to 5·10⁻⁴) and dropout regularisation for the first two fully-connected layers (dropout ratio set to 0.5).
- The learning rate is initialised to 0.01 and decayed by a factor of 10 whenever the accuracy on the validation set stops improving (the sketch after the next quote shows an equivalent setup).
The learning rate was initially set to 10⁻², and then decreased by a factor of 10 when the validation set accuracy stopped improving.
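A hedged sketch of this optimisation setup translated into PyTorch (the authors used their own Caffe-derived C++ code, so this is only an equivalent, not their implementation). The model and the validation accuracy are placeholders, and the `patience` value is my own choice; the paper only says the rate is dropped when accuracy stops improving.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for the ConvNet sketched earlier

# SGD with momentum 0.9, weight decay 5e-4, initial learning rate 1e-2.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)
# mode='max' because the monitored quantity is validation *accuracy*.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.1, patience=2)

for epoch in range(10):
    # ... training pass and validation pass would go here ...
    val_acc = 0.5  # placeholder validation accuracy
    scheduler.step(val_acc)  # decays lr 10x after `patience` flat epochs

print(optimizer.param_groups[0]['lr'])  # lr after the plateau decays
```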
- During training, the authors found that their nets converge in fewer epochs than the shallower networks proposed by others; they attribute this to the implicit regularisation of greater depth with smaller filters, and to the pre-initialisation of certain layers.
the nets required less epochs to converge due to (a) implicit regularisation imposed by greater depth and smaller conv. filter sizes; (b) pre-initialisation of certain layers
- Initialising the network parameters properly is very important: a bad initialisation can keep the network from learning at all. To sidestep this, the authors use a pre-training scheme: since model A is shallower and comparatively easy to train, they first train A from random initialisation, and then use A's trained weights to initialise the deeper models. They also note that Glorot (Xavier) initialisation can replace this pre-training procedure (a sketch follows below).
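A minimal sketch of the Glorot (Xavier) alternative, using PyTorch's `torch.nn.init` (the toy module below is for demonstration only, not one of the paper's configurations):

```python
import torch.nn as nn

def init_weights(m):
    # Glorot (Xavier) initialisation, which the paper notes can replace
    # the "train A first, copy its weights" pre-training procedure.
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_normal_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
)
net.apply(init_weights)  # recursively initialises every submodule
```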
5. CLASSIFICATION EXPERIMENTS
- The results show that using LRN or not has practically no effect on performance.
- As depth increases, performance keeps improving. Comparing models C and D shows that 1 × 1 filters are less effective than 3 × 3 ones, so capturing spatial context matters; comparing B and C, however, shows that the extra 1 × 1 layers, and the additional non-linearity they bring, still help.
- Performance plateaus once the depth reaches 19 layers, possibly an effect of the dataset's size; the authors conjecture that with enough data, the deeper the network the better.
- The authors also compared the 13-layer model B against a shallower 8-layer variant in which each pair of 3 × 3 convolutional layers is replaced by a single 5 × 5 layer; the shallow net's top-1 error came out 7% higher, which confirms that decomposing large-filter layers into stacks of small-filter layers really does improve performance.
- The authors then combined the predicted class probabilities of seven models, bringing the test error down to 7.3%; combining only the two best models reduced it further to 6.8% (a minimal averaging sketch follows below).
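A minimal sketch (my own, not the authors' evaluation code) of this kind of ensembling: average the softmax class posteriors of the member networks, then take the argmax.

```python
import torch
import torch.nn as nn

def ensemble_predict(models, x):
    # Stack each model's class posteriors, average, and pick the best class.
    probs = torch.stack([m(x).softmax(dim=1) for m in models])
    return probs.mean(dim=0).argmax(dim=1)

# Toy stand-ins for the seven trained ConvNets.
models = [nn.Linear(8, 1000) for _ in range(7)]
print(ensemble_predict(models, torch.randn(4, 8)).shape)  # torch.Size([4])
```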
- Finally, the authors compared their models against other published results: slightly behind GoogLeNet, but better than the winning entries of the previous ILSVRC competitions. Considering single-model performance only, VGG is slightly better than a single GoogLeNet.