
Paper: Interpretation of 《Adaptive Gradient Methods With Dynamic Bound Of Learning Rate》, the AdaBound neural network optimization algorithm proposed by Chinese undergraduates (Part 2)

2、CONVOLUTIONAL NEURAL NETWORK

     Using DenseNet-121 (Huang et al., 2017) and ResNet-34 (He et al., 2016), we then consider the task  of image classification on the standard CIFAR-10 dataset. In this experiment, we employ the fixed  budget of 200 epochs and reduce the learning rates by 10 after 150 epochs.  

     DenseNet :We first run a DenseNet-121 model on CIFAR-10 and our results are shown in Figure 3.  We can see that adaptive methods such as ADAGRAD, ADAM and AMSGRAD appear to perform  better than the non-adaptive ones early in training. But by epoch 150 when the learning rates are  decayed, SGDM begins to outperform those adaptive methods. As for our methods, ADABOUND  and AMSBOUND, they converge as fast as adaptive ones and achieve a bit higher accuracy than  SGDM on the test set at the end of training. In addition, compared with their prototypes, their  performances are enhanced evidently with approximately 2% improvement in the test accuracy.  

     ResNet :Results for this experiment are reported in Figure 3. As is expected, the overall performance  of each algorithm on ResNet-34 is similar to that on DenseNet-121. ADABOUND and  AMSBOUND even surpass SGDM by 1%. Despite the relative bad generalization ability of adaptive  methods, our proposed methods overcome this drawback by allocating bounds for their learning  rates and obtain almost the best accuracy on the test set for both DenseNet and ResNet on CIFAR-10.

     We then use DenseNet-121 (Huang et al., 2017) and ResNet-34 (He et al., 2016) for image classification on the standard CIFAR-10 dataset. In this experiment we use a fixed budget of 200 epochs and reduce the learning rates by a factor of 10 after 150 epochs.

     DenseNet: We first run a DenseNet-121 model on CIFAR-10; the results are shown in Figure 3. We can see that adaptive methods such as ADAGRAD, ADAM and AMSGRAD perform better than the non-adaptive ones early in training. But by epoch 150, when the learning rates are decayed, SGDM begins to outperform those adaptive methods. Our methods, ADABOUND and AMSBOUND, converge as fast as the adaptive methods and reach slightly higher test accuracy than SGDM at the end of training. In addition, compared with their prototypes, their performance improves markedly, with roughly a 2% gain in test accuracy.

     ResNet: Results for this experiment are reported in Figure 3. As expected, the overall behavior of each algorithm on ResNet-34 is similar to that on DenseNet-121; ADABOUND and AMSBOUND even surpass SGDM by 1%. Although adaptive methods generalize relatively poorly, our proposed methods overcome this drawback by placing bounds on their learning rates, and they obtain nearly the best test accuracy for both DenseNet and ResNet on CIFAR-10.
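To make the schedule described above concrete, here is a minimal PyTorch sketch of the CIFAR-10 setup: a fixed budget of 200 epochs with the learning rate divided by 10 after epoch 150. The model choice, batch size, and per-optimizer hyperparameters below are illustrative assumptions, not the paper's exact configuration; the AdaBound line refers to the authors' `adabound` pip package and is shown only as a comment.

```python
import torch
import torch.nn as nn
import torchvision
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# CIFAR-10 training data (transform and batch size are illustrative assumptions)
train_loader = DataLoader(
    datasets.CIFAR10("./data", train=True, download=True,
                     transform=transforms.ToTensor()),
    batch_size=128, shuffle=True)

model = torchvision.models.resnet34(num_classes=10)  # stand-in for ResNet-34 / DenseNet-121
criterion = nn.CrossEntropyLoss()

# The optimizers compared in the paper; the hyperparameters here are assumptions.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)    # SGDM
# torch.optim.Adagrad(model.parameters(), lr=0.01)                       # ADAGRAD
# torch.optim.Adam(model.parameters(), lr=1e-3)                          # ADAM
# torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)            # AMSGRAD
# adabound.AdaBound(model.parameters(), lr=1e-3, final_lr=0.1)           # ADABOUND (authors' package)

# Fixed budget of 200 epochs; learning rate reduced by a factor of 10 after epoch 150.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150], gamma=0.1)

for epoch in range(200):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```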


3、RECURRENT NEURAL NETWORK  

    Finally, we conduct an experiment on the language modeling task with Long Short-Term Memory  (LSTM) network (Hochreiter & Schmidhuber, 1997). From two experiments above, we observe that our methods show much more improvement in deep convolutional neural networks than in perceptrons.  Therefore, we suppose that the enhancement is related to the complexity of the architecture  and run three models with (L1) 1-layer, (L2) 2-layer and (L3) 3-layer LSTM respectively. We train  them on Penn Treebank, running for a fixed budget of 200 epochs. We use perplexity as the metric  to evaluate the performance and report results in Figure 4.
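For reference, perplexity is simply the exponential of the average per-token cross-entropy (negative log-likelihood). Below is a minimal sketch of how it can be computed in PyTorch; the vocabulary size and tensors are made-up placeholders, not values from the paper's Penn Treebank setup.

```python
import math
import torch
import torch.nn.functional as F

# Hypothetical shapes: logits over a 10k-word vocabulary for 64 token positions.
vocab_size, num_tokens = 10000, 64
logits = torch.randn(num_tokens, vocab_size)            # model outputs (placeholder)
targets = torch.randint(0, vocab_size, (num_tokens,))   # gold next-word ids (placeholder)

# Perplexity = exp(average per-token negative log-likelihood); lower is better.
nll = F.cross_entropy(logits, targets)                  # mean cross-entropy in nats
perplexity = math.exp(nll.item())
print(f"perplexity = {perplexity:.2f}")
```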

      We find that in all models, ADAM has the fastest initial progress but stagnates in worse performance than SGD and our methods. Different from the phenomena in the previous experiments on image classification tasks, ADABOUND and AMSBOUND do not display rapid speed at the early training stage, but their curves are smoother than that of SGD.

      We find that in all models ADAM makes the fastest initial progress but then stagnates at a performance worse than SGD and our methods. Unlike the behavior seen in the earlier image classification experiments, ADABOUND and AMSBOUND are not particularly fast at the beginning of training, but their curves are smoother than SGD's.


      Comparing L1, L2 and L3, we can easily notice a distinct difference in the degree of improvement. In L1, the simplest model, our methods perform only slightly (1.1%) better than ADAM, while in L3, the most complex model, they show an evident improvement of over 2.8% in terms of perplexity. This serves as evidence for the relationship between the model's complexity and the improvement degree.

      Comparing L1, L2 and L3, we can easily notice a distinct difference in the degree of improvement. On L1, the simplest model, our methods are only about 1.1% better than ADAM, while on L3, the most complex model, they show a clear improvement of over 2.8% in perplexity. This provides evidence for the relationship between model complexity and the degree of improvement.

Analysis of Experimental Results

      To investigate the efficacy of our proposed algorithms, we select popular tasks from computer vision and natural language processing. Based on results shown above, it is easy to find that ADAM and AMSGRAD usually perform similarly and the latter does not show much improvement for most cases. Their variants, ADABOUND and AMSBOUND, on the other hand, demonstrate a fast speed of convergence compared with SGD while they also exceed two original methods greatly with respect to test accuracy at the end of training. This phenomenon exactly confirms our view mentioned in Section 3 that both large and small learning rates can influence the convergence.

      Besides, we implement our experiments on models with different complexities, consisting of a perceptron, two deep convolutional neural networks and a recurrent neural network. The perceptron used on the MNIST is the simplest and our methods perform slightly better than others. As for DenseNet and ResNet, obvious increases in test accuracy can be observed. We attribute this difference to the complexity of the model. Specifically, for deep CNN models, convolutional and fully connected layers play different parts in the task. Also, different convolutional layers are likely to be responsible for different roles (Lee et al., 2009), which may lead to a distinct variation of gradients of parameters. In other words, extreme learning rates (huge or tiny) may appear more frequently in complex models such as ResNet. As our algorithms are proposed to avoid them, the greater enhancement of performance in complex architectures can be explained intuitively. The higher improvement degree on LSTM with more layers on the language modeling task is also consistent with the above analysis.

      To investigate the efficacy of our proposed algorithms, we select popular tasks from computer vision and natural language processing. From the results above it is easy to see that ADAM and AMSGRAD usually perform similarly, and AMSGRAD does not bring much improvement in most cases. Their variants, ADABOUND and AMSBOUND, on the other hand, converge faster than SGD while also greatly exceeding the two original methods in test accuracy at the end of training. This phenomenon confirms the view raised in Section 3 that both overly large and overly small learning rates can affect convergence.

      In addition, we run our experiments on models of different complexity: a perceptron, two deep convolutional neural networks, and a recurrent neural network. The perceptron used on MNIST is the simplest, and there our methods are only slightly better than the others. For DenseNet and ResNet, clear gains in test accuracy can be observed. We attribute this difference to model complexity. Specifically, in deep CNN models the convolutional and fully connected layers play different parts in the task; moreover, different convolutional layers are likely to take on different roles (Lee et al., 2009), which may lead to distinctly different gradient behavior across parameters. In other words, extreme learning rates (very large or very small) may appear more frequently in complex models such as ResNet. Since our algorithms are designed to avoid them, the greater performance gain on complex architectures can be explained intuitively. The larger improvement on LSTMs with more layers in the language modeling task is also consistent with this analysis.
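To make the bounded-learning-rate idea above concrete, here is a minimal sketch of one AdaBound-style update on a single tensor. The functional form of the bounds follows what the paper reports (both bounds converge toward a final step size), but the constants such as `final_lr=0.1` and the omission of extra scaling are assumptions; this is an illustration, not the authors' implementation.

```python
import torch

def adabound_step(param, grad, m, v, t, lr=1e-3, final_lr=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaBound-style update on a single tensor (illustrative sketch only).

    m and v are Adam-style first/second moment estimates; t is the step count (>= 1).
    The per-element step size lr / sqrt(v) is clipped into [lower(t), upper(t)], two
    bounds that start very loose and both converge toward final_lr, so early steps
    behave like an adaptive method and late steps behave like SGD.
    """
    m.mul_(beta1).add_(grad, alpha=1 - beta1)               # first moment estimate
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)     # second moment estimate

    # Dynamic bounds; constants here are assumptions based on the paper's description.
    lower = final_lr * (1 - 1 / ((1 - beta2) * t + 1))
    upper = final_lr * (1 + 1 / ((1 - beta2) * t))

    step_size = torch.clamp(lr / (v.sqrt() + eps), min=lower, max=upper)
    param = param - step_size * m
    return param, m, v

# Toy usage on a quadratic objective (purely illustrative):
p, m, v = torch.zeros(3), torch.zeros(3), torch.zeros(3)
target = torch.tensor([1.0, -2.0, 3.0])
for t in range(1, 201):
    g = 2 * (p - target)                                    # gradient of ||p - target||^2
    p, m, v = adabound_step(p, g, m, v, t)
```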

PS: Since time was tight, the blogger's translation is not perfect; if there are any mistakes, please point them out. Thank you!
