
Has the Transformer become the new overlord? FAIR and others redesign a pure convolutional ConvNet with superior performance

Author: Heart of the Machine Pro

Report from Heart of the Machine

Editors: Chen Ping, Xiao Zhou

Researchers from FAIR and UC Berkeley re-examined the design space and tested the limits of what a pure ConvNet can achieve, showing that convolutional neural networks can perform as well as vision Transformers.

The rapid development of visual recognition began with the introduction of the Vision Transformer (ViT), which quickly replaced the traditional convolutional neural network (ConvNet) as the state-of-the-art image classification model. However, vanilla ViT models face difficulties on a range of computer vision tasks such as object detection and semantic segmentation. Some researchers have therefore proposed hierarchical Transformers (such as Swin Transformer), which reintroduce ConvNet priors, making the Transformer practically viable as a generic vision backbone and showing excellent performance on a variety of vision tasks.

However, the effectiveness of this hybrid approach is still largely credited to the intrinsic strengths of the Transformer rather than to the inductive biases inherent in convolution. In this work, researchers from FAIR and UC Berkeley re-examine the design space and test the limits of what a pure ConvNet can achieve. They gradually "modernize" a standard ResNet toward the design of a vision Transformer and, along the way, identify several key components that contribute to the performance difference.

  • Address of the paper: https://arxiv.org/pdf/2201.03545.pdf
  • Code address: https://github.com/facebookresearch/ConvNeXt

The researchers named this family of pure ConvNet models ConvNeXt. Built entirely from standard ConvNet modules, ConvNeXt competes with the Transformer in accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformer on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.


It is worth mentioning that the paper's first author, Zhuang Liu, is a co-author of the well-known DenseNet, whose paper "Densely Connected Convolutional Networks" won the CVPR 2017 Best Paper Award, and another author, Saining Xie, is the first author of ResNeXt.

Modernizing the convolutional neural network

The study traces the trajectory from a ResNet to a Transformer-like convolutional neural network. It considers two model sizes in terms of FLOPs: the ResNet-50 / Swin-T regime with about 4.5×10^9 FLOPs, and the ResNet-200 / Swin-B regime with about 15.0×10^9 FLOPs. For simplicity, the experimental results are presented with models of ResNet-50 / Swin-T complexity.

To explore the design of Swin Transformer while retaining the simplicity of a standard convolutional neural network, the study starts from a ResNet-50 model and first trains it with a training recipe similar to that used for vision Transformers. Compared with the original ResNet-50, this alone yields a large performance improvement, and the improved result is used as the baseline.

The study then works through a series of design decisions, summarized as 1) macro design, 2) ResNeXt-ify, 3) inverted bottleneck, 4) large convolutional kernel sizes, and 5) various layer-wise micro designs. Figure 2 below shows the procedure and the result of each modernization step; all models are trained and evaluated on ImageNet-1K. Since network complexity and final performance are closely related, FLOPs are roughly controlled throughout the exploration.

Figure 2: procedure and results of each modernization step (all models trained and evaluated on ImageNet-1K).

Training methods

In addition to the network architecture design, the training procedure also affects the final performance. Vision Transformers brought not only new architectural design decisions and modules but also new training techniques (such as the AdamW optimizer) to the vision field. These mainly concern the optimization strategy and the associated hyperparameter settings.

Therefore, the first step of the study was to train a baseline model (ResNet-50/200) with a vision Transformer training procedure. The 2021 paper by Ross Wightman et al., "ResNet strikes back: An improved training procedure in timm", showed that a set of modern training techniques can significantly improve the performance of the ResNet-50 model. Here, the researchers use a training recipe close to those of DeiT and Swin Transformer, extending training from ResNet's original 90 epochs to 300 epochs.

The recipe uses the AdamW optimizer, data augmentation techniques such as Mixup, CutMix, RandAugment, and Random Erasing, and regularization schemes such as Stochastic Depth and Label Smoothing. This enhanced training recipe alone lifts the ResNet-50 model from 76.1% to 78.8% top-1 accuracy (+2.7%), implying that a large part of the performance gap between traditional ConvNets and vision Transformers may be due to training techniques.
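As a rough sketch of how such a recipe can be wired up in PyTorch with timm utilities, the snippet below combines AdamW, Mixup/CutMix, and label smoothing; the hyperparameter values are illustrative placeholders rather than the paper's settings, and RandAugment, Random Erasing, and Stochastic Depth would be added through the data pipeline and model configuration.

import torch
import timm
from timm.data import Mixup
from timm.loss import SoftTargetCrossEntropy

# Placeholder hyperparameters, not necessarily the values used in the paper.
model = timm.create_model("resnet50", num_classes=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-3, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)  # stepped once per epoch

# Mixup / CutMix with label smoothing, as implemented in timm.
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0,
                 label_smoothing=0.1, num_classes=1000)
criterion = SoftTargetCrossEntropy()  # expects the soft targets produced by Mixup

def train_step(images, targets):
    images, targets = mixup_fn(images, targets)
    loss = criterion(model(images), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()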

Macro design

The second step analyzes the macro network design of Swin Transformer. Swin Transformer follows a multi-stage design similar to a convolutional neural network, with a different feature map resolution at each stage. Two design considerations are important here: the stage compute ratio and the stem architecture.

On the one hand, the original distribution of computation across stages in ResNet was largely empirical. Swin-T follows the same principle, but with a slightly different stage compute ratio. The study adjusts the number of blocks per stage from (3, 4, 6, 3) in ResNet-50 to (3, 3, 9, 3), which also aligns FLOPs with Swin-T. This raises model accuracy from 78.8% to 79.4%.
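In code, this step amounts to nothing more than changing the per-stage block counts; a minimal sketch (variable names are ours):

# Per-stage block counts (stage compute ratio).
resnet50_blocks_per_stage   = (3, 4, 6, 3)   # original ResNet-50
modernized_blocks_per_stage = (3, 3, 9, 3)   # adjusted to roughly match Swin-T FLOPs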

The stem determines how the network initially processes the input image. Because natural images contain substantial redundancy, common architectures aggressively downsample the input to an appropriate feature map size in both standard ConvNets and vision Transformers. The standard ResNet stem is a 7×7 convolution with stride 2 followed by max pooling, which downsamples the input image by a factor of 4. Vision Transformers instead use a "patchify" stem, and Swin Transformer uses a similar patchify layer but with a smaller patch size to accommodate its multi-stage design. The study replaces the ResNet stem with a patchify layer implemented as a 4×4 convolution with stride 4, improving accuracy from 79.4% to 79.5%. This suggests that the ResNet stem can be replaced with a simpler patchify layer.
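A minimal PyTorch sketch of the two stems, assuming 96 output channels to match Swin-T and omitting the normalization around the patchify layer:

import torch.nn as nn

# ResNet stem: 7x7 conv with stride 2 plus a stride-2 max pool (4x downsampling overall).
resnet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# "Patchify" stem: a non-overlapping 4x4 convolution with stride 4,
# mapping the image directly to 96 channels.
patchify_stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)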

ResNeXt-ify

In the third step, the study borrows the idea of ResNeXt [82], which has a better FLOPs/accuracy trade-off than a vanilla ResNet. The core component is grouped convolution, in which the convolutional filters are separated into different groups. ResNeXt's guiding principle is to "use more groups, expand width": it applies grouped convolution to the 3×3 layer in the bottleneck block. Because this significantly reduces FLOPs, the network width is expanded to compensate for the capacity loss.

The study uses depthwise convolution, a special case of grouped convolution in which the number of groups equals the number of channels. Depthwise convolution has previously been used in MobileNet [32] and Xception [9]. The researchers note that depthwise convolution is similar to the weighted-sum operation in self-attention: it operates on a per-channel basis, mixing information only in the spatial dimension. Using depthwise convolution effectively reduces the network's FLOPs. Following the strategy proposed in ResNeXt, the study then increases the network width to the same number of channels as Swin-T (from 64 to 96). With FLOPs increased to 5.3G, network performance reaches 80.5%.
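A depthwise convolution is simply a grouped convolution whose number of groups equals the number of channels; a minimal sketch:

import torch
import torch.nn as nn

dim = 96  # width raised from 64 to 96 to match Swin-T

# Depthwise convolution: groups == channels, so each filter mixes information
# only spatially, within its own channel, much like the per-channel weighted
# sum in self-attention.
dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

x = torch.randn(1, dim, 56, 56)
print(dwconv(x).shape)  # torch.Size([1, 96, 56, 56])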

Inverted bottleneck

An important design element in the Transformer block is the inverted bottleneck: the hidden dimension of the MLP block is four times wider than the input dimension, as shown in Figure 4 below. Interestingly, this Transformer design is connected to the inverted bottleneck design with an expansion ratio of 4 used in convolutional neural networks.

Figure 4: block designs of ResNet, Swin Transformer, and ConvNeXt.

The fourth step therefore explores the inverted bottleneck design. As shown in Figure 3 below, although the FLOPs of the depthwise convolutional layer increase, the FLOPs of the whole network drop to 4.6G, because the shortcut 1×1 convolutions of the downsampling residual blocks become much cheaper. Interestingly, this slightly improves performance from 80.5% to 80.6%. In the ResNet-200 / Swin-B regime, this step brings an even larger gain, from 81.9% to 82.6%, together with a reduction in FLOPs.
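The sketch below illustrates an inverted bottleneck block with an expansion ratio of 4, roughly corresponding to Figure 3(b); normalization layers and the stride/shortcut handling of downsampling blocks are omitted, so this is an illustration rather than the paper's exact module.

import torch.nn as nn

class InvertedBottleneck(nn.Module):
    """Narrow -> wide (4x) -> narrow block, mirroring the Transformer MLP expansion."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = expansion * dim
        self.pw_expand = nn.Conv2d(dim, hidden, kernel_size=1)
        # The depthwise conv now runs on the wide hidden width (its FLOPs grow),
        # but the network overall gets cheaper because the shortcut 1x1 convs shrink.
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
        self.pw_reduce = nn.Conv2d(hidden, dim, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x):
        shortcut = x
        x = self.act(self.pw_expand(x))
        x = self.dwconv(x)
        x = self.pw_reduce(x)
        return x + shortcut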

Figure 3: block modifications, including the inverted bottleneck and moving the depthwise convolutional layer up.

Convolutional kernel size

In the fifth step, the study explores the role of large convolutional kernels. The most distinctive feature of vision Transformers is non-local self-attention, which gives each layer a global receptive field. Although large kernels have been used in convolutional networks before, the gold standard (following VGGNet [62]) is to stack small 3×3 kernels. Swin Transformer reintroduces a local window into the self-attention block, but its window size is at least 7×7, significantly larger than the 3×3 kernel size of ResNe(X)t. The study therefore revisits the use of large convolutional kernels in convolutional neural networks.

Moving the depthwise convolutional layer up. To explore large kernels, a prerequisite is to move the depthwise convolutional layer up (Figure 3(c)). This mirrors the Transformer, where the MSA block is placed before the MLP layers. With the inverted bottleneck already in place, the complex, inefficient modules (MSA, large-kernel convolution) then operate on fewer channels, while the efficient, dense 1×1 layers do the heavy lifting. This intermediate step reduces FLOPs to 4.1G and temporarily degrades performance to 79.9%.

Increasing the kernel size. With these preparations in place, using larger convolutional kernels brings significant benefits. The study experiments with several kernel sizes: 3, 5, 7, 9, and 11. Network performance improves from 79.9% (3×3) to 80.6% (7×7), while the network's FLOPs remain roughly unchanged.

In addition, the researchers observe that the benefit of larger kernels saturates at 7×7, and they verified this behavior in the larger-capacity model: the ResNet-200 regime model shows no further gain when the kernel size grows beyond 7×7. The study therefore uses 7×7 depthwise convolutions in each block.
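Combining the two changes, a sketch of the block at this point, roughly corresponding to Figure 3(c): the depthwise convolution is moved to the top of the block and enlarged to 7×7, while the 1×1 layers handle channel mixing (normalization still omitted).

import torch.nn as nn

class LargeKernelBlock(nn.Module):
    """Depthwise 7x7 conv moved to the top of the block (cf. MSA placed before
    the MLP in a Transformer block). Illustrative sketch only."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        # Spatial mixing happens first, on the narrow width, with a large kernel.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        # Dense 1x1 layers then do the heavy channel mixing in the wide hidden space.
        self.pw_expand = nn.Conv2d(dim, expansion * dim, kernel_size=1)
        self.pw_reduce = nn.Conv2d(expansion * dim, dim, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x):
        shortcut = x
        x = self.dwconv(x)
        x = self.pw_reduce(self.act(self.pw_expand(x)))
        return x + shortcut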

At this point, the examination and adjustment of the network architecture at the macro scale is complete.

Micro design

Next, the researchers explore several architectural differences at the micro scale. Most of the exploration here is done at the layer level, focusing on the specific choice of activation functions and normalization layers.

Replacing ReLU with GELU. Researchers have developed many activation functions over time, but ReLU is still widely used in ConvNets for its simplicity and efficiency. ReLU was also the activation function of the original Transformer. GELU can be regarded as a smoother variant of ReLU and is used in the most advanced Transformers, including Google's BERT and OpenAI's GPT-2 as well as ViT. The study finds that ReLU can also be replaced with GELU in the ConvNet, with the accuracy unchanged (80.6%).

Fewer activation functions. One small difference between a Transformer block and a ResNet block is that the Transformer has fewer activation functions. As shown in Figure 4, the study removes all GELU layers from the residual block except the one between the two 1×1 layers, replicating the style of a Transformer block. This improves the result by 0.7% to 81.3%, practically matching Swin-T.

Fewer normalization layers. Transformer blocks also typically have fewer normalization layers. Here, the study removes two BatchNorm (BN) layers, leaving only one BN layer before the 1×1 conv layers. This further boosts performance to 81.4%, already surpassing Swin-T. Note that each block now has even fewer normalization layers than a Transformer block; the researchers found that adding an extra BN layer at the beginning of the block does not improve performance.

Replacing BN with LN. BatchNorm (BN) is an important component of ConvNets because it improves convergence and reduces overfitting. However, BN also has many intricacies that can hurt model performance. Researchers have repeatedly tried to develop alternatives, yet BN remains the preferred choice for most vision tasks. Directly substituting LN for BN in the original ResNet does not work well. With the modernized architecture and training techniques, the study revisits using LN instead of BN and finds that the ConvNet model has no difficulty training with LN; in fact, performance improves slightly, reaching 81.5% accuracy.

Separate downsampling layers. In ResNet, spatial downsampling is performed by the residual block at the start of each stage, using a 3×3 convolution with stride 2. In Swin Transformer, a separate downsampling layer is added between stages. The study explores a similar strategy, using 2×2 convolutional layers with stride 2 for spatial downsampling. Surprisingly, this change on its own destabilizes training. Further investigation shows that adding a normalization layer wherever spatial resolution changes helps stabilize training. With this, accuracy improves to 82.0%, clearly exceeding Swin-T's 81.3%. The study adopts the separate downsampling layers, arriving at the final model, ConvNeXt. A comparison of the ResNet, Swin, and ConvNeXt block structures is shown in Figure 4.
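Putting the micro-design changes together, the resulting block looks roughly like the sketch below, with a single GELU, a single LayerNorm, and separate LayerNorm + 2×2 stride-2 downsampling layers between stages. It follows the description above rather than the authors' reference implementation, which, for instance, realizes the 1×1 convolutions as linear layers on channels-last tensors.

import torch.nn as nn

class ChannelLayerNorm(nn.Module):
    """LayerNorm over the channel dimension of an NCHW tensor."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.norm = nn.LayerNorm(dim, eps=eps)

    def forward(self, x):
        # (N, C, H, W) -> (N, H, W, C) -> normalize over C -> back to NCHW
        return self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

class ConvNeXtBlock(nn.Module):
    """7x7 depthwise conv -> LayerNorm -> 1x1 expand -> GELU -> 1x1 reduce,
    i.e. a single normalization layer and a single activation per block."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = ChannelLayerNorm(dim)
        self.pw_expand = nn.Conv2d(dim, expansion * dim, kernel_size=1)
        self.act = nn.GELU()
        self.pw_reduce = nn.Conv2d(expansion * dim, dim, kernel_size=1)

    def forward(self, x):
        shortcut = x
        x = self.norm(self.dwconv(x))
        x = self.pw_reduce(self.act(self.pw_expand(x)))
        return x + shortcut

def downsample_layer(in_dim, out_dim):
    """Separate downsampling between stages: LayerNorm, then a 2x2 conv with stride 2."""
    return nn.Sequential(ChannelLayerNorm(in_dim),
                         nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2))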

A comparison of the detailed architectural specifications for ResNet-50, Swin-T, and ConvNeXt-T is shown in Table 9.

Table 9: detailed architecture specifications of ResNet-50, Swin-T, and ConvNeXt-T.

Experiments

ImageNet experimental evaluation

The study constructs several ConvNeXt variants, ConvNeXt-T/S/B/L, with complexities similar to Swin-T/S/B/L for benchmark comparison. In addition, a larger ConvNeXt-XL is built to further test the scalability of ConvNeXt. The variants differ in the number of channels and the number of blocks per stage, as follows:

Table: number of channels and blocks for each ConvNeXt variant.

ImageNet-1K results: The following table compares ConvNeXt with the Transformer variants DeiT and Swin Transformer, as well as with RegNets and EfficientNets.

The results show that ConvNeXt is competitive with the strong ConvNet baselines (RegNet and EfficientNet) in terms of the accuracy-compute trade-off and inference throughput; ConvNeXt also outperforms Swin Transformer across the board at similar complexity, and it achieves higher throughput without specialized modules such as shifted windows or relative position bias.

Table: ImageNet-1K classification results for ConvNeXt, DeiT, Swin Transformer, RegNet, and EfficientNet.

ImageNet-22K: The table below (part of the same comparison) shows the results of models fine-tuned after ImageNet-22K pre-training. These experiments matter because it is widely believed that vision Transformers have fewer inductive biases and may therefore perform better than ConvNets when pre-trained at large scale. The study shows that, when pre-trained on a large dataset, a properly designed ConvNet is not inferior to a vision Transformer: ConvNeXt still performs on par with a similarly sized Swin Transformer, with slightly higher throughput. In addition, the proposed ConvNeXt-XL model reaches 87.8% accuracy, a considerable improvement over ConvNeXt-L at 384^2, demonstrating that ConvNeXt is a scalable architecture.

Table: results of models pre-trained on ImageNet-22K and fine-tuned on ImageNet-1K.

Isotropic ConvNeXt versus ViT: In an ablation study, the researchers construct isotropic ConvNeXt-S/B/L using the same feature dimensions as ViT-S/B/L (384/768/1024). The depth is set to 18/18/36 to match the number of parameters and FLOPs, and the block structure remains the same (Figure 4). The ImageNet-1K results at 224^2 resolution are shown in Table 2: ConvNeXt performs on par with ViT, suggesting that the ConvNeXt block design remains competitive when used in a non-hierarchical model.

Table 2: comparison of isotropic ConvNeXt and ViT on ImageNet-1K.

Downstream task evaluation

Object detection and segmentation on COCO: The study fine-tunes Mask R-CNN and Cascade Mask R-CNN on the COCO dataset with ConvNeXt backbones. Table 3 compares Swin Transformer, ConvNeXt, and traditional ConvNets such as ResNeXt on object detection and instance segmentation. The results show that ConvNeXt performs on par with Swin Transformer across model complexities.

Table 3: COCO object detection and instance segmentation results.

ADE20K semantic segmentation: In Table 4, the study reports validation mIoU with multi-scale testing. ConvNeXt models achieve competitive performance across model capacities, further validating the effectiveness of the ConvNeXt design.

Table 4: ADE20K semantic segmentation results.
