
Kaiming He's new paper: a plain ViT backbone alone can do object detection well

Report from Machine Heart

Editors: Zhang Qian, Xiao Zhou

Do we really need an FPN for object detection? Yesterday, Yanghao Li, Kaiming He, and other researchers from Facebook AI Research uploaded a new paper to arXiv demonstrating the feasibility of object detection with a plain, non-hierarchical Vision Transformer as the backbone network. They hope the work will draw attention to plain-backbone detectors.


Research overview

Paper: Exploring Plain Vision Transformer Backbones for Object Detection

Paper link: https://arxiv.org/pdf/2203.16527.pdf

Current object detectors typically consist of a backbone feature extractor that is agnostic to the detection task, plus a set of necks and heads that encode detection-specific prior knowledge. Common components in the neck/head include region-of-interest (RoI) operations, region proposal networks (RPN) or anchors, and feature pyramid networks (FPN). If the design of the task-specific neck/head is decoupled from the design of the backbone, the two can evolve in parallel, and empirically, object detection research has benefited from largely independent exploration of general-purpose backbones and detection-specific modules. For a long time, these backbones have been multi-scale, hierarchical architectures, owing to the de facto design of convolutional networks, and this has heavily influenced the design of necks/heads for detecting objects across multiple scales (e.g., the FPN).

Over the past year, the Vision Transformer (ViT) has become a powerful backbone for visual recognition. Unlike typical ConvNets, the original ViT is a plain, non-hierarchical architecture that maintains a single-scale feature map throughout. Its "minimalist" pursuit runs into challenges when applied to object detection: for example, how can a plain backbone pre-trained upstream handle multi-scale objects in a downstream task? Is a plain ViT too inefficient for high-resolution detection images? One solution, which abandons this pursuit, is to reintroduce hierarchical designs into the backbone. Such solutions, e.g., Swin Transformer and related networks, can inherit ConvNet-based detector designs and have proven successful.

In this work, Kaiming He and his co-authors pursue a different direction: exploring object detectors that use only plain, non-hierarchical backbones. If this direction succeeds, object detection with the original ViT backbone becomes possible; the design of pre-training would be decoupled from fine-tuning requirements, and upstream and downstream tasks would remain independent, as in ConvNet-based research. This direction also partly follows ViT's philosophy of reducing inductive bias in the pursuit of general-purpose features: since non-local self-attention computation can learn translation-equivariant features, it may also learn scale-equivariant features from some form of supervised or self-supervised pre-training.

The researchers say their goal is not to develop new components, but to overcome the above challenges with minimal adaptations. Specifically, their detector builds a simple feature pyramid from only the last feature map of a plain ViT backbone (as shown in Figure 1). This abandons both the FPN design and the requirement of a hierarchical backbone. To efficiently extract features from high-resolution images, the detector uses simple non-overlapping window attention (without shifting), with a small number of cross-window blocks (using global attention or convolution) to propagate information. These adaptations are made only during fine-tuning and do not alter pre-training.


This simple design yields surprising results. The researchers find that the FPN design is not necessary with a plain ViT backbone; its benefits can be effectively obtained by a simple pyramid built from a single-scale feature map with a large stride (16). They also find that window attention is sufficient, as long as information can propagate well across windows in a small number of layers.

Even more surprisingly, in some cases the plain-backbone detector the researchers developed, named "ViTDet", is competitive with leading hierarchical-backbone detectors (e.g., Swin, MViT). With Masked Autoencoder (MAE) pre-training, their plain-backbone detector can outperform hierarchical detectors pre-trained with supervision on ImageNet-1K/21K (as shown in Figure 3 below).


On larger models, this gain is more pronounced. The detector's strong performance is observed across different object detection frameworks, including Mask R-CNN, Cascade Mask R-CNN, and their enhanced versions.

Experimental results on the COCO dataset show that a ViTDet detector with a plain ViT-Huge backbone, pre-trained only on unlabeled ImageNet-1K, reaches 61.3 AP^box. The researchers also report ViTDet's competitive results on the long-tailed LVIS detection dataset. While these strong results may stem in part from the effectiveness of MAE pre-training, the study suggests that plain-backbone detectors are promising, challenging the entrenched position of hierarchical backbones in object detection.

Method details

The goal of the study is to remove the hierarchical constraint on the backbone network and enable object detection with a plain backbone. To that end, the researchers adapt the plain backbone to the detection task during fine-tuning with minimal modification. After these changes, any detector head can in principle be applied; the researchers choose Mask R-CNN and its extensions.

Simple feature pyramid

FPN is a common solution for building an in-network pyramid for object detection. If the backbone is hierarchical, the motivation of FPN is to combine early, higher-resolution features with later, stronger features. This is achieved in FPN through top-down and lateral connections, as shown on the left of Figure 1.
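For readers less familiar with FPN, the following minimal PyTorch sketch illustrates the lateral and top-down mechanism described above. The three-stage setup and channel widths are illustrative assumptions, not taken from the paper.

```python
import torch
from torch import nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Sketch of the classic FPN mechanism this paper revisits: lateral 1x1
    convs project each backbone stage to a common width, and a top-down path
    upsamples coarser maps and adds them into finer ones."""

    def __init__(self, in_dims=(192, 384, 768), out_dim=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_dim, 1) for c in in_dims)
        self.output = nn.ModuleList(
            nn.Conv2d(out_dim, out_dim, 3, padding=1) for _ in in_dims
        )

    def forward(self, feats):  # feats: fine-to-coarse list of stage outputs
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # top-down path: add each upsampled coarser map into the next finer one
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], scale_factor=2, mode="nearest"
            )
        return [o(x) for o, x in zip(self.output, laterals)]
```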


If the backbone is not hierarchical, the basis of the FPN motivation disappears, because all feature maps in the backbone have the same resolution. The study uses only the last feature map of the backbone, as it should have the strongest features.

The researchers apply a set of convolutions or deconvolutions in parallel to this last feature map to generate multi-scale feature maps. Specifically, starting from the default ViT feature map with a scale of 1/16 (stride = 16), they produce feature maps at different scales, as shown on the right of Figure 1; this process is called the "simple feature pyramid".
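The following is a minimal PyTorch sketch of this idea: parallel (de)convolution branches turn the single stride-16 ViT map into a pyramid at strides {4, 8, 16, 32}. The exact layer choices (GELU between deconvolutions, 1x1 output projections) and the ViT-Base width of 768 are illustrative assumptions; the paper's implementation may differ in details.

```python
import torch
from torch import nn

class SimpleFeaturePyramid(nn.Module):
    """Sketch of the simple feature pyramid: multi-scale maps built from
    ViT's single stride-16 output with parallel (de)convolutions."""

    def __init__(self, dim=768, out_dim=256):
        super().__init__()
        # stride 16 -> 4: two 2x transposed convolutions
        self.up4 = nn.Sequential(
            nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(dim // 2, dim // 4, kernel_size=2, stride=2),
        )
        # stride 16 -> 8: one 2x transposed convolution
        self.up8 = nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2)
        # stride 16 -> 16: identity
        self.keep16 = nn.Identity()
        # stride 16 -> 32: 2x downsampling
        self.down32 = nn.MaxPool2d(kernel_size=2, stride=2)
        # project every scale to a common channel width for the detector head
        self.proj = nn.ModuleList(
            nn.Conv2d(c, out_dim, kernel_size=1)
            for c in (dim // 4, dim // 2, dim, dim)
        )

    def forward(self, x):  # x: (B, dim, H/16, W/16), the last ViT feature map
        feats = [self.up4(x), self.up8(x), self.keep16(x), self.down32(x)]
        return [p(f) for p, f in zip(self.proj, feats)]
```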


The strategy of building a multi-scale feature map from a single feature map is related to that of SSD, but this study's scenario involves upsampling a deep, low-resolution feature map. In hierarchical backbones, upsampling is usually aided by lateral connections; however, the researchers found experimentally that lateral connections are not necessary for plain ViT backbones, and simple deconvolution suffices. They speculate that this is because ViT can rely on position embeddings to encode location, and because the high-dimensional ViT patch embeddings do not necessarily discard information.

As shown in the figure below, the study compares this simple feature pyramid to two FPN variants that are also built on a plain backbone. In the first variant, the backbone is artificially divided into multiple stages to mimic a hierarchical backbone, with lateral and top-down connections applied (Figure 2(a)). The second variant is similar, but uses only the last feature map (Figure 2(b)). The study shows that these FPN variants are not required.


Backbone network tuning

Object detectors benefit from high-resolution input images, but computing global self-attention throughout the backbone is memory-intensive and slow. The study focuses on the scenario where the pre-trained backbone performs global self-attention and is then adapted to higher-resolution inputs during fine-tuning. This contrasts with recent methods that modify the attention computation directly during backbone pre-training. This scenario lets the researchers use the original ViT backbone for detection without redesigning the pre-training architecture.

The study explores window attention combined with a few cross-window blocks. During fine-tuning, given a high-resolution feature map, it is divided into regular non-overlapping windows, and self-attention is computed within each window; in the original Transformer paper this is referred to as "restricted" self-attention.
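The window partition itself is a simple reshaping operation. Below is a minimal PyTorch sketch; padding for feature maps whose sides are not divisible by the window size is omitted, and the demo applying standard multi-head self-attention inside each window is only an illustration, not the paper's exact attention module.

```python
import torch
from torch import nn

def window_partition(x, win):
    """Split a feature map into non-overlapping windows.
    x: (B, H, W, C); assumes H and W are divisible by win (padding omitted)."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // win, win, W // win, win, C)
    # -> (num_windows*B, win*win, C): each window becomes one attention sequence
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, C)

def window_unpartition(windows, win, H, W):
    """Inverse of window_partition, restoring the (B, H, W, C) layout."""
    B = windows.shape[0] // ((H // win) * (W // win))
    x = windows.reshape(B, H // win, W // win, win, win, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

if __name__ == "__main__":
    B, H, W, C, win = 2, 64, 64, 768, 16
    x = torch.randn(B, H, W, C)
    attn = nn.MultiheadAttention(C, num_heads=8, batch_first=True)
    w = window_partition(x, win)          # (B*16, 256, C): a 4x4 grid of windows
    w, _ = attn(w, w, w, need_weights=False)
    y = window_unpartition(w, win, H, W)  # back to (B, H, W, C)
    print(y.shape)
```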

Unlike Swin, this method does not "shift" the windows across layers. To allow information to propagate, the study uses a very small number of blocks (4 by default) that can span windows. The researchers evenly divide the pre-trained backbone into 4 subsets of blocks (e.g., 6 blocks per subset for the 24-block ViT-L) and apply a propagation strategy in the last block of each subset. They analyze the following two strategies:

Global propagation. This strategy performs global self-attention in the last block of each subset. Since the number of global blocks is small, the memory and computation cost is feasible. This is similar to the hybrid window attention used in conjunction with FPN in (Li et al., 2021).

Convolutional propagation. As an alternative, this strategy adds an extra convolutional block after each subset. A convolutional block is a residual block consisting of one or more convolutions and an identity shortcut, in which the last layer is initialized to zero so that the block's initial state is an identity. Initializing the block to identity allows it to be inserted anywhere in the pre-trained backbone without breaking the backbone's initial state (see the sketch after this list).
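Below is a minimal PyTorch sketch of such a zero-initialized residual convolutional block (global propagation, by contrast, simply runs the usual unwindowed self-attention in the chosen blocks). The 3x3 convolution layout is one plausible choice among the residual designs one could use, not necessarily the paper's exact configuration.

```python
import torch
from torch import nn

class ConvPropagationBlock(nn.Module):
    """Sketch of convolutional propagation: a residual block whose last
    convolution is zero-initialized, so at insertion time the block is an
    identity and does not disturb the pre-trained backbone."""

    def __init__(self, dim):
        super().__init__()
        self.conv1 = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.act = nn.GELU()
        self.conv2 = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        # zero-init the last layer: the residual branch starts as all zeros,
        # so the whole block starts out as the identity mapping
        nn.init.zeros_(self.conv2.weight)
        nn.init.zeros_(self.conv2.bias)

    def forward(self, x):  # x: (B, dim, H, W)
        return x + self.conv2(self.act(self.conv1(x)))
```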

This backbone tuning is very simple and makes detection fine-tuning compatible with global self-attention pre-training, so there is no need to redesign the pre-training architecture.

Experimental results

Ablation studies

In the ablation study, the researchers came to the following conclusions:

1. A simple feature pyramid is sufficient. In Table 1, they compare the feature pyramid building strategies shown in Figure 2.


2. With the help of a few propagation blocks, window attention is sufficient. Table 2 summarizes the backbone tuning methods proposed in the paper. In short, compared with a baseline that uses only window attention and no cross-window propagation blocks ("none" in the table), the various propagation methods yield substantial gains.


3. Masked autoencoders provide powerful pre-trained backbones. Table 4 compares backbone pre-training strategies.


Comparison with hierarchical backbones

Table 5 below shows the results of the comparison with hierarchical backbone networks.


Figure 3 below plots accuracy against model size, FLOPs, and test time for several models.


Comparison with previous systems

Table 6 below shows system-level comparison results for several methods on the COCO dataset.

