
CNN Visualization Algorithm: CAM

Contents: 1. Overview  2. CAM  3. Experimental Effects  4. References

Author: NLP and Artificial Intelligence

This article describes the CAM (Class Activation Mapping) algorithm, proposed in 2016, which visualizes the feature maps of a convolutional neural network by mapping them back onto the original image to obtain the importance of different regions. CAM linearly weights the feature maps using the parameters of the softmax layer that follows global average pooling (GAP), yielding the regions of the image the model focuses on for each category.

1. Overview

[Figure: CAM algorithm effect]

The CAM algorithm was proposed in the paper Learning Deep Features for Discriminative Localization. The authors found that even though a CNN is never given object locations during training, it still retains a strong ability to localize objects, as shown in the figure above. The figure shows the result of running CAM: for the category brushing teeth, the CNN effectively locates the toothbrush, and for cutting trees, it locates the chainsaw.

However, a CNN is usually followed by fully connected layers, and the authors argue that these layers harm the network's localization ability. The CAM algorithm therefore replaces the fully connected layers with global average pooling (GAP), which preserves the model's localization capability. GAP also acts as a regularizer that helps prevent overfitting during training. The difference between GAP and other pooling methods is shown in the following figure: global pooling simply extends the pooling window to the size of the entire feature map.

[Figure: Various pooling methods]
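To make the difference concrete, here is a minimal PyTorch sketch (an illustrative assumption, not code from the paper) comparing ordinary average pooling with global average pooling:

```python
import torch
import torch.nn.functional as F

# A batch of feature maps: (batch, channels, height, width)
feat = torch.randn(1, 512, 14, 14)

# Ordinary average pooling with a 2x2 window halves the spatial size.
local_avg = F.avg_pool2d(feat, kernel_size=2)      # -> (1, 512, 7, 7)

# Global average pooling extends the window to the whole feature map,
# reducing each channel to a single scalar.
gap = F.adaptive_avg_pool2d(feat, output_size=1)   # -> (1, 512, 1, 1)
gap = gap.flatten(1)                               # -> (1, 512)

print(local_avg.shape, gap.shape)
```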

The authors also tested CAM's localization ability on the ILSVRC dataset. Trained with only weak supervision, it achieved a top-5 localization error rate of 37.1%, remarkably close to the 34.2% top-5 error of an AlexNet trained with full supervision.

The difference between weakly supervised and fully supervised localization training is shown in the following figure: under weak supervision, each image carries only a class label and no bounding box, while under full supervision each image also has a bounding box.

[Figure: Weakly supervised localization training]

2. CAM

[Figure: Schematic of the CAM model]

The figure above is a schematic of the CAM model: the last convolutional layer is followed by GAP, and the GAP output is fed into a softmax layer for classification. The last convolutional layer has n channels, so the vector obtained after GAP has dimension n, one value per channel. The weights w1, ..., wn in the figure are the softmax-layer weights for one particular class (in the figure, Australian terrier).
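As a concrete illustration, here is a minimal PyTorch sketch of this architecture; the backbone, channel count, and class count are assumptions for illustration, not the paper's exact networks:

```python
import torch
import torch.nn as nn

class CAMNet(nn.Module):
    """Conv backbone -> GAP -> linear layer feeding softmax, as in CAM."""

    def __init__(self, num_classes=1000, channels=512):
        super().__init__()
        # Stand-in backbone; the paper modifies AlexNet/VGG/GoogLeNet.
        self.features = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)          # global average pooling
        self.fc = nn.Linear(channels, num_classes)  # holds the per-class weights w1..wn

    def forward(self, x):
        feat = self.features(x)             # (B, n, H, W): last conv feature map
        pooled = self.gap(feat).flatten(1)  # (B, n): one value per channel
        logits = self.fc(pooled)            # softmax is applied in the loss
        return logits, feat

model = CAMNet()
logits, feature_map = model(torch.randn(1, 3, 224, 224))
```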

The feature map output by the last convolutional layer has n channels. To visualize a class, its n weights are used to weight the n channels of the feature map, and the weighted sum is mapped back onto the original image to obtain the importance of each region. Because the feature map and the original image differ in size, CAM simply resizes the weighted feature map to the size of the original image. Let's now look at the CAM formulas.

For an image, the score for its belonging to category c is computed with the following formula, where k indexes the channels of the last convolutional layer, (x, y) are coordinates on the feature map, f_k is the feature map of channel k, and w_k^c is the softmax weight connecting channel k to class c:

$$S_c = \sum_k w_k^c \sum_{x,y} f_k(x, y)$$

Score of an image for category c
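In code, this score is simply the pooled feature vector multiplied by the class weights. A small sketch (with made-up tensors, not the paper's code) verifying that GAP-then-weight gives the same score up to the 1/(H*W) averaging factor:

```python
import torch

# feat: (n, H, W) feature map of one image; w_c: (n,) softmax weights for class c.
feat = torch.randn(512, 14, 14)
w_c = torch.randn(512)

# Direct form: S_c = sum_k w_k^c * sum_{x,y} f_k(x, y)
score_direct = (w_c * feat.sum(dim=(1, 2))).sum()

# GAP form: pool each channel to its mean, then apply the class weights.
gap = feat.mean(dim=(1, 2))                                 # (n,)
score_gap = (w_c * gap).sum() * feat.shape[1] * feat.shape[2]

print(torch.allclose(score_direct, score_gap, atol=1e-4))   # True
```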

For category c, we can obtain a map of the importance of each region of the image by taking the weighted sum of the feature map channels:

$$M_c(x, y) = \sum_k w_k^c f_k(x, y)$$

Class activation map (importance map) for category c
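Putting it together, here is a minimal sketch of computing the class activation map and resizing it to the original image (function and variable names are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def compute_cam(feat, w_c, image_size):
    """M_c(x, y) = sum_k w_k^c * f_k(x, y), resized to the original image size."""
    # feat: (n, H, W) last-conv feature map; w_c: (n,) softmax weights for class c.
    cam = (w_c[:, None, None] * feat).sum(dim=0)   # (H, W) weighted sum over channels
    # CAM simply resizes the map to the input image's size.
    cam = F.interpolate(cam[None, None], size=image_size,
                        mode="bilinear", align_corners=False)[0, 0]
    # Normalize to [0, 1] so it can be drawn as a heat map over the image.
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

heatmap = compute_cam(torch.randn(512, 14, 14), torch.randn(512), (224, 224))
print(heatmap.shape)  # torch.Size([224, 224])
```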

Below is the effect of the CAM visualization: it localizes quite clearly to the regions of the image most relevant to the class.

[Figure: CAM visualizations]

At the same time, CAM generates different heat maps for different categories, as shown below; for the category dome, the heat map focuses on the top of the building.

[Figure: CAM visualizations for different categories]

3. Experimental Effects

The authors first verify the classification performance after the CAM modification, as shown in the table below, where GAP means replacing the fully connected layers with global average pooling (i.e., the CAM setup) and GMP means replacing them with global max pooling. The error rate rises slightly after the modification, with AlexNet affected the most.

[Table: Classification performance of the CAM-modified networks]

The authors then evaluate CAM on the weakly supervised object localization task, with results shown in the table below, where weakly means weakly supervised training and full means fully supervised training. CAM performs well, coming close to the fully supervised AlexNet.

[Table: Localization performance of the CAM algorithm]

4. References

Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A. Learning Deep Features for Discriminative Localization. CVPR 2016.