
Paper Reading: SiamMask

I. A Quick Take on the Paper

1. SiamMask combines two tasks in a single network: object tracking and object segmentation. On VOT benchmarks it wins on accuracy; on VOS benchmarks it wins on speed. Earlier video segmentation networks mostly ran at under 1 fps, while SiamMask reaches 55 fps. Impressive!

2. Most earlier VOT methods learned a classifier online and could update the template on later frames before re-classifying, i.e. tracking-by-detection, as in KCF-style methods. The Siamese family of trackers instead learns the similarity between the first-frame template and a search region, producing a response map: the template's feature map is used as a convolution kernel slid over the search region's feature map. SiamMask uses depth-wise cross-correlation here, producing a multi-channel response of a candidate window (RoW) that encodes richer information.
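The depth-wise cross-correlation above can be sketched with a naive sliding-window loop (the real implementation would use grouped convolution on GPU; shapes here are toy values, not the paper's):

```python
import numpy as np

def depthwise_xcorr(z, x):
    """Depth-wise cross-correlation: each channel of the template z
    slides over the matching channel of the search features x, giving
    a multi-channel response (RoW) instead of a single-channel map."""
    c, hz, wz = z.shape
    _, hx, wx = x.shape
    ho, wo = hx - hz + 1, wx - wz + 1
    out = np.zeros((c, ho, wo))
    for ch in range(c):               # channels stay independent
        for i in range(ho):
            for j in range(wo):
                out[ch, i, j] = np.sum(z[ch] * x[ch, i:i + hz, j:j + wz])
    return out

# toy shapes: a 4-channel 3x3 template over a 7x7 search region
z = np.random.randn(4, 3, 3)
x = np.random.randn(4, 7, 7)
row = depthwise_xcorr(z, x)
print(row.shape)  # (4, 5, 5)
```

Contrast this with plain cross-correlation, which would sum over channels and collapse the response to a single map.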

3. Each RoW predicts exactly one mask. This differs from Mask R-CNN, which predicts k masks per region, where k is the number of classes. Note also that the box branch predicts k boxes per RoW, but that k is the preset number of anchors with different scales and aspect ratios.
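The "one RoW, one mask" idea can be sketched as a single head mapping a flattened RoW vector to a 63x63 logit grid (63x63 and 256 follow the paper's mask branch; the random linear weights here are placeholders, not the trained head, which is built from 1x1 convolutions):

```python
import numpy as np

D, H, W = 256, 63, 63                 # RoW depth and mask size, per the paper
rng = np.random.default_rng(0)
w_mask = rng.standard_normal((H * W, D)) * 0.01   # placeholder head weights
b_mask = np.zeros(H * W)

def predict_mask(row_vec):
    """One RoW -> one mask: a single (h*w)-dim logit vector, reshaped.
    Mask R-CNN instead predicts k class-specific masks per region."""
    logits = w_mask @ row_vec + b_mask
    return logits.reshape(H, W)

mask_logits = predict_mask(rng.standard_normal(D))
print(mask_logits.shape)  # (63, 63)
```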

4. How the box reported for VOT evaluation is generated from the predicted mask also affects the results; balancing accuracy and speed, the paper chooses the rotated minimum bounding rectangle (MBR).
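For illustration, here is the simplest of the box-generation strategies the paper compares, the axis-aligned min-max box; the MBR the paper actually selects is the *rotated* minimum bounding rectangle, in practice usually obtained with something like OpenCV's cv2.minAreaRect on the mask contour (not shown, to keep this dependency-free):

```python
import numpy as np

def minmax_box(mask):
    """Axis-aligned box from a binary mask: (x1, y1, x2, y2) spanning
    the min/max coordinates of the foreground pixels."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

mask = np.zeros((10, 10), dtype=bool)
mask[2:5, 3:8] = True          # a 3x5 rectangular blob
print(minmax_box(mask))        # (3, 2, 7, 4)
```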

5. Training datasets: COCO [31], ImageNet-VID [47] and YouTube-VOS [58].

6. On VOT, SiamMask outperforms DaSiamRPN and KCF, and its smaller accuracy decay makes it better suited to long videos.

II. Performance Comparison (Figures Reported in the Paper)

1. Network architecture diagram. In practice the network is not as simple as Figure 2 of the paper: there are also a refine module and adjust layers, shown in detail in the appendix; the figures are reproduced here:

[Figures: full SiamMask architecture from the paper's appendix, including the refine module and adjust layers]

2. Comparison with state-of-the-art VOT methods

[Table: comparison with the VOT state of the art]

3. Comparison with state-of-the-art VOS methods

[Table: comparison with the VOS state of the art]

4. Ablation studies

[Table: ablation studies]

III. Quotations Worth Keeping

1.

It finds use in a wide range of scenarios such as automatic surveillance, vehicle navigation, video labelling, human-computer interaction and activity recognition.

(Here "It" refers to video tracking.)

2.

In this paper, we aim at narrowing the gap between arbitrary object tracking and VOS by proposing SiamMask, a simple multi-task learning approach that can be used to address both problems.

3.

To achieve this goal, we simultaneously train a Siamese network on three tasks, each corresponding to a different strategy to establish correspondances between the target object and candidate regions in the new frames.

4.

Performance of Correlation Filter-based trackers has then been notably improved with the adoption of multi-channel formulations [24, 20], spatial constraints [25, 13, 33, 29] and deep features (e.g. [12, 51]).

5. I don't fully understand this part yet and need to study it further:

In order to exploit consistency between video frames, several methods propagate the supervisory segmentation mask of the first frame to the temporally adjacent ones via graph labeling approaches (e.g. [55, 41, 50, 36, 1]). In particular, Bao et al. [1] recently proposed a very accurate method that makes use of a spatio-temporal MRF in which temporal dependencies are modelled by optical flow, while spatial dependencies are expressed by a CNN.

6.

The loss function Lmask (Eq. 3) for the mask prediction task is a binary logistic regression loss over all RoWs:
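As I read Eq. 3, only the positive RoWs (y_n = +1) contribute, and each contributes its per-pixel binary logistic loss averaged over the w*h mask pixels, with ground-truth pixel labels c in {-1, +1}. A sketch of that reading (names and shapes are mine, not the paper's code):

```python
import numpy as np

def siammask_mask_loss(pred_logits, gt_masks, row_labels):
    """Binary logistic loss over RoWs: the (1 + y_n)/2 factor in Eq. 3
    zeroes out negative RoWs; each positive RoW is averaged over its
    w*h pixels, with gt pixel labels c in {-1, +1}."""
    loss = 0.0
    for n in range(pred_logits.shape[0]):
        if row_labels[n] == 1:
            # log(1 + exp(-c * m)) per pixel, mean over the mask
            loss += np.mean(np.log1p(np.exp(-gt_masks[n] * pred_logits[n])))
    return loss

# toy example: 2 RoWs with 4x4 masks, only the first RoW is positive
rng = np.random.default_rng(1)
pred = rng.standard_normal((2, 4, 4))           # predicted logits m
gt = np.sign(rng.standard_normal((2, 4, 4)))    # labels c in {-1, +1}
print(siammask_mask_loss(pred, gt, [1, -1]))
```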

7.

In contrast to semantic segmentation methods in the style of FCN [32] and Mask R-CNN [17], which maintain explicit spatial information throughout the network, our approach follows the spirit of [43, 44] and generates masks starting from a flattened representation of the object.

8. I don't understand this one either:

Similarly to most VOS methods, in case of multiple objects in the same video (DAVIS-2017) we simply perform multiple inferences.

9.

Interestingly, the refinement approach of Pinheiro et al. [44] is very important for the contour accuracy FM, but less so for the other metrics.
