I. My quick understanding of the paper
1. SiamMask combines two tasks in one network: visual object tracking (VOT) and video object segmentation (VOS). On VOT benchmarks it wins on accuracy; on VOS benchmarks it wins on speed. Earlier video segmentation networks typically run below 1 fps, while SiamMask reaches 55 fps. Impressive!
2. Most earlier VOT methods learn a classifier online, updating the template from later frames before re-classifying, i.e. tracking-by-detection (e.g. KCF). Siamese trackers instead learn the similarity between the first-frame template and a search region, producing a response map: the template's feature map is used as a convolution kernel over the search region's feature map. Here a depth-wise correlation produces a multi-channel response of a candidate window (RoW), which encodes richer information than a single-channel score map.
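The depth-wise correlation above can be sketched in PyTorch as follows; the tensor shapes in the test (256 channels, 7x7 template, 31x31 search region) are illustrative assumptions, not values taken from the paper:

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(search_feat: torch.Tensor, template_feat: torch.Tensor) -> torch.Tensor:
    """Slide each channel of the template over the matching channel of the
    search features (via grouped convolution), so the response keeps one
    channel per feature channel -- the multi-channel RoW -- instead of
    collapsing everything into a single score map."""
    b, c, h, w = search_feat.shape
    # One 1-channel kernel per (batch, channel) pair
    kernel = template_feat.reshape(b * c, 1, template_feat.size(2), template_feat.size(3))
    x = search_feat.reshape(1, b * c, h, w)
    out = F.conv2d(x, kernel, groups=b * c)  # each kernel sees only its own channel
    return out.reshape(b, c, out.size(2), out.size(3))
```

The `groups=b*c` trick is the standard way to express per-channel correlation with a single batched `conv2d` call.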
3. Each RoW predicts a single mask, unlike Mask R-CNN, which predicts k masks (one per class). Note also that the box branch predicts k boxes per position, but this k is the preset number of anchors with different scales and aspect ratios.
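A hedged sketch of such a mask branch: every spatial position of the RoW map emits one flattened 63x63 mask (the 63x63 output size matches the paper; the 1x1-conv head shape and the channel count here are my assumptions):

```python
import torch
import torch.nn as nn

class MaskBranch(nn.Module):
    """Predicts ONE flattened mask per RoW position via 1x1 convolutions,
    in the spirit of SiamMask's mask branch -- unlike Mask R-CNN, there is
    no per-class mask dimension."""
    def __init__(self, in_channels: int = 256, mask_size: int = 63):
        super().__init__()
        self.mask_size = mask_size
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, mask_size * mask_size, kernel_size=1),
        )

    def forward(self, row: torch.Tensor) -> torch.Tensor:
        b, _, h, w = row.shape
        flat = self.head(row)  # (b, 63*63, h, w): one flattened mask per position
        return flat.permute(0, 2, 3, 1).reshape(b, h, w, self.mask_size, self.mask_size)
```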
4. How the generated mask is turned into a box for VOT evaluation also affects the results; balancing accuracy and speed, the paper chooses MBR (the rotated minimum bounding rectangle).
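For context, here is a sketch of the simplest box-from-mask strategy, the axis-aligned Min-max box, which the paper compares MBR against (pure NumPy; MBR itself is the rotated minimum-area rectangle, obtainable with e.g. `cv2.minAreaRect` on the mask contour):

```python
import numpy as np

def minmax_box(mask: np.ndarray) -> tuple:
    """Axis-aligned bounding box (x, y, w, h) of a binary mask:
    simply the min/max of the foreground pixel coordinates.
    This is the "Min-max" baseline, not the paper's preferred MBR."""
    ys, xs = np.nonzero(mask)
    x1, x2 = xs.min(), xs.max()
    y1, y2 = ys.min(), ys.max()
    return (int(x1), int(y1), int(x2 - x1 + 1), int(y2 - y1 + 1))
```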
5. Training datasets: COCO [31], ImageNet-VID [47] and YouTube-VOS [58]
6. On VOT it surpasses DaSiamRPN and KCF, with a smaller accuracy decay over time, making it better suited to long videos.
II. Performance comparison (numbers reported in the paper)
1. Network architecture diagram. In practice it is not as simple as Figure 2 in the paper: there are also a refine module and an adjust layer, shown in detail in the appendix. The figure is reproduced here:
![](https://img.laitimes.com/img/9ZDMuAjOiMmIsIjOiQnIsICM38FdsYkRGZkRG9lcvx2bjxiNx8VZ6l2cs0TRE5UMrRlT4tmeORzbywEMW1mY1RzRapnTtxkb5ckYplTeMZTTINGMShUYfRHelRHLwEzX39GZhh2css2RkBnVHFmb1clWvB3MaVnRtp1XlBXe0xyayFWbyVGdhd3LcV2Zh1Wa9M3clN2byBXLzN3btg3Pn5GcuATN0AjNwMjMwEDOwkTMwIzLc52YucWbp5GZzNmLn9Gbi1yZtl2Lc9CX6MHc0RHaiojIsJye.png)
2. Comparison with SOTA VOT methods
3. Comparison with SOTA VOS methods
4. Ablation studies
III. Useful quotes from the paper
1.
It finds use in a wide range of scenarios such as automatic surveillance, vehicle navigation, video labelling, human-computer interaction and activity recognition.
This refers to video tracking.
2.
In this paper, we aim at narrowing the gap between arbitrary object tracking and VOS by proposing SiamMask, a simple multi-task learning approach that can be used to address both problems.
3.
To achieve this goal, we simultaneously train a Siamese network on three tasks, each corresponding to a different strategy to establish correspondances between the target object and candidate regions in the new frames.
4.
Performance of Correlation Filter-based trackers has then been notably improved with the adoption of multi-channel formulations [24, 20], spatial constraints [25, 13, 33, 29] and deep features (e.g. [12, 51])
5. I don't fully understand this part yet and need to study it further:
In order to exploit consistency between video frames, several methods propagate the supervisory segmentation mask of the first frame to the temporally adjacent ones via graph labeling approaches (e.g. [55, 41, 50, 36, 1]). In particular, Bao et al. [1] recently proposed a very accurate method that makes use of a spatio-temporal MRF in which temporal dependencies are modelled by optical flow, while spatial dependencies are expressed by a CNN
6.
The loss function Lmask (Eq. 3) for the mask prediction task is a binary logistic regression loss over all RoWs:
7.
In contrast to semantic segmentation methods in the style of FCN [32] and Mask RCNN [17], which maintain explicit spatial information throughout the network, our approach follows the spirit of [43, 44] and generates masks starting from a flattened representation of the object.
8. I don't understand this one either:
Similarly to most VOS methods, in case of multiple objects in the same video (DAVIS-2017) we simply perform multiple inferences
9.
Interestingly, the refinement approach of Pinheiro et al. [44] is very important for the contour accuracy FM, but less so for the other metrics.