I. A brief understanding of this paper
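1. SiamMask combines two tasks in one network: object tracking and object segmentation. On VOT benchmarks it wins on accuracy; on VOS benchmarks it wins on speed. Earlier video segmentation networks mostly ran at under 1 fps, while this one reaches 55 fps. Impressive!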
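2. Most earlier VOT methods learn a classifier online and then, as needed, update the template on later frames before classifying again. This is tracking-by-detection; KCF is one example. Siamese trackers instead learn the similarity between the first-frame template and the search region, expressed as a response map (the paper calls the response of a candidate window a RoW). The template's feature map is used as a convolution kernel and slid over the search region's feature map. SiamMask uses depth-wise convolution here, which yields multi-channel RoWs that can encode richer information.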
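3. Each RoW predicts only one mask. This differs from Mask R-CNN, which predicts k masks, where k is the number of classes. Note also that the box branch predicts k boxes per RoW, but that k is the number of preset anchor boxes with different scales and aspect ratios.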
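4. How the box used for VOT evaluation is generated from the predicted mask also affects the scores. Weighing accuracy against speed, the paper chooses MBR (minimum bounding rectangle).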
5. Training datasets: COCO [31], ImageNet-VID [47] and YouTube-VOS [58].
6. On VOT, it surpasses DaSiamRPN and KCF. Its decay is small, which makes it better suited to long videos.
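
The depth-wise cross-correlation mentioned in item 2 can be sketched in plain Python. This is only an illustration on toy nested lists, not the authors' implementation: the function name `depthwise_xcorr` and the toy values are made up for this note. In a real network the same operation would be a grouped convolution over feature tensors.

```python
def depthwise_xcorr(search, kernel):
    """Cross-correlate each channel of `search` with the same channel of `kernel`.

    search: list of C channels, each an Hs x Ws nested list (search-region features)
    kernel: list of C channels, each an Hk x Wk nested list (template features)
    returns: list of C response maps, each (Hs-Hk+1) x (Ws-Wk+1)
    """
    assert len(search) == len(kernel), "channel counts must match"
    out = []
    for s_ch, k_ch in zip(search, kernel):
        hs, ws = len(s_ch), len(s_ch[0])
        hk, wk = len(k_ch), len(k_ch[0])
        # Slide the template channel over the search channel (no padding, stride 1).
        resp = [
            [
                sum(
                    s_ch[i + di][j + dj] * k_ch[di][dj]
                    for di in range(hk)
                    for dj in range(wk)
                )
                for j in range(ws - wk + 1)
            ]
            for i in range(hs - hk + 1)
        ]
        out.append(resp)
    return out


# Toy example: 2 channels, 3x3 search features, 2x2 template features.
search = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
template = [
    [[1, 0], [0, 1]],
    [[1, 1], [1, 1]],
]
row = depthwise_xcorr(search, template)
# The values row[c][i][j] over all channels c form the RoW at window (i, j).
print(row)  # [[[6, 8], [12, 14]], [[2, 2], [2, 2]]]
```

Because the channels are never summed together, each spatial position keeps one response value per channel, which is what lets a RoW carry more information than a single similarity score.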
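II. Performance comparison, using the data given in the paper
1. Network architecture diagram. The actual network is not as simple as Figure 2 in the paper suggests: there are also a refine module and adjust layers, which are shown in detail in the appendix and reproduced here: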
![](https://img.laitimes.com/img/9ZDMuAjOiMmIsIjOiQnIsICM38FdsYkRGZkRG9lcvx2bjxiNx8VZ6l2cs0TRE5UMrRlT4tmeORzbywEMW1mY1RzRapnTtxkb5ckYplTeMZTTINGMShUYfRHelRHLwEzX39GZhh2css2RkBnVHFmb1clWvB3MaVnRtp1XlBXe0xyayFWbyVGdhd3LcV2Zh1Wa9M3clN2byBXLzN3btg3Pn5GcuATN0AjNwMjMwEDOwkTMwIzLc52YucWbp5GZzNmLn9Gbi1yZtl2Lc9CX6MHc0RHaiojIsJye.png)
2. Comparison with SOTA work on VOT
3. Comparison with SOTA work on VOS
4. Ablation studies
III. Sentences from the paper that I found useful
1.
It finds use in a wide range of scenarios
such as automatic surveillance, vehicle navigation, video labelling, human-computer interaction and activity recognition.
Here "It" refers to video tracking.
2.
In this paper, we aim at narrowing the gap between arbitrary object tracking and VOS by proposing SiamMask,
a simple multi-task learning approach that can be used
to address both problems.
3.
To achieve this goal, we simultaneously train a Siamese
network on three tasks, each corresponding to a different
strategy to establish correspondances between the target object and candidate regions in the new frames.
4.
Performance of Correlation Filter-based
trackers has then been notably improved with the adoption of multi-channel formulations [24, 20], spatial constraints [25, 13, 33, 29] and deep features (e.g. [12, 51])
5. I don't fully understand this one yet; it needs further study.
In order to exploit consistency between video frames,
several methods propagate the supervisory segmentation
mask of the first frame to the temporally adjacent ones via
graph labeling approaches (e.g. [55, 41, 50, 36, 1]). In
particular, Bao et al. [1] recently proposed a very accurate
method that makes use of a spatio-temporal MRF in which
temporal dependencies are modelled by optical flow, while
spatial dependencies are expressed by a CNN
6.
The loss function Lmask (Eq. 3) for the mask prediction task is a binary
logistic regression loss over all RoWs:
7.
In contrast to semantic segmentation methods in the style of FCN [32] and Mask RCNN [17], which maintain explicit spatial information
throughout the network, our approach follows the spirit
of [43, 44] and generates masks starting from a flattened representation of the object.
8. I don't understand this one either.
Similarly to most VOS
methods, in case of multiple objects in the same video
(DAVIS-2017) we simply perform multiple inferences
9.
Interestingly, the refinement approach of Pinheiro et al. [44]
is very important for the contour accuracy FM, but less so
for the other metrics.
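
As a study aid for excerpt 6, here is a minimal plain-Python sketch of a binary logistic loss over RoWs, following my reading of Eq. 3 in the paper: each RoW has a label y in {+1, -1}, each ground-truth pixel c is in {+1, -1}, m is the predicted logit, and only positive RoWs contribute. The function name `mask_loss` and the toy inputs are my own, not taken from the authors' code.

```python
import math


def mask_loss(labels, gt_masks, pred_masks):
    """Binary logistic loss over RoWs, counted only for positive RoWs.

    labels:     list of y_n in {+1, -1}, one per RoW
    gt_masks:   list of h x w nested lists with pixel labels c in {+1, -1}
    pred_masks: list of h x w nested lists with predicted logits m
    """
    total = 0.0
    for y, gt, pred in zip(labels, gt_masks, pred_masks):
        h, w = len(gt), len(gt[0])
        # Sum of log(1 + exp(-c * m)) over all pixels of this RoW's mask.
        pixel_sum = sum(
            math.log1p(math.exp(-c * m))
            for gt_row, pred_row in zip(gt, pred)
            for c, m in zip(gt_row, pred_row)
        )
        # (1 + y) / 2 is 1 for positive RoWs and 0 for negative ones.
        total += (1 + y) / (2 * w * h) * pixel_sum
    return total


# Toy example: one positive RoW with uninformative logits, one negative RoW.
labels = [1, -1]
gt_masks = [[[1, -1]], [[1, 1]]]
pred_masks = [[[0.0, 0.0]], [[5.0, 5.0]]]
loss = mask_loss(labels, gt_masks, pred_masks)
```

In this toy case the negative RoW is ignored entirely, and the positive RoW with all-zero logits gives a per-pixel loss of log 2, so the total equals log 2.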