
Paper Reading: SiamMask

I. A brief understanding of the paper

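1. SiamMask combines the tasks of two kinds of network: an object tracking network and an object segmentation network. On VOT metrics SiamMask wins on accuracy, and on VOS metrics it wins on speed. Earlier video segmentation networks mostly ran below 1 fps, whereas this network reaches 55 fps. Impressive!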

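2. Most earlier VOT methods learn a classifier online, then update the template as needed on later frames and classify again. This is tracking-by-detection, as in methods such as KCF. The Siamese family of trackers instead learns the similarity between the first-frame template and the search region, i.e. the response map, whose entries the paper calls RoWs (response of a candidate window). The template feature map is treated as a convolution kernel and convolved with the search-region feature map. Here a depth-wise convolution is used to produce multi-channel RoWs, which can encode richer information.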

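3. Each RoW predicts only one mask. This differs from Mask R-CNN, which predicts k masks, where k is the number of classes. One more point: the box branch predicts k boxes per RoW, but there k is the number of predefined boxes with different sizes and aspect ratios.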

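4. How the box used for the VOT metrics is derived from the generated mask also affects the evaluation. Weighing accuracy against speed, the paper chooses MBR (the minimum bounding rectangle).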

5. Training datasets: COCO [31], ImageNet-VID [47] and YouTube-VOS [58]

6. On VOT, it surpasses DaSiamRPN and KCF. Its decay is small, so it is better suited to long videos.
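The depth-wise operation described in item 2 can be sketched in plain Python. This is only an illustration under simplifying assumptions: the feature maps are small nested lists rather than learned CNN features, and the names `depthwise_xcorr` and `row_at` are made up for this example. Each channel of the template is slid over the matching channel of the search region, so the output keeps one response map per channel, and the vector of channel values at one spatial position plays the role of one RoW.

```python
def depthwise_xcorr(search, template):
    """Cross-correlate each channel of `search` with the same channel of `template`.

    search:   list of C channels, each an Hs x Ws list of lists
    template: list of C channels, each an Ht x Wt list of lists
    returns:  list of C response maps, each (Hs - Ht + 1) x (Ws - Wt + 1)
    """
    if len(search) != len(template):
        raise ValueError("search and template need the same number of channels")
    responses = []
    for s_ch, t_ch in zip(search, template):
        hs, ws = len(s_ch), len(s_ch[0])
        ht, wt = len(t_ch), len(t_ch[0])
        out = []
        for i in range(hs - ht + 1):
            row = []
            for j in range(ws - wt + 1):
                # Dot product between the template and the window at (i, j)
                acc = 0.0
                for u in range(ht):
                    for v in range(wt):
                        acc += s_ch[i + u][j + v] * t_ch[u][v]
                row.append(acc)
            out.append(row)
        responses.append(out)
    return responses


def row_at(responses, i, j):
    """Collect the C-dimensional response vector at spatial position (i, j)."""
    return [ch[i][j] for ch in responses]


# Toy input: 2 channels, 3x3 search region, 2x2 template
search = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
template = [
    [[1, 0], [0, 1]],
    [[1, 1], [1, 1]],
]
resp = depthwise_xcorr(search, template)
print(resp)                # one 2x2 response map per channel
print(row_at(resp, 0, 0))  # the 2-dimensional response at the top-left window
```

Because the channels are never summed together, the output has as many channels as the input features. A plain cross-correlation would collapse them into a single-channel score map.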
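To make item 4 concrete, here is a dependency-free sketch of two ways to turn a binary mask into a box: the axis-aligned min-max box and a rotated minimum bounding rectangle (MBR). It is an illustration, not the paper's code. The function names are hypothetical, pixels are treated as points at their integer coordinates, and the MBR search relies on the standard geometric fact that a minimum-area enclosing rectangle has one side along an edge of the convex hull. In practice a library routine such as OpenCV's `cv2.minAreaRect` would be the usual choice.

```python
import math


def foreground_points(mask):
    """(x, y) coordinates of all non-zero pixels in a 2-D mask."""
    return [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]


def min_max_box(mask):
    """Axis-aligned box (x_min, y_min, x_max, y_max), or None for an empty mask."""
    pts = foreground_points(mask)
    if not pts:
        return None
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return min(xs), min(ys), max(xs), max(ys)


def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices in order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def min_area_rect(points):
    """Smallest-area rotated rectangle: returns (area, angle_rad, width, height)."""
    hull = convex_hull(points)
    if len(hull) < 3:
        raise ValueError("need at least three non-collinear points")
    best = None
    for k in range(len(hull)):
        (x0, y0), (x1, y1) = hull[k], hull[(k + 1) % len(hull)]
        theta = math.atan2(y1 - y0, x1 - x0)
        c, s = math.cos(theta), math.sin(theta)
        # Rotate by -theta so this hull edge is horizontal, then take extents
        us = [x * c + y * s for x, y in hull]
        vs = [-x * s + y * c for x, y in hull]
        w, h = max(us) - min(us), max(vs) - min(vs)
        if best is None or w * h < best[0]:
            best = (w * h, theta, w, h)
    return best


# A diamond-shaped mask: the axis-aligned box is much looser than the MBR
mask = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]
print(min_max_box(mask))
print(min_area_rect(foreground_points(mask)))
```

For this diamond, the axis-aligned box spans a 4 x 4 extent, while the rotated rectangle covers roughly half that area. This shows why the choice of box representation can change an overlap-based tracking score.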

II. Performance comparison, using the data given in the paper

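1. Network architecture diagram. The actual network is not as simple as Figure 2 in the paper: there are also a refine module and adjust layers, which are shown in detail in the appendix and are also included here:

[Figures: network architecture, refine module and adjust layers]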

2. Comparison with SOTA work on VOT

[Figure: comparison with SOTA work on VOT]

3. Comparison with SOTA work on VOS

[Figure: comparison with SOTA work on VOS]

4. Ablation studies

[Figure: ablation studies]

III. Excerpts of original sentences that I found useful

1.

It finds use in a wide range of scenarios such as automatic surveillance, vehicle navigation, video labelling, human-computer interaction and activity recognition.

Here "It" refers to video tracking.

2.

In this paper, we aim at narrowing the gap between arbitrary object tracking and VOS by proposing SiamMask, a simple multi-task learning approach that can be used to address both problems.

3.

To achieve this goal, we simultaneously train a Siamese network on three tasks, each corresponding to a different strategy to establish correspondances between the target object and candidate regions in the new frames.

4.

Performance of Correlation Filter-based trackers has then been notably improved with the adoption of multi-channel formulations [24, 20], spatial constraints [25, 13, 33, 29] and deep features (e.g. [12, 51])

5. I do not fully understand this one yet and need to keep studying it.

In order to exploit consistency between video frames, several methods propagate the supervisory segmentation mask of the first frame to the temporally adjacent ones via graph labeling approaches (e.g. [55, 41, 50, 36, 1]). In particular, Bao et al. [1] recently proposed a very accurate method that makes use of a spatio-temporal MRF in which temporal dependencies are modelled by optical flow, while spatial dependencies are expressed by a CNN

6.

The loss function Lmask (Eq. 3) for the mask prediction task is a binary logistic regression loss over all RoWs:

7.

In contrast to semantic segmentation methods in the style of FCN [32] and Mask RCNN [17], which maintain explicit spatial information throughout the network, our approach follows the spirit of [43, 44] and generates masks starting from a flattened representation of the object.

8. I do not understand this one either.

Similarly to most VOS methods, in case of multiple objects in the same video (DAVIS-2017) we simply perform multiple inferences

9.

Interestingly, the refinement approach of Pinheiro et al. [44] is very important for the contour accuracy FM, but less so for the other metrics.
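The binary logistic regression loss quoted in excerpt 6 can be sketched as follows. This is a minimal illustration, assuming the usual form of such a loss: each RoW has a label y in {-1, +1}, each ground-truth mask pixel c is in {-1, +1}, each predicted pixel m is a raw score, and a RoW contributes (1 + y) / (2wh) times the sum of log(1 + exp(-c * m)) over its w x h pixels, so only positive RoWs count. The function names and the input layout are assumptions made for this example.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mask_loss(rows):
    """Binary logistic mask loss summed over RoWs.

    rows: iterable of (y, target, logits) where
      y      is +1 for a positive RoW and -1 for a negative one,
      target is an h x w grid of ground-truth labels in {-1, +1},
      logits is an h x w grid of predicted mask scores.
    Negative RoWs get weight (1 + y) / 2 = 0 and do not contribute.
    """
    total = 0.0
    for y, target, logits in rows:
        h, w = len(target), len(target[0])
        pixel_sum = sum(
            softplus(-c * m)
            for c_row, m_row in zip(target, logits)
            for c, m in zip(c_row, m_row)
        )
        total += (1 + y) / (2 * w * h) * pixel_sum
    return total


target = [[1, -1], [1, 1]]
uninformed = [[0.0, 0.0], [0.0, 0.0]]    # score 0 everywhere
confident = [[8.0, -8.0], [8.0, 8.0]]    # agrees with the target

print(mask_loss([(1, target, uninformed)]))   # per-pixel average of log(2)
print(mask_loss([(1, target, confident)]))    # close to zero
print(mask_loss([(-1, target, uninformed)]))  # negative RoW is ignored
```

A prediction of zero everywhere costs log 2 per pixel, a confident correct prediction costs almost nothing, and a negative RoW is skipped entirely whatever its predicted mask looks like.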
