Deeper and Wider Siamese Networks for Real-Time Visual Tracking (CVPR'19): Reading Notes

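The authors analyze why Siamese trackers fail to benefit from deeper networks, and design a new network architecture that lets this family of trackers use deeper and wider backbones to improve performance.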

Abstract.

Siamese networks have drawn great attention in visual tracking because of their balanced accuracy and speed. However, the backbone networks used in Siamese trackers are relatively shallow, such as AlexNet, which does not fully take advantage of the capability of modern deep neural networks.

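Siamese-based trackers strike a balance between accuracy and speed, but because most of them use shallow backbones, they cannot fully exploit the capability of deep neural networks.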

In this paper, we investigate how to leverage deeper and wider convolutional neural networks to enhance tracking robustness and accuracy.

The paper studies how to use deeper and wider convolutional neural networks to improve the robustness and accuracy of tracking.

We observe that direct replacement of backbones with existing powerful architectures, such as ResNet and Inception, does not bring improvements. The main reasons are that 1) large increases in the receptive field of neurons lead to reduced feature discriminability and localization precision; and 2) the network padding for convolutions induces a positional bias in learning.

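The authors observe that directly replacing the backbone with an existing powerful architecture (such as ResNet or Inception) does not improve performance. The main reasons are: 1) a large increase in the receptive field of neurons reduces feature discriminability and localization precision; 2) the padding used in convolutions introduces a positional bias during learning.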

To address these issues, we propose new residual modules to eliminate the negative impact of padding, and further design new architectures using these modules with controlled receptive field size and network stride. The designed architectures are lightweight and guarantee real-time tracking speed when applied to SiamFC and SiamRPN.

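To address these problems, the authors propose new residual modules that eliminate the negative impact of padding, and then use these modules to design new architectures with controlled receptive field size and network stride. The resulting architectures are lightweight, which guarantees real-time tracking speed when they are applied to SiamFC and SiamRPN.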

Experiments show that solely due to the proposed network architectures, our SiamFC+ and SiamRPN+ obtain up to 9.8%/5.7% (AUC), 23.3%/8.8% (EAO) and 24.4%/25.0% (EAO) relative improvements over the original versions on the OTB-15, VOT-16 and VOT-17 datasets, respectively.

The experiments show that the architectural changes alone bring significant performance gains.

Background and Motivation:

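This paper addresses a curious phenomenon in tracking: as the network gets deeper (that is, when the usual backbone such as AlexNet is replaced with an existing network such as ResNet or Inception), tracking performance drops instead of improving, as shown in Figure 1.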

The authors find that the following factors have a very large impact on tracking results: (1) the receptive field size of neurons; (2) the network stride; (3) feature padding.

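1. The receptive field determines the image region used to compute a feature. A larger receptive field provides more image context, while a small one may fail to capture the structure of the target;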

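2. The network stride affects the degree of localization precision, especially for small objects; it also controls the size of the output feature map, which in turn affects feature discriminability and detection accuracy.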

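3. In a fully convolutional architecture, feature padding for convolutions introduces a potential positional bias during training, so that accurate prediction becomes difficult when a target moves close to the boundary of the search range.

Together, these three factors prevent Siamese trackers from benefiting from more advanced models.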
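The first two factors can be made concrete with a small calculation. The sketch below computes the receptive field, the total stride, and the output feature map size for a stack of unpadded layers. The layer list `ALEXNET_LIKE` is an assumption that imitates an AlexNet-style SiamFC backbone (kernel size and stride per layer), and the function names are hypothetical; the input sizes 127 and 255 are the usual SiamFC exemplar and search sizes, also assumed here rather than taken from these notes.

```python
# Each layer is (kernel_size, stride), with no padding.
# ASSUMPTION: this list imitates an AlexNet-style SiamFC backbone
# (conv1, pool1, conv2, pool2, conv3, conv4, conv5); it is illustrative only.
ALEXNET_LIKE = [(11, 2), (3, 2), (5, 1), (3, 2), (3, 1), (3, 1), (3, 1)]


def receptive_field_and_stride(layers):
    """Return (receptive field, total stride) of one output neuron."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) * jump
        jump *= stride             # strides multiply through the stack
    return rf, jump


def output_size(input_size, layers):
    """Spatial size of the output map for a square input, without padding."""
    size = input_size
    for kernel, stride in layers:
        size = (size - kernel) // stride + 1
    return size


rf, total_stride = receptive_field_and_stride(ALEXNET_LIKE)
print("receptive field:", rf, "total stride:", total_stride)
print("exemplar 127 ->", output_size(127, ALEXNET_LIKE))
print("search   255 ->", output_size(255, ALEXNET_LIKE))
```

Adding layers or strides to the list shows how quickly the receptive field grows and the output map shrinks, which is the trade-off the list above describes.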

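In this paper, the authors tackle these problems by designing new network architectures, so that Siamese networks achieve better tracking performance. The main contributions are: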

1. Based on the "bottleneck" residual block, the authors propose a set of cropping-inside residual (CIR) units. These units remove the effect of padding and thereby prevent the convolution kernels from learning a position bias;

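2. The authors design two kinds of network architectures by stacking the CIR units, called Deeper and Wider networks. In these networks, the stride and the neuron receptive field are controlled to improve localization accuracy;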

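3. The authors apply the designed backbone networks to SiamFC and SiamRPN. Their experiments show large improvements on multiple datasets. A further advantage is that the designed architectures are lightweight, which allows the trackers to run in real time.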
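The cropping idea behind the CIR units can be illustrated with a minimal pure-Python sketch. It is a simplified stand-in, not the paper's module: it assumes a single channel, a single zero-padded 3x3 convolution in place of the full bottleneck, an identity shortcut, and no activation. All function names are hypothetical. The point is only that outputs on the border of a padded convolution are computed partly from zeros, and cropping the outermost ring after the addition removes exactly those padding-affected values.

```python
def conv3x3_zero_pad(x, kernel):
    """3x3 cross-correlation with zero padding 1 and stride 1 (same output size)."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    r, c = i + di, j + dj
                    if 0 <= r < h and 0 <= c < w:  # positions outside count as zero
                        total += x[r][c] * kernel[di + 1][dj + 1]
            out[i][j] = total
    return out


def crop_inside(x, margin=1):
    """Drop the outermost `margin` rows and columns of a 2D map."""
    return [row[margin:-margin] for row in x[margin:-margin]]


def cir_like_unit(x, kernel):
    """Simplified CIR-style unit (assumption): padded conv + shortcut, then crop."""
    conv = conv3x3_zero_pad(x, kernel)
    summed = [[a + b for a, b in zip(rx, rc)] for rx, rc in zip(x, conv)]
    return crop_inside(summed)


x = [[1.0] * 5 for _ in range(5)]       # constant 5x5 input
kernel = [[1.0] * 3 for _ in range(3)]  # all-ones 3x3 kernel

padded = conv3x3_zero_pad(x, kernel)
print(padded[0])                 # border row: values depend on the zero padding
print(cir_like_unit(x, kernel))  # after cropping, every value is padding-free
```

On a constant input the padded convolution alone produces different values at the border than in the interior, which is the kind of position-dependent signal a network could learn from; after cropping, the remaining map is uniform again.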

References

1. Deeper and Wider Siamese Networks for Real-Time Visual Tracking (https://openaccess.thecvf.com/content_CVPR_2019/papers/Zhang_Deeper_and_Wider_Siamese_Networks_for_Real-Time_Visual_Tracking_CVPR_2019_paper.pdf)

2. https://blog.csdn.net/baidu_36669549/article/details/86291495

3. https://www.cnblogs.com/wangxiaocvpr/p/10533069.html

4. https://baijiahao.baidu.com/s?id=1627060097929333328&wfr=spider&for=pc