Reading AlignedReID

AlignedReID: Surpassing Human-Level Performance in Person Re-Identification[^1]

This is a new paper from Megvii, currently available only on arXiv.

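It points out the weaknesses of existing methods that learn global features with a CNN:

- Inaccurate detection bounding boxes hurt feature learning
- Pose changes and non-rigid deformations make metric learning difficult
- Occlusion
- Different people can look very similar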

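The paper proposes a method called AlignedReID.

AlignedReID still learns a global feature, but it performs automatic part alignment during learning, without extra supervision or explicit pose estimation.

In the experiments it achieves state-of-the-art results on Market1501, CUHK03, MARS and CUHK-SYSU.

On Market1501 and CUHK03 the results are even better than human performance…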

AlignedReID

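During training the network has a global feature branch and a local feature branch. On the local branch, a shortest path loss is introduced to compute the distance between two sets of local features.

At test time (inference), the local features are discarded and only the global feature is extracted. The main reason is that the local features and the global feature are derived from the same feature map.

Skipping local feature matching also saves a lot of computation.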

We find that only applying the global feature is almost as good as combining global and local features. In other words, the global feature itself, with the aid of local features learning, can greatly address the drawbacks we mentioned above, in our new joint learning framework.

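With AlignedReID, each image yields a single global feature as the final output, and the L2 distance is used as the similarity metric. The global and local features are learned jointly.

As shown in the figure: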

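For each image, a CNN extracts a feature map; the paper uses ResNet50. (The output of the last convolutional layer has size $C \times H \times W$, where $C$ is the number of channels and $H \times W$ is the spatial size.) The global feature is obtained by applying global pooling directly to the feature map.

The local features are obtained by horizontal pooling (global pooling in the horizontal direction): one feature is first extracted for each row of the feature map, and a $1 \times 1$ convolution then reduces the number of channels from $C$ to $c$. Each local feature (a $c$-dimensional vector) therefore represents one horizontal stripe of the person image. In the end, one image gives one global feature and $H$ local features.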
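The pooling steps can be sketched in plain Python on a toy feature map. This is only an illustration: the function names are hypothetical, average pooling is an assumption (the notes above do not say which pooling operator is used), and the $1 \times 1$ convolution is written as a $c \times C$ weight matrix applied to every row feature.

```python
def global_pool(fmap):
    """Average every channel over all H x W positions -> one C-dim vector."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def horizontal_pool(fmap):
    """Average every channel along the width -> H vectors of dimension C."""
    num_channels, height = len(fmap), len(fmap[0])
    return [
        [sum(fmap[ch][h]) / len(fmap[ch][h]) for ch in range(num_channels)]
        for h in range(height)
    ]


def conv1x1(row_features, weight):
    """Apply a 1x1 convolution (a c x C matrix) to each C-dim row feature."""
    return [
        [sum(w * x for w, x in zip(w_row, feat)) for w_row in weight]
        for feat in row_features
    ]


# Toy feature map with C=2 channels, H=3 rows, W=2 columns (made-up values).
fmap = [
    [[1.0, 3.0], [2.0, 2.0], [0.0, 4.0]],
    [[4.0, 0.0], [1.0, 1.0], [2.0, 6.0]],
]
weight = [[1.0, 1.0]]  # hypothetical 1x1 conv reducing C=2 channels to c=1

global_feature = global_pool(fmap)                       # one C-dim vector
local_features = conv1x1(horizontal_pool(fmap), weight)  # H vectors of dim c
```

The point of the sketch is the shapes: one vector for the whole image, plus one short vector per row of the feature map.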

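Computing the distance between local features

$f$ and $g$ denote the local features of two images, and $f_i$ and $g_j$ denote the $i$-th row of $f$ and the $j$-th row of $g$. (This is my guess; the paper does not state it explicitly.)

That is, $f = \{ f_1,f_2,\dots,f_H \}^T$, $g = \{ g_1,g_2,\dots,g_H \}^T$

The distance between each pair of rows of $f$ and $g$ is normalized to $[0, 1)$:

$$ d_{i,j} = \frac{e^{\lVert f_i-g_j \rVert_2} - 1}{e^{\lVert f_i-g_j \rVert_2} + 1} \qquad i,j \in \{1,2,3,\dots,H\} $$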
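A small sketch of this normalization, assuming the intended transform is $(e^{x} - 1)/(e^{x} + 1)$ with $x = \lVert f_i - g_j \rVert_2$. That expression equals $\tanh(x/2)$, which is why the result always lies in $[0, 1)$. The function name and the toy vectors are hypothetical.

```python
import math


def local_distance_matrix(f, g):
    """Return D with D[i][j] = (e^x - 1) / (e^x + 1), x = ||f_i - g_j||_2."""
    matrix = []
    for f_i in f:
        row = []
        for g_j in g:
            x = math.sqrt(sum((a - b) ** 2 for a, b in zip(f_i, g_j)))
            # tanh(x / 2) is algebraically the same expression and does not
            # overflow for large x.
            row.append(math.tanh(x / 2.0))
        matrix.append(row)
    return matrix


# Two made-up sets of local features (H=2 rows, c=2 dimensions each).
f_rows = [[0.0, 0.0], [3.0, 4.0]]
g_rows = [[0.0, 0.0], [3.0, 4.0]]
dist = local_distance_matrix(f_rows, g_rows)
```

Identical rows get distance 0, and very different rows saturate towards 1.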

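The distance matrix $D$ is then defined as the matrix whose $(i,j)$ element is $d_{i,j}$.

The local distance between the two images is defined as the total distance of the shortest path from $(1,1)$ to $(H,H)$ in $D$.

For the shortest path, $S_{i,j}$ denotes the total distance of the shortest path from $(1,1)$ to $(i,j)$, so $S_{H,H}$ is the total distance of the shortest path between the two images.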
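The shortest path can be sketched as a small dynamic program. The sketch assumes that a path may only step right or down in the distance matrix, so that $S_{i,j}$ is $d_{i,j}$ plus the smaller of $S_{i-1,j}$ and $S_{i,j-1}$; this restriction is my reading and is not stated in the notes above. The function name and the toy matrix are hypothetical.

```python
def shortest_path_distance(dist):
    """Total distance of the cheapest right/down path from top-left to bottom-right."""
    height, width = len(dist), len(dist[0])
    s = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            if i == 0 and j == 0:
                best_prev = 0.0                       # start cell
            elif i == 0:
                best_prev = s[i][j - 1]               # first row: only from the left
            elif j == 0:
                best_prev = s[i - 1][j]               # first column: only from above
            else:
                best_prev = min(s[i - 1][j], s[i][j - 1])
            s[i][j] = best_prev + dist[i][j]
    return s[-1][-1]


# Made-up 3x3 distance matrix; small entries mark well-matched stripes.
toy_dist = [
    [0.1, 0.9, 0.9],
    [0.9, 0.2, 0.9],
    [0.9, 0.1, 0.3],
]
local_distance = shortest_path_distance(toy_dist)
```

In this toy matrix the cheapest path passes through the small entries at (2,2), (3,2) and (3,3), which is the alignment effect the method relies on.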

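As the figure above shows, part 1 of image A is aligned with part 4 of image B; such pairs are called corresponding body parts. Pairs that are not aligned, such as (A1, B1), are called non-corresponding body parts. The non-corresponding body parts are still useful. From the formula for $d_{i,j}$, a non-aligned pair has a large L2 distance, where the normalized distance saturates, so its gradient is close to 0 and it contributes little to the overall shortest path computation. The local distance between two images is therefore determined mainly by the corresponding body parts.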

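Loss function on the distances

The loss follows the idea of TriHard loss[^2]: for each sample, the most dissimilar sample with the same identity and the most similar sample with a different identity are picked to form a triplet.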
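A minimal sketch of this kind of hard-sample mining on a precomputed distance matrix. It is not the authors' code: the function name, the margin value of 0.3 and the toy batch are assumptions made for illustration.

```python
def batch_hard_triplet_loss(dist, labels, margin=0.3):
    """For each anchor, pair the farthest positive with the closest negative."""
    n = len(labels)
    losses = []
    for a in range(n):
        positives = [dist[a][k] for k in range(n) if k != a and labels[k] == labels[a]]
        negatives = [dist[a][k] for k in range(n) if labels[k] != labels[a]]
        if not positives or not negatives:
            continue  # this anchor cannot form a triplet in the batch
        losses.append(max(0.0, margin + max(positives) - min(negatives)))
    return sum(losses) / len(losses) if losses else 0.0


# Made-up batch: two identities with two images each.
toy_labels = [0, 0, 1, 1]
toy_dist = [
    [0.0, 0.5, 0.6, 1.0],
    [0.5, 0.0, 0.9, 0.4],
    [0.6, 0.9, 0.0, 0.2],
    [1.0, 0.4, 0.2, 0.0],
]
triplet_loss = batch_hard_triplet_loss(toy_dist, toy_labels)
```

Only the hardest positive and the hardest negative of each anchor enter the loss, so easy triplets do not dilute the training signal.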

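Note that during training, similarity is computed from both the global and the local distance.

At test time, however, the closest match is taken with respect to the global distance only; the local distance is not used as well. The first reason is efficiency, and the second is that the two distances do not differ in any substantial way (my own understanding).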

Mutual Learning

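Mutual learning is used to improve the model further.

Prior work:

- Distillation-based models transfer knowledge from a pre-trained large teacher network to a small student network
- Only the Kullback-Leibler divergence[^3] is used to measure the distance between classification probabilities

Contributions of this paper:

- A set of student models is trained simultaneously, teaching each other
- Several loss functions are used (four in total)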
  
The loss consists of four parts:

1. metric loss: determined by the global and local distances
2. metric mutual loss: determined by the global distance
3. classification loss
4. classification mutual loss: determined by the KL divergence between classification outputs
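For the fourth item, a KL-based mutual loss between two classifiers can be sketched as below. This is an assumption-laden illustration: the function names are hypothetical, the logits are made up, and returning one KL term per network is my choice for the sketch, not something the notes above specify.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)


def classification_mutual_losses(logits_1, logits_2):
    """Return (loss for network 1, loss for network 2).

    Each network is pulled towards the other network's predicted distribution.
    """
    p1, p2 = softmax(logits_1), softmax(logits_2)
    return kl_divergence(p2, p1), kl_divergence(p1, p2)


# Made-up logits of two student networks for one image and three identities.
loss_1, loss_2 = classification_mutual_losses([2.0, 0.5, 0.1], [1.5, 1.0, 0.2])
```

Both values are zero only when the two networks predict the same distribution.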
  
The metric mutual loss is:

$$ L_M = \frac{1}{N^2} \sum_{i}^{N} \sum_{j}^{N} \left( \left[ ZG(M_{ij}^{\theta_1}) - M_{ij}^{\theta_2} \right]^2 + \left[ M_{ij}^{\theta_1} - ZG(M_{ij}^{\theta_2}) \right]^2 \right) $$
  
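Here $ZG(\cdot)$ is the zero gradient function: it treats its argument as a constant when gradients are computed, so back-propagation stops there during learning. (I have not figured out exactly how this is done.)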
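One way to picture the zero gradient function is to write out the gradient by hand. The sketch below assumes both bracketed terms of $L_M$ are differences, $[ZG(M^{\theta_1}) - M^{\theta_2}]^2$ and $[M^{\theta_1} - ZG(M^{\theta_2})]^2$, and uses hypothetical function names and made-up matrices. In the forward pass ZG changes nothing. In the backward pass, only the second term depends on $M^{\theta_1}$, because the first term sees $M^{\theta_1}$ through ZG.

```python
def metric_mutual_loss(m1, m2):
    """Forward value of L_M; ZG is the identity in the forward pass."""
    n = len(m1)
    total = 0.0
    for i in range(n):
        for j in range(n):
            first = (m1[i][j] - m2[i][j]) ** 2   # [ZG(M1) - M2]^2
            second = (m1[i][j] - m2[i][j]) ** 2  # [M1 - ZG(M2)]^2
            total += first + second
    return total / n ** 2


def grad_wrt_m1(m1, m2):
    """dL_M/dM1 with ZG applied: M2 and ZG(M1) are treated as constants."""
    n = len(m1)
    return [[2.0 * (m1[i][j] - m2[i][j]) / n ** 2 for j in range(n)] for i in range(n)]


# Made-up 2x2 batch distance matrices from two networks.
m_theta1 = [[0.0, 1.0], [1.0, 0.0]]
m_theta2 = [[0.0, 0.5], [0.7, 0.0]]
loss_value = metric_mutual_loss(m_theta1, m_theta2)
grad_1 = grad_wrt_m1(m_theta1, m_theta2)
```

In autodiff frameworks this behaviour is usually provided by a stop-gradient primitive, for example `Tensor.detach()` in PyTorch or `jax.lax.stop_gradient` in JAX. Whether the authors used one of these is not stated in the notes above.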
  
The second-order gradient is:

$$ \frac{\partial^2 L_M}{\partial M_{ij}^{\theta_1} \partial M_{ij}^{\theta_2}} = 0 $$
  
Advantages:

1. Faster convergence
2. Using ZG gives better results than a mutual loss without it

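Summary

Judging from the experimental results..

This basically leaves no room for anyone else working on ReID.. Last year, getting above 60% was enough to publish a paper.. this year it took 70-80%.. and this one goes straight to 90+%…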

[^1]: X. Zhang, H. Luo, X. Fan, W. Xiang, Y. Sun, Q. Xiao, W. Jiang, C. Zhang, and J. Sun. AlignedReID: Surpassing Human-Level Performance in Person Re-Identification. arXiv preprint arXiv:1711.08184, 2017.

[^2]: A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.

[^3]: Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu. Deep mutual learning. arXiv preprint arXiv:1706.00384, 2017.