FAST R-CNN
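This post covers the advantages of Fast R-CNN over the original R-CNN. Fast R-CNN is an improved successor to both R-CNN and SPPnet. Most of the material is excerpted from here.
Drawbacks of earlier methods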
R-CNN
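- Training is a multi-stage pipeline. First a ConvNet is fine-tuned on object proposals, then linear SVMs are fit to the ConvNet features computed for the proposals, and finally bounding-box regressors are trained.
- Training is expensive in time and space. A large number of proposals must be extracted from every image, and features must be extracted from each proposal and written to disk.
- Testing is slow. A large number of proposals must likewise be extracted from every test image, and features must be computed for each proposal before detection can run, so the per-image cost grows with the number of proposals.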
SPPnet
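- SPPnet is already faster than R-CNN, because it extracts the features for each proposal only at the last convolutional layer of the ConvNet. It still has shortcomings. Like R-CNN, its training is a multi-stage pipeline and features are written to disk. In addition, fine-tuning in SPPnet only updates the fully connected layers that follow the SPP layer, which is a real limitation for very deep networks.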
Overview
In response to the drawbacks of R-CNN and SPPnet, Fast R-CNN (FRCNN) has the following advantages:
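- Higher detection quality (mAP) than R-CNN
- Single-stage training, achieved by combining the losses of several tasks into one multi-task loss
- Training can update all network layers
- No features need to be stored on disk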
The overall architecture is shown in the figure below.
Fast R-CNN architecture. An input image and multiple regions of interest (RoIs) are input into a fully convolutional network. Each RoI is pooled into a fixed-size feature map and then mapped to a feature vector by fully connected layers (FCs). The network has two output vectors per RoI: softmax probabilities and per-class bounding-box regression offsets. The architecture is trained end-to-end with a multi-task loss.
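Roughly, the pipeline works as follows. Selective search generates about 2000 object proposals (RoIs) for an image. The image and its RoIs are fed together into a fully convolutional network. On the last convolutional feature map, each RoI is mapped to its corresponding region, and an RoI pooling layer brings every region to the same fixed size. Two fully connected (FC) layers then turn it into a feature vector. That feature vector passes through two separate FC layers to give two output vectors: the first is the classification, using softmax, and the second is the per-class bounding-box regression.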
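The step of mapping an RoI onto the last convolutional feature map can be sketched as below. This is only an illustration, not the authors' code: the function name `project_roi` and the parameter `spatial_scale` are hypothetical, the default scale of 1/16 assumes a VGG16-style backbone whose last conv feature map has a total stride of 16, and the "+1" inclusive-width convention is an assumption.

```python
import math


def project_roi(box, spatial_scale=1.0 / 16):
    """Map an image-space box (x1, y1, x2, y2) to feature-map cells.

    Returns (r, c, h, w): top-left cell (row, column) plus height and width.
    spatial_scale is an assumed value (1/16 fits a stride-16 backbone).
    """
    def scale(v):
        # Round half up explicitly; Python's round() rounds halves to even.
        return int(math.floor(v * spatial_scale + 0.5))

    x1, y1, x2, y2 = (scale(v) for v in box)
    # Treat both corners as inclusive and keep at least one cell per axis.
    h = max(y2 - y1 + 1, 1)
    w = max(x2 - x1 + 1, 1)
    return (y1, x1, h, w)


# A proposal given in pixel coordinates of the input image.
print(project_roi((32, 48, 319, 239)))
```

Because the convolutional features are computed once per image, this projection is the only per-proposal work needed before pooling.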
ROI POOLING LAYER
RoI (region of interest) pooling is a simplified version of the SPP pooling layer, with only a single pyramid level.
The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H × W (e.g., 7 × 7), where H and W are layer hyper-parameters that are independent of any particular RoI. In this paper, an RoI is a rectangular window into a conv feature map. Each RoI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w).
RoI max pooling works by dividing the h × w RoI window into an H × W grid of sub-windows of approximate size h/H × w/W and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel, as in standard max pooling. The RoI layer is simply the special-case of the spatial pyramid pooling layer used in SPPnets in which there is only one pyramid level. We use the pooling sub-window calculation given in SPPnets.
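The RoI max pooling described in the quote can be sketched in plain Python for a single channel. This is a minimal sketch under stated assumptions, not the reference implementation: the function name `roi_max_pool` is hypothetical, the RoI is assumed to lie fully inside the feature map, and the sub-window boundaries use a floor/ceil split, which is one common choice that keeps every bin non-empty.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one RoI of a single-channel feature map to out_h x out_w.

    feature_map: list of rows (list of lists of numbers).
    roi: (r, c, h, w), top-left cell (r, c) and size (h, w) in cells.
    """
    r, c, h, w = roi
    pooled = []
    for i in range(out_h):
        # Rows of sub-window i: floor for the start, ceil for the end.
        r0 = r + (i * h) // out_h
        r1 = r + ((i + 1) * h + out_h - 1) // out_h
        row = []
        for j in range(out_w):
            c0 = c + (j * w) // out_w
            c1 = c + ((j + 1) * w + out_w - 1) // out_w
            # Max over every cell of the sub-window.
            row.append(max(feature_map[y][x]
                           for y in range(r0, r1)
                           for x in range(c0, c1)))
        pooled.append(row)
    return pooled


# A 4 x 6 feature map whose value at (y, x) is y * 6 + x.
fm = [[y * 6 + x for x in range(6)] for y in range(4)]
print(roi_max_pool(fm, (0, 0, 4, 6), 2, 2))  # whole map pooled to 2 x 2
print(roi_max_pool(fm, (1, 1, 3, 5), 2, 2))  # a 3 x 5 RoI pooled to 2 x 2
```

Whatever the RoI size, the output is always out_h × out_w, which is what lets the following fully connected layers accept proposals of any shape. With several channels, the same function would be applied to each channel independently.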
[A detailed introduction to this pooling is available here](http://blog.csdn.net/mao_kun/article/details/50507376)
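Spatial pyramid pooling converts the convolutional features of an image of any scale into a representation of the same dimension. This lets a CNN process images of arbitrary scale, and it also avoids cropping and warping, both of which lose some information.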
PRE-TRAINED NETWORK
Three ImageNet pre-trained networks are used (CaffeNet, VGG_CNN_M_1024, and VGG16). When a pre-trained network initializes FRCNN, it undergoes three transformations:
The last max pooling layer is replaced by an RoI pooling layer, with H and W set to be compatible with the first fully connected layer (e.g., H = W = 7 for VGG16).
First, the last max pooling layer is replaced by a RoI pooling layer that is configured by setting H and W to be compatible with the net’s first fully connected layer (e.g., H = W = 7 for VGG16).
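The last fully connected layer and softmax (originally over 1000 classes) are replaced by a softmax classification layer over K+1 categories and a bounding-box regression layer.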
Second, the network’s last fully connected layer and softmax (which were trained for 1000-way ImageNet classification) are replaced with the two sibling layers described earlier (a fully connected layer and softmax over K + 1 categories and category-specific bounding-box regressors).
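The input is changed to take two kinds of data: a list of N images and a list of R RoIs. The batch size, the number of RoIs, and the image resolution can all vary.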
Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images.
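One way to picture the two inputs is a list of images plus a flat list of RoIs, each RoI tagged with the index of the image it belongs to. The layout below is an assumption made for illustration: the tuple format `(image_index, x1, y1, x2, y2)`, the helper `group_rois_by_image`, and all the numbers are hypothetical.

```python
# Hypothetical mini-batch: N = 2 images and R = 4 RoIs.
# Only the image sizes are listed here; resolution may differ per image.
image_sizes = [(600, 800), (600, 1000)]  # (height, width)

# Each RoI is (image_index, x1, y1, x2, y2) in pixels of its own image.
rois = [
    (0, 10, 20, 110, 220),
    (0, 300, 50, 500, 400),
    (1, 0, 0, 199, 99),
    (1, 640, 120, 960, 560),
]


def group_rois_by_image(rois, num_images):
    """Split the flat RoI list into one list of boxes per image."""
    grouped = [[] for _ in range(num_images)]
    for image_index, *box in rois:
        grouped[image_index].append(tuple(box))
    return grouped


for index, boxes in enumerate(group_rois_by_image(rois, len(image_sizes))):
    print(index, image_sizes[index], boxes)
```

Keeping the RoIs in one flat list means N and R can change from batch to batch without changing the data layout.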
Fine-tuning
The paper covers several points:
- Multi-task loss
- Mini-batch sampling
- Back-propagation through RoI pooling layers
- SGD hyper-parameters
For the details, see the Chinese-language analysis in the blog linked above.
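As a small illustration of the first point, the multi-task loss in the Fast R-CNN paper adds a classification log loss to a smooth L1 localization loss, and the localization term is counted only for non-background RoIs. The sketch below follows that formulation for a single RoI. The function and parameter names are hypothetical, class 0 is taken to be background, and `lam` (the balancing weight) defaults to 1.

```python
import math


def smooth_l1(x):
    """Quadratic for |x| < 1, linear otherwise."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multi_task_loss(probs, true_class, pred_box, target_box, lam=1.0):
    """Loss for one RoI.

    probs: softmax probabilities over K + 1 classes (index 0 = background).
    pred_box: predicted 4 offsets for the true class.
    target_box: ground-truth 4 regression targets.
    """
    cls_loss = -math.log(probs[true_class])
    if true_class == 0:
        # Background RoIs have no ground-truth box, so no localization term.
        return cls_loss
    loc_loss = sum(smooth_l1(t - v) for t, v in zip(pred_box, target_box))
    return cls_loss + lam * loc_loss


probs = [0.1, 0.7, 0.2]
print(multi_task_loss(probs, 1, (0.5, 0.0, 2.0, 0.0), (0.0, 0.0, 0.0, 0.0)))
print(multi_task_loss(probs, 0, (0.5, 0.0, 2.0, 0.0), (0.0, 0.0, 0.0, 0.0)))
```

Because both terms are summed into one scalar, a single backward pass trains the classifier and the box regressor together, which is what makes single-stage training possible.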