[CV-Paper 04] Inception V1-2014 (Comprehensive)

Original paper: LINK
Paper year: 2014
Citations: 23544 (20/08/2020)

Table of contents

Going deeper with convolutions
Abstract
1 Introduction
2 Related Work
3 Motivation and High Level Considerations
4 Architectural Details
5 GoogLeNet
6 Training Methodology
7 ILSVRC 2014 Classification Challenge Setup and Results
8 ILSVRC 2014 Detection Challenge Setup and Results
9 Conclusions

Going deeper with convolutions

Abstract

We propose a deep convolutional neural network architecture codenamed Inception, which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.



1 Introduction

In the last three years, mainly due to the advances of deep learning, more concretely convolutional networks [10], the quality of image recognition and object detection has been progressing at a dramatic pace. One encouraging news is that most of this progress is not just the result of more powerful hardware, larger datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures. No new data sources were used, for example, by the top entries in the ILSVRC 2014 competition besides the classification dataset of the same competition for detection purposes. Our GoogLeNet submission to ILSVRC 2014 actually uses 12× fewer parameters than the winning architecture of Krizhevsky et al [9] from two years ago, while being significantly more accurate. The biggest gains in object-detection have not come from the utilization of deep networks alone or bigger models, but from the synergy of deep architectures and classical computer vision, like the R-CNN algorithm by Girshick et al [6].


Another notable factor is that with the ongoing traction of mobile and embedded computing, the efficiency of our algorithms – especially their power and memory use – gains importance. It is noteworthy that the considerations leading to the design of the deep architecture presented in this paper included this factor rather than having a sheer fixation on accuracy numbers. For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time, so that they do not end up being a purely academic curiosity, but could be put to real world use, even on large datasets, at a reasonable cost.


In this paper, we will focus on an efficient deep neural network architecture for computer vision, codenamed Inception, which derives its name from the Network in network paper by Lin et al [12] in conjunction with the famous “we need to go deeper” internet meme [1]. In our case, the word “deep” is used in two different meanings: first of all, in the sense that we introduce a new level of organization in the form of the “Inception module” and also in the more direct sense of increased network depth. In general, one can view the Inception model as a logical culmination of [12] while taking inspiration and guidance from the theoretical work by Arora et al [2]. The benefits of the architecture are experimentally verified on the ILSVRC 2014 classification and detection challenges, on which it significantly outperforms the current state of the art.


2 Related Work

Starting with LeNet-5 [10], convolutional neural networks (CNN) have typically had a standard structure – stacked convolutional layers (optionally followed by contrast normalization and maxpooling) are followed by one or more fully-connected layers. Variants of this basic design are prevalent in the image classification literature and have yielded the best results to-date on MNIST, CIFAR and most notably on the ImageNet classification challenge [9, 21]. For larger datasets such as Imagenet, the recent trend has been to increase the number of layers [12] and layer size [21, 14], while using dropout [7] to address the problem of overfitting.


Despite concerns that max-pooling layers result in loss of accurate spatial information, the same convolutional network architecture as [9] has also been successfully employed for localization [9, 14], object detection [6, 14, 18, 5] and human pose estimation [19]. Inspired by a neuroscience model of the primate visual cortex, Serre et al. [15] use a series of fixed Gabor filters of different sizes in order to handle multiple scales, similarly to the Inception model. However, contrary to the fixed 2-layer deep model of [15], all filters in the Inception model are learned. Furthermore, Inception layers are repeated many times, leading to a 22-layer deep model in the case of the GoogLeNet model.


Network-in-Network is an approach proposed by Lin et al. [12] in order to increase the representational power of neural networks. When applied to convolutional layers, the method could be viewed as additional 1×1 convolutional layers followed typically by the rectified linear activation [9]. This enables it to be easily integrated in the current CNN pipelines. We use this approach heavily in our architecture. However, in our setting, 1 × 1 convolutions have dual purpose: most critically, they are used mainly as dimension reduction modules to remove computational bottlenecks, that would otherwise limit the size of our networks. This allows for not just increasing the depth, but also the width of our networks without significant performance penalty.
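
To make the role of the 1×1 convolution concrete, here is a minimal pure-Python sketch. It treats a 1×1 convolution as the same small linear map applied independently at every spatial position, followed by a rectified linear activation. The function name `conv1x1_relu`, the toy feature map and the weights are hypothetical illustrations, not values from the paper.

```python
def conv1x1_relu(feature_map, weights, biases):
    """Apply a 1x1 convolution followed by ReLU.

    feature_map: H x W x C_in nested lists
    weights:     C_out x C_in nested lists (one row per output filter)
    biases:      list of C_out values
    """
    out = []
    for row in feature_map:
        out_row = []
        for pixel in row:  # pixel is the list of C_in channel values
            out_pixel = []
            for w, b in zip(weights, biases):
                z = sum(wi * xi for wi, xi in zip(w, pixel)) + b
                out_pixel.append(max(0.0, z))  # rectified linear activation
            out_row.append(out_pixel)
        out.append(out_row)
    return out


# Toy 2x2 feature map with 4 channels, reduced to 2 channels (made-up numbers).
fmap = [[[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]],
        [[-1.0, -2.0, -3.0, -4.0], [4.0, 3.0, 2.0, 1.0]]]
W = [[0.25, 0.25, 0.25, 0.25],   # filter 0: channel average
     [1.0, -1.0, 0.0, 0.0]]      # filter 1: difference of first two channels
b = [0.0, 0.0]

reduced = conv1x1_relu(fmap, W, b)
print(reduced[0][0], reduced[1][0])
```

The spatial size is unchanged while the channel count drops from 4 to 2, which is the dimension-reduction use described in the text.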


The current leading approach for object detection is the Regions with Convolutional Neural Networks (R-CNN) proposed by Girshick et al. [6]. R-CNN decomposes the overall detection problem into two subproblems: to first utilize low-level cues such as color and superpixel consistency for potential object proposals in a category-agnostic fashion, and to then use CNN classifiers to identify object categories at those locations. Such a two stage approach leverages the accuracy of bounding box segmentation with low-level cues, as well as the highly powerful classification power of state-of-the-art CNNs. We adopted a similar pipeline in our detection submissions, but have explored enhancements in both stages, such as multi-box [5] prediction for higher object bounding box recall, and ensemble approaches for better categorization of bounding box proposals.


3 Motivation and High Level Considerations

The most straightforward way of improving the performance of deep neural networks is by increasing their size. This includes both increasing the depth – the number of levels – of the network and its width: the number of units at each level. This is an easy and safe way of training higher quality models, especially given the availability of a large amount of labeled training data. However this simple solution comes with two major drawbacks.


Bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting, especially if the number of labeled examples in the training set is limited. This can become a major bottleneck, since the creation of high quality training sets can be tricky and expensive, especially if expert human raters are necessary to distinguish between fine-grained visual categories like those in ImageNet (even in the 1000-class ILSVRC subset) as demonstrated by Figure 1.


Another drawback of uniformly increased network size is the dramatically increased use of computational resources. For example, in a deep vision network, if two convolutional layers are chained, any uniform increase in the number of their filters results in a quadratic increase of computation. If the added capacity is used inefficiently (for example, if most weights end up to be close to zero), then a lot of computation is wasted. Since in practice the computational budget is always finite, an efficient distribution of computing resources is preferred to an indiscriminate increase of size, even when the main objective is to increase the quality of results.
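
The quadratic growth can be checked with a few lines of arithmetic. The sketch below counts multiply-adds for the second of two chained convolutional layers when both layers are widened by the same factor. The 28×28 grid, the 3×3 kernel and the base width of 64 filters are illustrative assumptions, and the helper names are hypothetical.

```python
def conv_multiply_adds(height, width, kernel, c_in, c_out):
    """Multiply-adds of a stride-1 convolution that preserves the grid size."""
    return height * width * kernel * kernel * c_in * c_out


def chained_cost(filters, scale=1):
    """Cost of the second of two chained 3x3 layers.

    Both layers have `filters * scale` filters, so the second layer sees
    `filters * scale` input channels and produces `filters * scale` outputs.
    """
    n = filters * scale
    return conv_multiply_adds(28, 28, 3, n, n)


base = chained_cost(64)
doubled = chained_cost(64, scale=2)
tripled = chained_cost(64, scale=3)
print(doubled // base, tripled // base)
```

Doubling the width of both layers multiplies the cost of the second layer by four, and tripling it by nine, because the filter count enters once as the number of input channels and once as the number of output channels.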


The fundamental way of solving both issues would be by ultimately moving from fully connected to sparsely connected architectures, even inside the convolutions. Besides mimicking biological systems, this would also have the advantage of firmer theoretical underpinnings due to the groundbreaking work of Arora et al. [2]. Their main result states that if the probability distribution of the data-set is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs. Although the strict mathematical proof requires very strong conditions, the fact that this statement resonates with the well known Hebbian principle – neurons that fire together, wire together – suggests that the underlying idea is applicable even under less strict conditions, in practice.


On the downside, today's computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures. Even if the number of arithmetic operations is reduced by 100×, the overhead of lookups and cache misses is so dominant that switching to sparse matrices would not pay off. The gap is widened even further by the use of steadily improving, highly tuned, numerical libraries that allow for extremely fast dense matrix multiplication, exploiting the minute details of the underlying CPU or GPU hardware [16, 9]. Also, non-uniform sparse models require more sophisticated engineering and computing infrastructure. Most current vision oriented machine learning systems utilize sparsity in the spatial domain just by the virtue of employing convolutions. However, convolutions are implemented as collections of dense connections to the patches in the earlier layer. ConvNets have traditionally used random and sparse connection tables in the feature dimensions since [11] in order to break the symmetry and improve learning, but the trend changed back to full connections with [9] in order to better optimize parallel computing. The uniformity of the structure and a large number of filters and greater batch size allow for utilizing efficient dense computation.


This raises the question whether there is any hope for a next, intermediate step: an architecture that makes use of the extra sparsity, even at filter level, as suggested by the theory, but exploits our current hardware by utilizing computations on dense matrices. The vast literature on sparse matrix computations (e.g. [3]) suggests that clustering sparse matrices into relatively dense submatrices tends to give state of the art practical performance for sparse matrix multiplication. It does not seem far-fetched to think that similar methods would be utilized for the automated construction of non-uniform deep-learning architectures in the near future.


The Inception architecture started out as a case study of the first author for assessing the hypothetical output of a sophisticated network topology construction algorithm that tries to approximate a sparse structure implied by [2] for vision networks and covering the hypothesized outcome by dense, readily available components. Despite being a highly speculative undertaking, only after two iterations on the exact choice of topology, we could already see modest gains against the reference architecture based on [12]. After further tuning of learning rate, hyperparameters and improved training methodology, we established that the resulting Inception architecture was especially useful in the context of localization and object detection as the base network for [6] and [5]. Interestingly, while most of the original architectural choices have been questioned and tested thoroughly, they turned out to be at least locally optimal.


One must be cautious though: although the proposed architecture has become a success for computer vision, it is still questionable whether its quality can be attributed to the guiding principles that have led to its construction. Making sure would require much more thorough analysis and verification: for example, if automated tools based on the principles described below would find similar, but better topology for the vision networks. The most convincing proof would be if an automated system would create network topologies resulting in similar gains in other domains using the same algorithm but with very differently looking global architecture. At the very least, the initial success of the Inception architecture yields firm motivation for exciting future work in this direction.


4 Architectural Details

The main idea of the Inception architecture is based on finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components. Note that assuming translation invariance means that our network will be built from convolutional building blocks. All we need is to find the optimal local construction and to repeat it spatially. Arora et al. [2] suggest a layer-by-layer construction in which one should analyze the correlation statistics of the last layer and cluster them into groups of units with high correlation. These clusters form the units of the next layer and are connected to the units in the previous layer. We assume that each unit from the earlier layer corresponds to some region of the input image and these units are grouped into filter banks. In the lower layers (the ones close to the input) correlated units would concentrate in local regions. This means, we would end up with a lot of clusters concentrated in a single region and they can be covered by a layer of 1×1 convolutions in the next layer, as suggested in [12]. However, one can also expect that there will be a smaller number of more spatially spread out clusters that can be covered by convolutions over larger patches, and there will be a decreasing number of patches over larger and larger regions. In order to avoid patch-alignment issues, current incarnations of the Inception architecture are restricted to filter sizes 1×1, 3×3 and 5×5, however this decision was based more on convenience rather than necessity. It also means that the suggested architecture is a combination of all those layers with their output filter banks concatenated into a single output vector forming the input of the next stage. Additionally, since pooling operations have been essential for the success in current state of the art convolutional networks, it suggests that adding an alternative parallel pooling path in each such stage should have additional beneficial effect, too (see Figure 2(a)).


As these “Inception modules” are stacked on top of each other, their output correlation statistics are bound to vary: as features of higher abstraction are captured by higher layers, their spatial concentration is expected to decrease suggesting that the ratio of 3×3 and 5×5 convolutions should increase as we move to higher layers.


One big problem with the above modules, at least in this naive form, is that even a modest number of 5×5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters. This problem becomes even more pronounced once pooling units are added to the mix: their number of output filters equals to the number of filters in the previous stage. The merging of the output of the pooling layer with the outputs of convolutional layers would lead to an inevitable increase in the number of outputs from stage to stage. Even while this architecture might cover the optimal sparse structure, it would do it very inefficiently, leading to a computational blow up within a few stages.
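
The stage-to-stage growth can be illustrated with a small calculation. In the naive module the pooling path passes every input channel through unchanged, so the concatenated output is always wider than the input. The filter counts (64, 128, 32), the starting width of 192 and the 28×28 grid below are illustrative assumptions, and the helper names are hypothetical.

```python
def naive_output_channels(c_in, n1x1, n3x3, n5x5):
    """Channels after concatenating the four paths of a naive module.

    The pooling path has no projection, so it contributes all c_in channels.
    """
    return n1x1 + n3x3 + n5x5 + c_in


def stack_naive_modules(c_in, stages, n1x1=64, n3x3=128, n5x5=32):
    """Return the channel count at the input of each stage plus the final output."""
    widths = [c_in]
    for _ in range(stages):
        widths.append(naive_output_channels(widths[-1], n1x1, n3x3, n5x5))
    return widths


widths = stack_naive_modules(192, stages=4)

# Multiply-adds of the 5x5 path alone at each stage (28x28 grid, 32 filters).
cost_5x5 = [28 * 28 * 5 * 5 * c * 32 for c in widths[:-1]]
print(widths)
print(cost_5x5[-1] / cost_5x5[0])
```

Even with the number of convolutional filters held fixed, the input width grows at every stage, and the cost of the 5×5 path grows with it.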


This leads to the second idea of the proposed architecture: judiciously applying dimension reductions and projections wherever the computational requirements would increase too much otherwise. This is based on the success of embeddings: even low dimensional embeddings might contain a lot of information about a relatively large image patch. However, embeddings represent information in a dense, compressed form and compressed information is harder to model. We would like to keep our representation sparse at most places (as required by the conditions of [2]) and compress the signals only whenever they have to be aggregated en masse. That is, 1×1 convolutions are used to compute reductions before the expensive 3×3 and 5×5 convolutions. Besides being used as reductions, they also include the use of rectified linear activation which makes them dual-purpose. The final result is depicted in Figure 2(b).
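
A rough cost comparison shows why the 1×1 reduction pays off. The sketch counts multiply-adds for a 5×5 path applied directly to a wide input and for the same path preceded by a 1×1 reduction. The grid size, channel counts and filter counts are illustrative assumptions (Table 1 is not reproduced in this text), and the helper names are hypothetical.

```python
def conv_madds(h, w, k, c_in, c_out):
    """Multiply-adds of a stride-1 k x k convolution that preserves the grid size."""
    return h * w * k * k * c_in * c_out


H = W = 28          # assumed spatial size of the stage
C_IN = 192          # assumed number of input channels
N_5X5 = 32          # assumed number of 5x5 filters
N_5X5_REDUCE = 16   # assumed number of 1x1 filters in the reduction layer

# 5x5 convolution applied directly to all input channels.
direct = conv_madds(H, W, 5, C_IN, N_5X5)

# 1x1 reduction first, then the 5x5 convolution on the reduced representation.
reduced = (conv_madds(H, W, 1, C_IN, N_5X5_REDUCE)
           + conv_madds(H, W, 5, N_5X5_REDUCE, N_5X5))


def module_output_channels(n1x1, n3x3, n5x5, pool_proj):
    """With a 1x1 projection after pooling, the output width is a design choice."""
    return n1x1 + n3x3 + n5x5 + pool_proj


print(direct, reduced, round(direct / reduced, 2))
print(module_output_channels(64, 128, 32, 32))
```

Under these assumed sizes the reduced path needs roughly a tenth of the multiply-adds of the direct one, and the pooling projection keeps the output width independent of the input width.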

In general, an Inception network is a network consisting of modules of the above type stacked upon each other, with occasional maxpooling layers with stride 2 to halve the resolution of the grid. For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion. This is not strictly necessary, simply reflecting some infrastructural inefficiencies in our current implementation.


One of the main beneficial aspects of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity. The ubiquitous use of dimension reduction allows for shielding the large number of input filters of the last stage to the next layer, first reducing their dimension before convolving over them with a large patch size. Another practically useful aspect of this design is that it aligns with the intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from different scales simultaneously.


The improved use of computational resources allows for increasing both the width of each stage as well as the number of stages without getting into computational difficulties. Another way to utilize the inception architecture is to create slightly inferior, but computationally cheaper versions of it. We have found that all the included knobs and levers allow for a controlled balancing of computational resources that can result in networks that are 2-3× faster than similarly performing networks with non-Inception architecture, however this requires careful manual design at this point.


5 GoogLeNet

We chose GoogLeNet as our team-name in the ILSVRC14 competition. This name is an homage to Yann LeCun's pioneering LeNet 5 network [10]. We also use GoogLeNet to refer to the particular incarnation of the Inception architecture used in our submission for the competition. We have also used a deeper and wider Inception network, the quality of which was slightly inferior, but adding it to the ensemble seemed to improve the results marginally. We omit the details of that network, since our experiments have shown that the influence of the exact architectural parameters is relatively minor. Here, the most successful particular instance (named GoogLeNet) is described in Table 1 for demonstrational purposes. The exact same topology (trained with different sampling methods) was used for 6 out of the 7 models in our ensemble.


All the convolutions, including those inside the Inception modules, use rectified linear activation. The size of the receptive field in our network is 224×224 taking RGB color channels with mean subtraction. “#3×3 reduce” and “#5×5 reduce” stand for the number of 1×1 filters in the reduction layer used before the 3×3 and 5×5 convolutions. One can see the number of 1×1 filters in the projection layer after the built-in max-pooling in the pool proj column. All these reduction/projection layers use rectified linear activation as well.


The network was designed with computational efficiency and practicality in mind, so that inference can be run on individual devices including even those with limited computational resources, especially with low-memory footprint. The network is 22 layers deep when counting only layers with parameters (or 27 layers if we also count pooling). The overall number of layers (independent building blocks) used for the construction of the network is about 100. However this number depends on the machine learning infrastructure system used. The use of average pooling before the classifier is based on [12], although our implementation differs in that we use an extra linear layer. This enables adapting and fine-tuning our networks for other label sets easily, but it is mostly convenience and we do not expect it to have a major effect. It was found that a move from fully connected layers to average pooling improved the top-1 accuracy by about 0.6%, however the use of dropout remained essential even after removing the fully connected layers.


Given the relatively large depth of the network, the ability to propagate gradients back through all the layers in an effective manner was a concern. One interesting insight is that the strong performance of relatively shallower networks on this task suggests that the features produced by the layers in the middle of the network should be very discriminative. By adding auxiliary classifiers connected to these intermediate layers, we would expect to encourage discrimination in the lower stages in the classifier, increase the gradient signal that gets propagated back, and provide additional regularization. These classifiers take the form of smaller convolutional networks put on top of the output of the Inception (4a) and (4d) modules. During training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3). At inference time, these auxiliary networks are discarded.
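
The loss weighting described above amounts to a one-line formula. The sketch below assumes the two auxiliary losses are simply summed before applying the 0.3 discount. The function names are hypothetical and the loss values are made-up numbers.

```python
AUX_WEIGHT = 0.3  # discount weight for the auxiliary classifier losses


def training_loss(main_loss, aux_losses, aux_weight=AUX_WEIGHT):
    """Total training loss: main loss plus the discounted auxiliary losses."""
    return main_loss + aux_weight * sum(aux_losses)


def inference_output(main_logits, aux_logits=None):
    """At inference time the auxiliary heads are discarded."""
    return main_logits


# Made-up loss values for the main head and the two auxiliary heads.
total = training_loss(2.0, [1.0, 3.0])
print(total)
```

Only the training objective changes. The prediction path at inference time uses the main classifier alone.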


The exact structure of the extra network on the side, including the auxiliary classifier, is as follows:

An average pooling layer with 5×5 filter size and stride 3, resulting in a 4×4×512 output for the (4a), and 4×4×528 for the (4d) stage.
A 1×1 convolution with 128 filters for dimension reduction and rectified linear activation.
A fully connected layer with 1024 units and rectified linear activation.
A dropout layer with 70% ratio of dropped outputs.
A linear layer with softmax loss as the classifier (predicting the same 1000 classes as the main classifier, but removed at inference time).
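
The layer list above can be traced shape by shape. The sketch assumes a 14×14 input grid and unpadded pooling. Neither is stated in the text, but together they reproduce the 4×4 output quoted for the pooling layer. Parameter counts include biases, and the helper names are hypothetical.

```python
def pool_output_size(size, kernel, stride):
    """Output size of an unpadded pooling window sliding over `size` positions."""
    return (size - kernel) // stride + 1


def aux_classifier_summary(spatial, channels, num_classes=1000):
    """Return (layer name, output shape, parameter count) for each layer."""
    pooled = pool_output_size(spatial, kernel=5, stride=3)
    flat = pooled * pooled * 128
    return [
        ("avg_pool 5x5, stride 3", (pooled, pooled, channels), 0),
        ("conv 1x1, 128 filters + ReLU", (pooled, pooled, 128), channels * 128 + 128),
        ("fully connected 1024 + ReLU", (1024,), flat * 1024 + 1024),
        ("dropout, 70% dropped", (1024,), 0),
        ("linear + softmax", (num_classes,), 1024 * num_classes + num_classes),
    ]


# Assumed 14x14 input grid; 512 channels for stage (4a), 528 for stage (4d).
for name, channels in (("4a", 512), ("4d", 528)):
    layers = aux_classifier_summary(14, channels)
    total = sum(params for _, _, params in layers)
    print(name, layers[0][1], total)
```

Most of the parameters sit in the fully connected layer. The 1×1 convolution keeps that layer's input at 4×4×128 regardless of the stage width.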

A schematic view of the resulting network is depicted in Figure 3.
Figure 3: GoogLeNet network with all the bells and whistles

6 Training Methodology

Our networks were trained using the DistBelief [4] distributed machine learning system using modest amount of model and data-parallelism. Although we used CPU based implementation only, a rough estimate suggests that the GoogLeNet network could be trained to convergence using few high-end GPUs within a week, the main limitation being the memory usage. Our training used asynchronous stochastic gradient descent with 0.9 momentum [17], fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs). Polyak averaging [13] was used to create the final model used at inference time.
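
The schedule and the averaging step can be sketched as follows. The base learning rate of 0.01 is an assumption, since the text does not give one. The text also does not say which form of Polyak averaging was used, so a uniform running mean over parameter snapshots is assumed here. All names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, drop=0.04, every=8):
    """Fixed schedule: multiply the rate by (1 - drop) once every `every` epochs."""
    return base_lr * (1.0 - drop) ** (epoch // every)


class PolyakAverage:
    """Uniform running mean of parameter vectors seen during training."""

    def __init__(self):
        self.count = 0
        self.mean = None

    def update(self, params):
        self.count += 1
        if self.mean is None:
            self.mean = list(params)
        else:
            # Incremental mean: m <- m + (p - m) / n
            self.mean = [m + (p - m) / self.count
                         for m, p in zip(self.mean, params)]
        return self.mean


print([round(learning_rate(e), 6) for e in (0, 7, 8, 16)])

avg = PolyakAverage()
avg.update([1.0, 2.0])
averaged = avg.update([3.0, 4.0])
print(averaged)
```

The averaged parameters, rather than the last iterate, would then be used at inference time.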


Our image sampling methods have changed substantially over the months leading to the competition, and already converged models were trained on with other options, sometimes in conjunction with changed hyperparameters, like dropout and learning rate, so it is hard to give a definitive guidance to the most effective single way to train these networks. To complicate matters further, some of the models were mainly trained on smaller relative crops, others on larger ones, inspired by [8]. Still, one prescription that was verified to work very well after the competition includes sampling of various sized patches of the image whose size is distributed evenly between 8% and 100% of the image area and whose aspect ratio is chosen randomly between 3/4 and 4/3. Also, we found that the photometric distortions by Andrew Howard [8] were useful to combat overfitting to some extent. In addition, we started to use random interpolation methods (bilinear, area, nearest neighbor and cubic, with equal probability) for resizing relatively late and in conjunction with other hyperparameter changes, so we could not tell definitely whether the final results were affected positively by their use.
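
The patch-sampling prescription can be sketched with the standard library. The sketch assumes both the area fraction and the aspect ratio are drawn uniformly. It also retries when a sampled patch does not fit and falls back to the full image after a fixed number of attempts. The retry and fallback are assumptions added here, not details from the text, and the function name is hypothetical.

```python
import math
import random


def sample_crop(width, height, rng, attempts=10):
    """Sample a patch as (left, top, crop_width, crop_height).

    Area is drawn uniformly from 8% to 100% of the image area and the
    aspect ratio (width / height) uniformly from 3/4 to 4/3.
    """
    area = width * height
    for _ in range(attempts):
        target_area = rng.uniform(0.08, 1.0) * area
        aspect = rng.uniform(3.0 / 4.0, 4.0 / 3.0)
        crop_w = int(round(math.sqrt(target_area * aspect)))
        crop_h = int(round(math.sqrt(target_area / aspect)))
        if 0 < crop_w <= width and 0 < crop_h <= height:
            left = rng.randint(0, width - crop_w)
            top = rng.randint(0, height - crop_h)
            return left, top, crop_w, crop_h
    # Assumed fallback: use the whole image if no sampled patch fits.
    return 0, 0, width, height


rng = random.Random(0)
samples = [sample_crop(500, 375, rng) for _ in range(200)]
print(samples[0])
```

Each sampled patch would then be resized to the network input size before being fed to training.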


7 ILSVRC 2014 Classification Challenge Setup and Results

The ILSVRC 2014 classification challenge involves the task of classifying an image into one of 1000 leaf-node categories in the ImageNet hierarchy. There are about 1.2 million images for training, 50,000 for validation and 100,000 images for testing. Each image is associated with one ground truth category, and performance is measured based on the highest scoring classifier predictions. Two numbers are usually reported: the top-1 accuracy rate, which compares the ground truth against the first predicted class, and the top-5 error rate, which compares the ground truth against the first 5 predicted classes: an image is deemed correctly classified if the ground truth is among the top-5, regardless of its rank in them. The challenge uses the top-5 error rate for ranking purposes.

ILSVRC 2014分类挑战涉及将图像分类为Imagenet层次结构中1000个叶节点类别之一的任务。大约有120万张图像用于训练，50000张图像用于验证，100000张图像用于测试。每幅图像都有真实类别标签，并且根据得分最高的分类器预测来衡量性能。通常报告两个数字：top-1准确率，将真实标签与第一个预测类进行比较；top-5错误率，将真实标签与前5个预测类进行比较：如果真实标签在top-5之中，则无论其排名如何，图像被视为正确分类。该挑战赛使用top-5错误率进行排名。
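
As a small illustration of the metric defined above, the sketch below computes a top-k error rate from per-class scores. The function name `top_k_error` and the list-of-lists input layout are assumptions made for this example, not part of the challenge tooling.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of images whose true label is not among the k highest scores.

    scores: one list of per-class scores per image.
    labels: the ground truth class index of each image.
    """
    wrong = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            wrong += 1
    return wrong / len(labels)
```

With `k=1` this gives the top-1 error (one minus the top-1 accuracy); with `k=5` it gives the top-5 error used for ranking.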

We participated in the challenge with no external data used for training. In addition to the training techniques mentioned earlier in this paper, we adopted a set of techniques during testing to obtain higher performance, which we elaborate on below.

我们参加了挑战，没有用于训练的额外数据。除了本文前面提到的训练技术外，我们在测试过程中采用了一组技术来获得更高的性能，我们将在下面详细介绍。

We independently trained 7 versions of the same GoogLeNet model (including one wider version), and performed ensemble prediction with them. These models were trained with the same initialization (even with the same initial weights, mainly because of an oversight) and learning rate policies, and they only differ in sampling methodologies and the random order in which they see input images.

我们独立训练了同一GoogLeNet模型的7个版本（包括一个更广泛的版本），并对其进行了整体预测。这些模型以相同的初始化（即使是相同的初始权重，主要是由于疏忽而定）和学习速率策略进行了训练，并且它们的区别仅在于采样方法和它们看到输入图像的随机顺序。

During testing, we adopted a more aggressive cropping approach than that of Krizhevsky et al. [9]. Specifically, we resize the image to 4 scales where the shorter dimension (height or width) is 256, 288, 320 and 352 respectively, take the left, center and right square of these resized images (in the case of portrait images, we take the top, center and bottom squares). For each square, we then take the 4 corners and the center 224×224 crop as well as the square resized to 224×224, and their mirrored versions. This results in 4×3×6×2 = 144 crops per image. A similar approach was used by Andrew Howard [8] in the previous year’s entry, which we empirically verified to perform slightly worse than the proposed scheme. We note that such aggressive cropping may not be necessary in real applications, as the benefit of more crops becomes marginal after a reasonable number of crops are present (as we will show later on).

在测试过程中，我们采用了比Krizhevsky等人[9]更积极的裁剪方法。具体来说，我们将图像调整为4个比例，其中较短的尺寸（高度或宽度）分别为256、288、320和352，取这些调整后图像的左，中和右正方形（对于人像，则取顶部，中央和底部正方形）。然后，对于每个正方形，我们采用4个角和中心224×224裁剪，以及将正方形调整为224×224的大小及其镜像版本。这样一来，每张图像将产生4×3×6×2 = 144个裁剪。Andrew Howard[8]在去年的论文中也使用了类似的方法，我们通过实证证明，该方法的效果比我们提出的方案稍差。我们注意到，在实际应用中可能不必进行这种激进的裁剪，因为在存在合理数量的裁剪之后，更多裁剪的收益变得微不足道（如我们稍后将展示）。

The softmax probabilities are averaged over multiple crops and over all the individual classifiers to obtain the final prediction. In our experiments we analyzed alternative approaches on the validation data, such as max pooling over crops and averaging over classifiers, but they lead to inferior performance than the simple averaging.

将softmax概率在多个裁剪和所有单个分类器上取平均，以获得最终预测。在我们的实验中，我们分析了验证数据的替代方法，例如裁剪的最大池化和对分类器的平均，但与简单平均相比，它们的性能较差。
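
The test-time cropping scheme and the simple averaging described above can be sketched as follows. The code only enumerates crop geometry (no pixels are touched) to show where the count 4×3×6×2 = 144 comes from. The function names, the tuple layout, and the use of integer rounding for the resized side are assumptions made for this illustration.

```python
def enumerate_test_views(width, height, scales=(256, 288, 320, 352), crop=224):
    """List (scale, square, crop_box, mirrored) tuples for one image.

    square is (left, top, side) in the resized image. crop_box is
    (left, top, size) in the same coordinates, or None when the whole
    square is resized to crop x crop instead of being cropped.
    """
    views = []
    for s in scales:
        # Resize so the shorter side equals s, keeping the aspect ratio.
        if width <= height:
            w, h = s, round(height * s / width)
        else:
            w, h = round(width * s / height), s
        span = max(w, h) - s
        # Three s x s squares along the longer side: start, middle, end.
        for off in (0, span // 2, span):
            left, top = (off, 0) if w > h else (0, off)
            d = s - crop
            crops = [
                (left, top, crop), (left + d, top, crop),          # top corners
                (left, top + d, crop), (left + d, top + d, crop),  # bottom corners
                (left + d // 2, top + d // 2, crop),               # center
                None,                                              # resized square
            ]
            for c in crops:
                for mirrored in (False, True):
                    views.append((s, (left, top, s), c, mirrored))
    return views


def average_probabilities(prob_lists):
    """Element-wise mean of several equal-length probability vectors."""
    n = len(prob_lists)
    return [sum(col) / n for col in zip(*prob_lists)]
```

For example, `len(enumerate_test_views(500, 375))` is 144. In this sketch, `average_probabilities` would be applied to the softmax outputs of every crop from every model, and the class with the highest averaged probability would be taken as the final prediction.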

In the remainder of this paper, we analyze the multiple factors that contribute to the overall performance of the final submission.

在本文的其余部分，我们分析了影响最终提交书整体性能的多种因素。

Our final submission in the challenge obtains a top-5 error of 6.67% on both the validation and testing data, ranking first among the participants. This is a 56.5% relative reduction compared to the SuperVision approach in 2012, and about 40% relative reduction compared to the previous year’s best approach (Clarifai), both of which used external data for training the classifiers. The following table shows the statistics of some of the top-performing approaches.

我们在挑战赛中的最终提交在验证和测试数据上均获得6.67％的top-5错误率，在其它参与者中排名第一。与2012年的SuperVision方法相比，相对减少了56.5％，与上一年的最佳方法（Clarifai）相比，减少了约40％，二者均使用外部数据来训练分类器。下表显示了一些效果最好的方法的统计信息。
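
The relative reductions quoted above follow from a simple ratio, sketched below. The baseline top-5 error rates used here (15.3% for SuperVision and 11.2% for Clarifai, both with external data) are assumptions taken from the results table of the original paper, which is not reproduced in this section; the function name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline_error, new_error):
    """Relative error reduction: (baseline - new) / baseline."""
    return (baseline_error - new_error) / baseline_error


googlenet = 6.67          # top-5 error (%), stated in the text
supervision_2012 = 15.3   # assumed baseline, from the paper's results table
clarifai_2013 = 11.2      # assumed baseline, from the paper's results table

vs_supervision = relative_reduction(supervision_2012, googlenet)
vs_clarifai = relative_reduction(clarifai_2013, googlenet)
```

Under these assumed baselines, `vs_supervision` comes out at about 0.564 and `vs_clarifai` at about 0.404, in line with the quoted 56.5% and "about 40%" up to rounding.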

We also analyze and report the performance of multiple testing choices in the following table, by varying the number of models and the number of crops used when predicting an image. When we use one model, we choose the one with the lowest top-1 error rate on the validation data. All numbers are reported on the validation dataset in order to not overfit to the testing data statistics.

我们还通过更改下表中预测图像时的模型数量和使用的裁剪数量，来分析和报告多种测试选择的性能。当我们使用一种模型时，我们选择了验证数据上的top-1错误率最低的模型。所有数字均报告在验证数据集上，以免过度拟合测试数据统计信息。

8 ILSVRC 2014 Detection Challenge Setup and Results

The ILSVRC detection task is to produce bounding boxes around objects in images among 200 possible classes. Detected objects count as correct if they match the class of the groundtruth and their bounding boxes overlap by at least 50% (using the Jaccard index). Extraneous detections count as false positives and are penalized. Contrary to the classification task, each image may contain many objects or none, and their scale may vary from large to tiny. Results are reported using the mean average precision (mAP).

ILSVRC 目标检测任务是在200种可能的类别中的图像中的对象周围生成边界框。如果检测到的目标与真实标签类别匹配并且其边界框重叠至少50％（使用Jaccard索引），则算为正确。无关检测会被视为误报，并会受到处罚。与分类任务相反，每个图像可能包含许多对象或不包含任何对象，并且其比例可能从大到小变化。使用平均平均精度（mean average precision，mAP）报告结果。
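
The overlap criterion above can be illustrated with a minimal sketch of the Jaccard index (intersection over union) for axis-aligned boxes. The `(x1, y1, x2, y2)` box layout and the function names are assumptions made for this example, not the official evaluation code.

```python
def jaccard_index(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(pred_box, pred_class, gt_box, gt_class, threshold=0.5):
    """A detection counts if the class matches and the overlap is at least 50%."""
    return pred_class == gt_class and jaccard_index(pred_box, gt_box) >= threshold
```

For instance, two 10×10 boxes shifted by half their width overlap with a Jaccard index of 1/3 and would not count as a match at the 50% threshold.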

The approach taken by GoogLeNet for detection is similar to the R-CNN by [6], but is augmented with the Inception model as the region classifier. Additionally, the region proposal step is improved by combining the Selective Search [20] approach with multi-box [5] predictions for higher object bounding box recall. In order to cut down the number of false positives, the superpixel size was increased by 2×. This halves the proposals coming from the selective search algorithm. We added back 200 region proposals coming from multi-box [5] resulting, in total, in about 60% of the proposals used by [6], while increasing the coverage from 92% to 93%. The overall effect of cutting the number of proposals with increased coverage is a 1% improvement of the mean average precision for the single model case. Finally, we use an ensemble of 6 ConvNets when classifying each region which improves results from 40% to 43.9% accuracy. Note that contrary to R-CNN, we did not use bounding box regression due to lack of time.

GoogLeNet用于检测的方法与[6]中的R-CNN相似，但是使用Inception模型作为区域分类器进行了扩充。此外，通过将选择性搜索[20]方法与多框[5]预测相结合，可以提高区域提议步骤，从而实现更高的对象边界框召回率。为了减少误报的数量，超像素尺寸增加了2倍。这将来自选择性搜索算法的建议减半。我们增加了200个来自多框[5]的区域提案，总共占[6]使用的提案的约60％，而覆盖率则从92％扩大到93％。减少提案数量并增加覆盖范围的总体效果是，单个模型案例的平均平均精度提高了1％。最后，在对每个区域进行分类时，我们使用6个ConvNet的集合，将结果的准确度从40％提高到43.9％。请注意，与R-CNN相反，由于缺乏时间，我们没有使用边界框回归。

We first report the top detection results and show the progress since the first edition of the detection task. Compared to the 2013 result, the accuracy has almost doubled. The top performing teams all use Convolutional Networks. We report the official scores in Table 4 and common strategies for each team: the use of external data, ensemble models or contextual models. The external data is typically the ILSVRC12 classification data for pre-training a model that is later refined on the detection data. Some teams also mention the use of the localization data. Since a good portion of the localization task bounding boxes are not included in the detection dataset, one can pre-train a general bounding box regressor with this data the same way classification is used for pre-training. The GoogLeNet entry did not use the localization data for pretraining.

我们首先报告最高的检测结果，并显示自第一版检测任务以来的进度。与2013年的结果相比，准确性几乎提高了一倍。表现最好的团队都使用卷积网络。我们在表4中报告了官方成绩以及每个团队的共同策略：使用外部数据，整体模型或上下文模型。外部数据通常是用于预先训练模型的ILSVRC12分类数据，该模型随后将根据检测数据进行完善。一些团队还提到了本地化数据的使用。由于检测数据集中未包含很大一部分的本地化任务边界框，因此可以使用此数据对通用边界框回归器进行预训练，就像使用分类进行预训练一样。 GoogLeNet条目未使用本地化数据进行预培训。

In Table 5, we compare results using a single model only. The top performing model is by Deep Insight and surprisingly only improves by 0.3 points with an ensemble of 3 models while the GoogLeNet obtains significantly stronger results with the ensemble.

在表5中，我们仅使用单个模型比较结果。表现最好的模型是Deep Insight提供的，令人惊讶的是，只有3个模型的组合才提高了0.3点，而GoogLeNet的组合却获得了明显更强的结果。

9 Conclusions

Our results seem to yield solid evidence that approximating the expected optimal sparse structure by readily available dense building blocks is a viable method for improving neural networks for computer vision. The main advantage of this method is a significant quality gain at a modest increase of computational requirements compared to shallower and less wide networks. Also note that our detection work was competitive despite neither utilizing context nor performing bounding box regression, and this fact provides further evidence of the strength of the Inception architecture. Although it is expected that similar quality of result can be achieved by much more expensive networks of similar depth and width, our approach yields solid evidence that moving to sparser architectures is a feasible and useful idea in general. This suggests promising future work towards creating sparser and more refined structures in automated ways on the basis of [2].

我们的结果似乎提供了有力的证据，即通过随时可用的密集构造块来近似预期的最佳稀疏结构，是改善计算机视觉神经网络的可行方法。与较浅和较不宽泛的网络相比，此方法的主要优点是在计算需求适度增加的情况下可显着提高质量。还要注意，尽管我们既没有利用上下文也没有执行边界框回归，但我们的检测工作具有竞争性，这一事实进一步证明了Inception体系结构的实力。尽管可以预期，通过深度和宽度相近的昂贵得多的网络可以实现类似的结果质量，但是我们的方法得出的确凿证据表明，转向稀疏结构通常是可行且有用的想法。这表明在[2]的基础上，有希望的未来工作以自动化方式创建稀疏和更精细的结构。