
【CV-Paper 06】Inception V3-2015

Popularity: 31   Published: 2024-02-13 14:15:45.0

Paper: LINK
Year: 2015
Citations: 9190 (as of 21/08/2020)


Contents

  • Rethinking the Inception Architecture for Computer Vision
  • Abstract
  • 1. Introduction
  • 2. General Design Principles
  • 3. Factorizing Convolutions with Large Filter Size
    • 3.1. Factorization into smaller convolutions
    • 3.2. Spatial Factorization into Asymmetric Convolutions
  • 4. Utility of Auxiliary Classifiers
  • 5. Efficient Grid Size Reduction
  • 6. Inception-v2
  • 7. Model Regularization via Label Smoothing
  • 8. Training Methodology
  • 9. Performance on Lower Resolution Input
  • 10. Experimental Results and Comparisons
  • 11. Conclusions


Rethinking the Inception Architecture for Computer Vision

Abstract

Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we explore ways to scale up networks that aim at utilizing the added computation as efficiently as possible through suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set and demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single-frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and fewer than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error and 17.3% top-1 error.


1. Introduction

Since the 2012 ImageNet competition [16] winning entry by Krizhevsky et al [9], their network “AlexNet” has been successfully applied to a larger variety of computer vision tasks, for example to object-detection [5], segmentation [12], human pose estimation [22], video classification [8], object tracking [23], and superresolution [3].


These successes spurred a new line of research that focused on finding higher performing convolutional neural networks. Starting in 2014, the quality of network architectures significantly improved by utilizing deeper and wider networks. VGGNet [18] and GoogLeNet [20] yielded similarly high performance in the 2014 ILSVRC [16] classification challenge. One interesting observation was that gains in the classification performance tend to transfer to significant quality gains in a wide variety of application domains. This means that architectural improvements in deep convolutional architecture can be utilized for improving performance for most other computer vision tasks that are increasingly reliant on high quality, learned visual features. Also, improvements in the network quality resulted in new application domains for convolutional networks in cases where AlexNet features could not compete with hand engineered, crafted solutions, e.g. proposal generation in detection[4].


Although VGGNet [18] has the compelling feature of architectural simplicity, this comes at a high cost: evaluating the network requires a lot of computation. On the other hand, the Inception architecture of GoogLeNet [20] was also designed to perform well even under strict constraints on memory and computational budget. For example, GoogLeNet employed only 5 million parameters, which represented a 12× reduction with respect to its predecessor AlexNet, which used 60 million parameters. Furthermore, VGGNet employed about 3× more parameters than AlexNet.

The computational cost of Inception is also much lower than VGGNet or its higher performing successors [6]. This has made it feasible to utilize Inception networks in big-data scenarios[17], [13], where huge amount of data needed to be processed at reasonable cost or scenarios where memory or computational capacity is inherently limited, for example in mobile vision settings. It is certainly possible to mitigate parts of these issues by applying specialized solutions to target memory use [2], [15] or by optimizing the execution of certain operations via computational tricks [10]. However, these methods add extra complexity. Furthermore, these methods could be applied to optimize the Inception architecture as well, widening the efficiency gap again.


Still, the complexity of the Inception architecture makes it more difficult to make changes to the network. If the architecture is scaled up naively, large parts of the computational gains can be immediately lost. Also, [20] does not provide a clear description of the contributing factors that lead to the various design decisions of the GoogLeNet architecture. This makes it much harder to adapt it to new use cases while maintaining its efficiency. For example, if it is deemed necessary to increase the capacity of some Inception-style model, the simple transformation of just doubling all filter bank sizes will lead to a 4× increase in both computational cost and number of parameters. This might prove prohibitive or unreasonable in a lot of practical scenarios, especially if the associated gains are modest. In this paper, we start by describing a few general principles and optimization ideas that proved to be useful for scaling up convolutional networks in efficient ways. Although our principles are not limited to Inception-type networks, they are easier to observe in that context, as the generic structure of the Inception-style building blocks is flexible enough to incorporate those constraints naturally. This is enabled by the generous use of dimension reduction and parallel structures in the Inception modules, which allows for mitigating the impact of structural changes on nearby components. Still, one needs to be cautious about doing so, as some guiding principles should be observed to maintain high quality of the models.
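The 4× figure follows because the cost of a convolution scales with the product of its input and output channel counts. The following is a minimal sketch under simplifying assumptions (stride 1, same padding, no biases); the function name `conv_cost` and the layer sizes are illustrative and not taken from the paper.

```python
def conv_cost(height, width, c_in, c_out, kernel):
    """Return (multiply_adds, parameters) for one kernel x kernel convolution.

    Assumes stride 1, 'same' padding and no bias terms (illustrative simplification).
    """
    params = kernel * kernel * c_in * c_out
    mult_adds = height * width * params  # every weight is used once per output position
    return mult_adds, params


# Hypothetical layer: 35x35 grid, 64 -> 96 channels, 3x3 kernel.
base_ops, base_params = conv_cost(35, 35, 64, 96, 3)
# Doubling every filter bank doubles both the input and the output channel count.
big_ops, big_params = conv_cost(35, 35, 128, 192, 3)

ops_ratio = big_ops / base_ops
params_ratio = big_params / base_params
print(ops_ratio, params_ratio)
```

Both ratios come out as 4 because each of the two channel counts contributes a factor of 2.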


2. General Design Principles

Here we will describe a few design principles based on large-scale experimentation with various architectural choices with convolutional networks. At this point, the utility of the principles below is speculative, and additional future experimental evidence will be necessary to assess their accuracy and domain of validity. Still, grave deviations from these principles tended to result in deterioration in the quality of the networks, and fixing situations where those deviations were detected resulted in improved architectures in general.

  • 1. Avoid representational bottlenecks, especially early in the network. Feed-forward networks can be represented by an acyclic graph from the input layer(s) to the classifier or regressor. This defines a clear direction for the information flow. For any cut separating the inputs from the outputs, one can assess the amount of information passing through the cut. One should avoid bottlenecks with extreme compression. In general the representation size should gently decrease from the inputs to the outputs before reaching the final representation used for the task at hand. Theoretically, information content cannot be assessed merely by the dimensionality of the representation, as it discards important factors like correlation structure; the dimensionality merely provides a rough estimate of information content.

  • 2. Higher dimensional representations are easier to process locally within a network. Increasing the activations per tile in a convolutional network allows for more disentangled features. The resulting networks will train faster.

  • 3. Spatial aggregation can be done over lower dimensional embeddings without much or any loss in representational power. For example, before performing a more spread out (e.g. 3 × 3) convolution, one can reduce the dimension of the input representation before the spatial aggregation without expecting serious adverse effects. We hypothesize that the reason for this is that the strong correlation between adjacent units results in much less loss of information during dimension reduction, if the outputs are used in a spatial aggregation context. Given that these signals should be easily compressible, the dimension reduction even promotes faster learning.

  • 4. Balance the width and depth of the network. Optimal performance of the network can be reached by balancing the number of filters per stage and the depth of the network. Increasing both the width and the depth of the network can contribute to higher quality networks. However, the optimal improvement for a constant amount of computation can be reached if both are increased in parallel. The computational budget should therefore be distributed in a balanced way between the depth and width of the network.


3. Factorizing Convolutions with Large Filter Size

Much of the original gains of the GoogLeNet network [20] arise from a very generous use of dimension reduction. This can be viewed as a special case of factorizing convolutions in a computationally efficient manner. Consider for example the case of a 1 × 1 convolutional layer followed by a 3 × 3 convolutional layer. In a vision network, it is expected that the outputs of near-by activations are highly correlated. Therefore, we can expect that their activations can be reduced before aggregation and that this should result in similarly expressive local representations.


Here we explore other ways of factorizing convolutions in various settings, especially in order to increase the computational efficiency of the solution. Since Inception networks are fully convolutional, each weight corresponds to one multiplication per activation. Therefore, any reduction in computational cost results in reduced number of parameters. This means that with suitable factorization, we can end up with more disentangled parameters and therefore with faster training. Also, we can use the computational and memory savings to increase the filter-bank sizes of our network while maintaining our ability to train each model replica on a single computer.


3.1. Factorization into smaller convolutions

Convolutions with larger spatial filters (e.g. 5 × 5 or 7 × 7) tend to be disproportionally expensive in terms of computation. For example, a 5 × 5 convolution with n filters over a grid with m filters is 25/9 = 2.78 times more computationally expensive than a 3 × 3 convolution with the same number of filters. Of course, a 5×5 filter can capture dependencies between signals between activations of units further away in the earlier layers, so a reduction of the geometric size of the filters comes at a large cost of expressiveness. However, we can ask whether a 5×5 convolution could be replaced by a multi-layer network with less parameters with the same input size and output depth. If we zoom into the computation graph of the 5 × 5 convolution, we see that each output looks like a small fully-connected network sliding over 5×5 tiles over its input (see Figure 1). Since we are constructing a vision network, it seems natural to exploit translation invariance again and replace the fully connected component by a two layer convolutional architecture: the first layer is a 3×3 convolution, the second is a fully connected layer on top of the 3 × 3 output grid of the first layer (see Figure 1). Sliding this small network over the input activation grid boils down to replacing the 5 × 5 convolution with two layers of 3 × 3 convolution (compare Figure 4 with 5).


This setup clearly reduces the parameter count by sharing the weights between adjacent tiles. To analyze the expected computational cost savings, we will make a few simplifying assumptions that apply to typical situations: we can assume that n = αm, that is, that we want to change the number of activations per unit by a constant factor α. Since the 5 × 5 convolution is aggregating, α is typically slightly larger than one (around 1.5 in the case of GoogLeNet). Having a two-layer replacement for the 5 × 5 layer, it seems reasonable to reach this expansion in two steps, increasing the number of filters by √α in both steps. To simplify our estimate, we choose α = 1 (no expansion). If we naively slid the network without reusing the computation between neighboring grid tiles, we would increase the computational cost. However, sliding this network can be represented by two 3 × 3 convolutional layers that reuse the activations between adjacent tiles. This way, we end up with a net $\frac{9 + 9}{25}$× reduction of computation, resulting in a relative gain of 28% by this factorization. The exact same saving holds for the parameter count, as each parameter is used exactly once in the computation of the activation of each unit. Still, this setup raises two general questions: Does this replacement result in any loss of expressiveness? If our main goal is to factorize the linear part of the computation, would it not suggest keeping linear activations in the first layer? We have run several control experiments (for example, see Figure 2), and using linear activation was always inferior to using rectified linear units in all stages of the factorization. We attribute this gain to the enhanced space of variations that the network can learn, especially if we batch-normalize [7] the output activations. One can see similar effects when using linear activations for the dimension reduction components.
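The 28% figure can be checked with a few lines of arithmetic. This is a sketch for the α = 1 case only (equal channel counts everywhere, biases ignored); the helper names and the channel count `m = 64` are hypothetical.

```python
def stack_cost(kernels, channels):
    """Multiply-adds per output position for a stack of square convolutions.

    With biases ignored this also equals the parameter count of the stack.
    """
    assert len(channels) == len(kernels) + 1
    return sum(k * k * c_in * c_out
               for k, c_in, c_out in zip(kernels, channels[:-1], channels[1:]))


def receptive_field(kernels):
    """Receptive field of stacked stride-1 convolutions."""
    return 1 + sum(k - 1 for k in kernels)


m = 64  # hypothetical channel count; alpha = 1 means it stays constant
single_5x5 = stack_cost([5], [m, m])
two_3x3 = stack_cost([3, 3], [m, m, m])

cost_ratio = two_3x3 / single_5x5   # (9 + 9) / 25
saving = 1 - cost_ratio             # relative gain of the factorization
print(receptive_field([5]), receptive_field([3, 3]), cost_ratio, saving)
```

The two stacked 3 × 3 layers cover the same 5 × 5 receptive field while using 18/25 of the multiply-adds.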

Figure 2. One of several control experiments between two Inception models: one uses factorization into linear + ReLU layers, the other uses two ReLU layers. After 3.86 million operations, the former settles at 76.2%, while the latter reaches 77.2% top-1 accuracy on the validation set.

3.2. Spatial Factorization into Asymmetric Convolutions

The above results suggest that convolutions with filters larger than 3 × 3 might not be generally useful, as they can always be reduced into a sequence of 3 × 3 convolutional layers. Still, we can ask whether one should factorize them into smaller, for example 2 × 2, convolutions. However, it turns out that one can do even better than 2 × 2 by using asymmetric convolutions, e.g. n × 1. For example, using a 3 × 1 convolution followed by a 1 × 3 convolution is equivalent to sliding a two-layer network with the same receptive field as in a 3 × 3 convolution (see Figure 3). Still, the two-layer solution is 33% cheaper for the same number of output filters, if the number of input and output filters is equal. By comparison, factorizing a 3 × 3 convolution into two 2 × 2 convolutions represents only an 11% saving of computation.
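The 33% and 11% savings can be reproduced by counting weights per input/output channel pair. A minimal sketch, assuming equal input and output filter counts as in the text; `relative_saving` is an illustrative helper, not something defined in the paper.

```python
from fractions import Fraction


def relative_saving(original, replacement):
    """Fraction of computation saved when one kernel is replaced by a stack.

    original: (height, width) of the single kernel.
    replacement: list of (height, width) kernels applied in sequence.
    Assumes every layer has the same number of input and output filters.
    """
    original_cost = original[0] * original[1]
    replacement_cost = sum(h * w for h, w in replacement)
    return 1 - Fraction(replacement_cost, original_cost)


asymmetric = relative_saving((3, 3), [(3, 1), (1, 3)])  # 3x1 followed by 1x3
two_by_two = relative_saving((3, 3), [(2, 2), (2, 2)])  # two 2x2 layers
print(asymmetric, two_by_two)
```

Here the asymmetric pair uses 6 weights against 9, while the two 2 × 2 layers use 8 against 9.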

In theory, we could go even further and argue that one can replace any n × n convolution by a 1 × n convolution followed by a n×1 convolution and the computational cost saving increases dramatically as n grows (see figure 6). In practice, we have found that employing this factorization does not work well on early layers, but it gives very good results on medium grid-sizes (On m×m feature maps, where m ranges between 12 and 20). On that level, very good results can be achieved by using 1 × 7 convolutions followed by 7 × 1 convolutions.

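To see how the saving grows with n, note that a 1 × n convolution followed by an n × 1 convolution uses 2n weights per channel pair instead of n². The sketch below tabulates the resulting saving 1 − 2/n, again assuming equal input and output filter counts; the function name is illustrative.

```python
from fractions import Fraction


def asymmetric_saving(n):
    """Saving from replacing an n x n kernel by a 1 x n kernel followed by an n x 1 kernel."""
    return 1 - Fraction(2 * n, n * n)


savings = {n: asymmetric_saving(n) for n in (3, 5, 7)}
for n, s in savings.items():
    print(f"{n}x{n}: {float(s):.0%} cheaper")
```

For the 1 × 7 and 7 × 1 pair mentioned above, the saving is 5/7 of the cost of a full 7 × 7 convolution.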


4. Utility of Auxiliary Classifiers

[20] has introduced the notion of auxiliary classifiers to improve the convergence of very deep networks. The original motivation was to push useful gradients to the lower layers to make them immediately useful and improve the convergence during training by combating the vanishing gradient problem in very deep networks. Also Lee et al[11] argues that auxiliary classifiers promote more stable learning and better convergence. Interestingly, we found that auxiliary classifiers did not result in improved convergence early in the training: the training progression of network with and without side head looks virtually identical before both models reach high accuracy. Near the end of training, the network with the auxiliary branches starts to overtake the accuracy of the network without any auxiliary branch and reaches a slightly higher plateau.


Also, [20] used two side-heads at different stages in the network. The removal of the lower auxiliary branch did not have any adverse effect on the final quality of the network. Together with the observation in the previous paragraph, this means that the original hypothesis of [20], that these branches help evolve the low-level features, is most likely misplaced. Instead, we argue that the auxiliary classifiers act as a regularizer. This is supported by the fact that the main classifier of the network performs better if the side branch is batch-normalized [7] or has a dropout layer. This also gives weak supporting evidence for the conjecture that batch normalization acts as a regularizer.

Figure 7. Inception modules with expanded filter bank outputs. This architecture is used on the coarsest (8 × 8) grids to promote high dimensional representations, as suggested by principle 2 of Section 2. We are using this solution only on the coarsest grid, since that is the place where producing high dimensional sparse representations is the most critical, as the ratio of local processing (by 1 × 1 convolutions) is increased compared to the spatial aggregation.


5. Efficient Grid Size Reduction

Traditionally, convolutional networks used some pooling operation to decrease the grid size of the feature maps. In order to avoid a representational bottleneck, the activation dimension of the network filters is expanded before applying maximum or average pooling. For example, starting with a $d \times d$ grid with $k$ filters, if we would like to arrive at a $\frac{d}{2} \times \frac{d}{2}$ grid with $2k$ filters, we first need to compute a stride-1 convolution with $2k$ filters and then apply an additional pooling step. This means that the overall computational cost is dominated by the expensive convolution on the larger grid, using $2d^2k^2$ operations. One possibility would be to switch the order and pool before the convolution, resulting in $2(\frac{d}{2})^2k^2$ operations and reducing the computational cost to a quarter. However, this creates a representational bottleneck, as the overall dimensionality of the representation drops to $(\frac{d}{2})^2k$, resulting in less expressive networks (see Figure 9). Instead, we suggest another variant that reduces the computational cost even further while removing the representational bottleneck (see Figure 10). We can use two parallel stride-2 blocks, P and C: P is a pooling layer (either average or maximum pooling) and C is a convolution layer, both with stride 2, whose filter banks are concatenated as in Figure 10.
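The trade-off between the grid-reduction variants can be made concrete with the operation counts quoted above. This is a rough sketch: it uses the same per-position counting as the text (kernel-size constants dropped), and the cost assigned to the parallel variant is an assumption of this example (a stride-2 convolution branch with k filters next to a pooling branch treated as free), not a figure from the paper. The values of `d` and `k` are hypothetical.

```python
def grid_reduction_costs(d, k):
    """Operation counts and representation sizes for reducing a d x d x k grid to (d/2) x (d/2) x 2k."""
    half = d // 2
    return {
        # stride-1 convolution to 2k filters on the full grid, then pooling
        "conv_then_pool": 2 * d**2 * k**2,
        # pooling first, then convolution to 2k filters on the small grid
        "pool_then_conv": 2 * half**2 * k**2,
        # assumed: stride-2 convolution branch with k filters, pooling branch counted as free
        "parallel_pool_and_conv": half**2 * k**2,
        # representation size right after pooling in the pool-first variant (the bottleneck)
        "bottleneck_size": half**2 * k,
        # size of the concatenated output of the parallel variant
        "parallel_output_size": half**2 * 2 * k,
    }


costs = grid_reduction_costs(d=32, k=64)
for name, value in costs.items():
    print(f"{name}: {value}")
```

Under these assumptions the pool-first variant costs a quarter of the convolution-first variant but squeezes the representation to (d/2)²k, whereas the parallel variant keeps the 2k-channel output without passing through that bottleneck.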
Figure 8. Auxiliary classifier on top of the last 17 × 17 layer. Batch normalization [7] of the layers in the side head results in a 0.4% absolute gain in top-1 accuracy. The lower axis shows the number of iterations performed, each with batch size 32.

Figure 9. Two alternative ways of reducing the grid size. The solution on the left violates principle 1 from Section 2 of not introducing a representational bottleneck. The version on the right is 3 times more expensive computationally.

Figure 10. Inception module that reduces the grid size while expanding the filter banks. It is both cheap and avoids the representational bottleneck, as suggested by principle 1. The diagram on the right represents the same solution but from the perspective of grid sizes rather than the operations.


6. Inception-v2

Here we connect the dots from above and propose a new architecture with improved performance on the ILSVRC 2012 classification benchmark. The layout of our network is given in Table 1. Note that we have factorized the traditional 7 × 7 convolution into three 3 × 3 convolutions based on the same ideas as described in Section 3.1. For the Inception part of the network, we have 3 traditional Inception modules at the 35 × 35 grid with 288 filters each. This is reduced to a 17 × 17 grid with 768 filters using the grid reduction technique described in Section 5. This is followed by 5 instances of the factorized Inception modules as depicted in Figure 5. This is reduced to an 8 × 8 × 1280 grid with the grid reduction technique depicted in Figure 10. At the coarsest 8 × 8 level, we have two Inception modules as depicted in Figure 6, with a concatenated output filter bank size of 2048 for each tile. The detailed structure of the network, including the sizes of filter banks inside the Inception modules, is given in the supplementary material, in the model.txt that is in the tar-file of this submission.
Table 1. The outline of the proposed network architecture. The output size of each module is the input size of the next one. We are using variations of the reduction technique depicted in Figure 10 to reduce the grid sizes between the Inception blocks whenever applicable. We have marked the convolution with 0-padding, which is used to maintain the grid size. 0-padding is also used inside those Inception modules that do not reduce the grid size. All other layers do not use padding. The various filter bank sizes are chosen to observe principle 4 from Section 2.

However, we have observed that the quality of the network is relatively stable to variations as long as the principles from Section 2 are observed. Although our network is 42 layers deep, our computation cost is only about 2.5 times higher than that of GoogLeNet, and it is still much more efficient than VGGNet.


7. Model Regularization via Label Smoothing

Here we propose a mechanism to regularize the classifier layer by estimating the marginalized effect of label-dropout during training.


For each training example $x$, our model computes the probability of each label $k \in \{1 \ldots K\}$: $p(k|x) = \frac{\exp(z_k)}{\sum_{i=1}^{K} \exp(z_i)}$. Here, $z_i$ are the logits or unnormalized log-probabilities. Consider the ground-truth distribution over labels $q(k|x)$ for this training example, normalized so that $\sum_k q(k|x) = 1$. For brevity, let us omit the dependence of $p$ and $q$ on example $x$. We define the loss for the example as the cross entropy: $\ell = -\sum_{k=1}^{K} \log(p(k)) q(k)$. Minimizing this is equivalent to maximizing the expected log-likelihood of a label, where the label is selected according to its ground-truth distribution $q(k)$. Cross-entropy loss is differentiable with respect to the logits $z_k$ and thus can be used for gradient training of deep models. The gradient has a rather simple form: $\frac{\partial \ell}{\partial z_k} = p(k) - q(k)$, which is bounded between $-1$ and $1$.
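The gradient formula p(k) − q(k) can be verified numerically. The following is a small sketch in plain Python; the logits and the one-hot target are arbitrary illustrative values, and the helper names are not from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_loss(logits, q):
    """Cross entropy between target distribution q and softmax(logits)."""
    p = softmax(logits)
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p) if qk > 0)


def analytic_gradient(logits, q):
    """Gradient of the loss with respect to the logits: p(k) - q(k)."""
    return [pk - qk for pk, qk in zip(softmax(logits), q)]


def numeric_gradient(logits, q, step=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grads = []
    for i in range(len(logits)):
        up, down = list(logits), list(logits)
        up[i] += step
        down[i] -= step
        grads.append((cross_entropy_loss(up, q) - cross_entropy_loss(down, q)) / (2 * step))
    return grads


logits = [2.0, -1.0, 0.5]   # illustrative values
q = [1.0, 0.0, 0.0]         # one-hot ground truth
analytic = analytic_gradient(logits, q)
numeric = numeric_gradient(logits, q)
print(analytic)
print(numeric)
```

Because p(k) and q(k) both lie in [0, 1], every gradient component stays within [−1, 1], as stated above.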

Consider the case of a single ground-truth label $y$, so that $q(y) = 1$ and $q(k) = 0$ for all $k \ne y$. In this case, minimizing the cross entropy is equivalent to maximizing the log-likelihood of the correct label. For a particular example $x$ with label $y$, the log-likelihood is maximized for $q(k) = \delta_{k,y}$, where $\delta_{k,y}$ is Dirac delta, which equals 1 for $k = y$ and 0 otherwise. This maximum is not achievable for finite $z_k$ but is approached if $z_y \gg z_k$ for all $k \ne y$, that is, if the logit corresponding to the ground-truth label is much greater than all other logits. This, however, can cause two problems. First, it may result in over-fitting: if the model learns to assign full probability to the ground-truth label for each training example, it is not guaranteed to generalize. Second, it encourages the differences between the largest logit and all others to become large, and this, combined with the bounded gradient $\frac{\partial \ell}{\partial z_k}$, reduces the ability of the model to adapt. Intuitively, this happens because the model becomes too confident about its predictions.

We propose a mechanism for encouraging the model to be less confident. While this may not be desired if the goal is to maximize the log-likelihood of training labels, it does regularize the model and makes it more adaptable. The method is very simple. Consider a distribution over labels $u(k)$, independent of the training example $x$, and a smoothing parameter $\epsilon$. For a training example with ground-truth label $y$, we replace the label distribution $q(k|x) = \delta_{k,y}$ with

$$q'(k|x) = (1 - \epsilon)\,\delta_{k,y} + \epsilon\,u(k)$$

which is a mixture of the original ground-truth distribution $q(k|x)$ and the fixed distribution $u(k)$, with weights $1 - \epsilon$ and $\epsilon$, respectively. This can be seen as the distribution of the label $k$ obtained as follows: first, set it to the ground-truth label $k = y$; then, with probability $\epsilon$, replace $k$ with a sample drawn from the distribution $u(k)$. We propose to use the prior distribution over labels as $u(k)$. In our experiments, we used the uniform distribution $u(k) = 1/K$, so that

$$q'(k) = (1 - \epsilon)\,\delta_{k,y} + \frac{\epsilon}{K}$$

We refer to this change in ground-truth label distribution as label-smoothing regularization, or LSR.
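
The smoothed target $q'(k) = (1 - \epsilon)\,\delta_{k,y} + \epsilon\,u(k)$ can be sketched in a few lines of plain Python. The function name `smooth_labels` and its arguments are illustrative choices, not taken from the paper; the uniform prior is used when no `prior` is passed.

```python
def smooth_labels(y, num_classes, epsilon=0.1, prior=None):
    """Return q'(k) = (1 - epsilon) * delta_{k,y} + epsilon * u(k) as a list.

    `prior` is u(k); when omitted, the uniform distribution 1/K is assumed.
    """
    if prior is None:
        prior = [1.0 / num_classes] * num_classes
    return [
        (1.0 - epsilon) * (1.0 if k == y else 0.0) + epsilon * prior[k]
        for k in range(num_classes)
    ]


# Toy example with K = 4 classes and ground-truth class y = 2.
q_prime = smooth_labels(y=2, num_classes=4, epsilon=0.1)
print(q_prime)
```

In this toy case the true class keeps about 0.925 of the probability mass and every other class receives about 0.025, so the targets still sum to 1.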

Note that LSR achieves the desired goal of preventing the largest logit from becoming much larger than all others. Indeed, if this were to happen, then a single predicted probability $p(k)$ would approach 1 while all others would approach 0. This would result in a large cross-entropy with $q'(k)$ because, unlike $q(k) = \delta_{k,y}$, all $q'(k)$ have a positive lower bound.
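
A small numerical sketch can illustrate this effect. It assumes a softmax over the logits, a uniform $u(k)$, and a toy setting in which the true-class logit exceeds all other (equal) logits by a fixed `gap`; the helper names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_norm for z in logits]


def cross_entropy(target, logits):
    """H(target, p) = -sum_k target(k) * log p(k), with p = softmax(logits)."""
    return -sum(t * lp for t, lp in zip(target, log_softmax(logits)))


K, y, eps = 1000, 0, 0.1
hard = [1.0 if k == y else 0.0 for k in range(K)]   # q(k) = delta_{k,y}
smooth = [(1.0 - eps) * h + eps / K for h in hard]  # q'(k) with uniform u(k)


def losses_for_gap(gap):
    """Return (hard-label loss, smoothed-label loss) for a given logit gap."""
    logits = [gap if k == y else 0.0 for k in range(K)]
    return cross_entropy(hard, logits), cross_entropy(smooth, logits)


for gap in (0.0, 5.0, 9.1, 20.0, 50.0):
    hard_loss, smooth_loss = losses_for_gap(gap)
    print(f"gap={gap:5.1f}  hard={hard_loss:.4f}  smoothed={smooth_loss:.4f}")
```

With hard labels the loss keeps shrinking as the gap grows, so nothing stops the logits from diverging. With smoothed labels the loss bottoms out at a finite gap and then rises again, because every class with $q'(k) > 0$ contributes a $-\log p(k)$ term that grows with the gap.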

Another interpretation of LSR can be obtained by considering the cross entropy:

$$H(q', p) = -\sum_{k=1}^{K} \log p(k)\, q'(k) = (1 - \epsilon)\, H(q, p) + \epsilon\, H(u, p)$$

Thus, LSR is equivalent to replacing a single cross-entropy loss $H(q, p)$ with a pair of such losses $H(q, p)$ and $H(u, p)$.

The second loss penalizes the deviation of the predicted label distribution $p$ from the prior $u$, with the relative weight $\frac{\epsilon}{1-\epsilon}$. Note that this deviation could be equivalently captured by the KL divergence, since $H(u, p) = D_{KL}(u \| p) + H(u)$ and $H(u)$ is fixed. When $u$ is the uniform distribution, $H(u, p)$ is a measure of how dissimilar the predicted distribution $p$ is to uniform, which could also be measured (but not equivalently) by negative entropy $-H(p)$; we have not experimented with this approach.
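
Both identities used in this argument, $H(q', p) = (1 - \epsilon) H(q, p) + \epsilon H(u, p)$ and $H(u, p) = D_{KL}(u \| p) + H(u)$, are easy to check numerically. The sketch below does so on an arbitrary 5-class prediction; the distribution `p` and the helper names are made up for illustration.

```python
import math


def cross_entropy(target, predicted):
    """H(target, predicted) = -sum_k target(k) * log predicted(k)."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0.0)


def entropy(dist):
    """H(dist) = -sum_k dist(k) * log dist(k)."""
    return cross_entropy(dist, dist)


def kl_divergence(a, b):
    """D_KL(a || b) = sum_k a(k) * log(a(k) / b(k))."""
    return sum(x * math.log(x / z) for x, z in zip(a, b) if x > 0.0)


K, y, eps = 5, 1, 0.1
p = [0.10, 0.60, 0.10, 0.15, 0.05]                    # arbitrary predicted distribution
q = [1.0 if k == y else 0.0 for k in range(K)]        # one-hot ground truth
u = [1.0 / K] * K                                     # uniform prior
q_prime = [(1.0 - eps) * q[k] + eps * u[k] for k in range(K)]

lhs = cross_entropy(q_prime, p)
rhs = (1.0 - eps) * cross_entropy(q, p) + eps * cross_entropy(u, p)
print(lhs, rhs)                                        # the two values agree

print(cross_entropy(u, p), kl_divergence(u, p) + entropy(u))  # these agree too
```

The first identity follows from the linearity of cross entropy in its first argument, which is why the smoothed loss splits into the usual term plus a penalty that pulls $p$ toward $u$.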

In our ImageNet experiments with $K = 1000$ classes, we used $u(k) = 1/1000$ and $\epsilon = 0.1$. For ILSVRC 2012, we have found a consistent improvement of about 0.2% absolute both for top-1 error and the top-5 error (cf. Table 3).
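
As a worked example with these values, the smoothed targets are

$$q'(y) = (1 - 0.1) + \frac{0.1}{1000} = 0.9001, \qquad q'(k) = \frac{0.1}{1000} = 0.0001 \quad (k \ne y),$$

which still sum to one: $0.9001 + 999 \times 0.0001 = 1$. Assuming a softmax output $p(k) \propto \exp(z_k)$ and, for simplicity, equal logits for all incorrect classes, the cross entropy is minimized when $p = q'$, which corresponds to a finite logit gap of

$$z_y - z_k = \ln\frac{q'(y)}{q'(k)} = \ln 9001 \approx 9.1 .$$

This is only an illustration of why the logits stay bounded under LSR; the paper itself does not state this number.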


8. Training Methodology

We have trained our networks with stochastic gradient descent utilizing the TensorFlow [1] distributed machine learning system, using 50 replicas, each running on an NVidia Kepler GPU, with batch size 32 for 100 epochs. Our earlier experiments used momentum [19] with a decay of 0.9, while our best models were achieved using RMSProp [21] with a decay of 0.9 and $\epsilon = 1.0$. We used a learning rate of 0.045, decayed every two epochs using an exponential rate of 0.94. In addition, gradient clipping [14] with threshold 2.0 was found to be useful to stabilize the training. Model evaluations are performed using a running average of the parameters computed over time.
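
The hyperparameters above can be wired together in a minimal, framework-free sketch. Several details are assumptions rather than statements from the paper: the decay is applied as a staircase (one multiplication by 0.94 every two full epochs), clipping is done on the global gradient norm, and $\epsilon$ is added inside the square root of the RMSProp update. All function names are hypothetical.

```python
import math


def learning_rate(epoch, base_lr=0.045, decay_rate=0.94, epochs_per_decay=2):
    """Staircase exponential decay (assumed): multiply by 0.94 every two epochs."""
    return base_lr * decay_rate ** (epoch // epochs_per_decay)


def clip_by_global_norm(grads, threshold=2.0):
    """Rescale the gradient vector if its L2 norm exceeds the threshold (assumed variant)."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= threshold:
        return list(grads)
    scale = threshold / norm
    return [g * scale for g in grads]


def rmsprop_step(weights, grads, mean_square, lr, decay=0.9, epsilon=1.0):
    """One RMSProp update; returns the new weights and the new running mean square."""
    new_ms = [decay * ms + (1.0 - decay) * g * g
              for ms, g in zip(mean_square, grads)]
    new_w = [w - lr * g / math.sqrt(ms + epsilon)
             for w, g, ms in zip(weights, grads, new_ms)]
    return new_w, new_ms


# Toy usage: one clipped update on a two-parameter model at epoch 4.
weights, mean_square = [0.5, -0.3], [0.0, 0.0]
grads = clip_by_global_norm([3.0, 4.0])          # norm 5.0 is rescaled to norm 2.0
weights, mean_square = rmsprop_step(weights, grads, mean_square,
                                    lr=learning_rate(epoch=4))
print(learning_rate(epoch=4), weights)
```

A value of $\epsilon = 1.0$ is unusually large for RMSProp; with it, the denominator stays close to 1 whenever the running mean square is small, so the update behaves much like plain gradient descent in that regime.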


9. Performance on Lower Resolution Input

A typical use-case of vision networks is the post-classification of detections, for example in the Multibox [4] context. This includes the analysis of a relatively small patch of the image containing a single object with some context. The task is to decide whether the center part of the patch corresponds to some object and, if it does, to determine the class of the object. The challenge is that objects tend to be relatively small and low-resolution. This raises the question of how to properly deal with lower resolution input.

The common wisdom is that models employing higher resolution receptive fields tend to result in significantly improved recognition performance. However, it is important to distinguish between the effect of the increased resolution of the first layer receptive field and the effects of larger model capacity and computation. If we just change the resolution of the input without further adjustment to the model, then we end up using computationally much cheaper models to solve more difficult tasks. Of course, it is natural that these solutions lose out already because of the reduced computational effort. In order to make an accurate assessment, the model needs to analyze vague hints in order to be able to “hallucinate” the fine details. This is computationally costly. The question remains, therefore: how much does higher input resolution help if the computational effort is kept constant? One simple way to ensure constant effort is to reduce the strides of the first two layers in the case of lower resolution input, or to simply remove the first pooling layer of the network. For this purpose we have performed the following three experiments:

  • 299 × 299 receptive field with stride 2 and maximum pooling after the first layer.
  • 151 × 151 receptive field with stride 1 and maximum pooling after the first layer.
  • 79 × 79 receptive field with stride 1 and without pooling after the first layer.

All three networks have almost identical computational cost. Although the third network is slightly cheaper, the cost of the pooling layer is marginal (within 1% of the total cost of the network). In each case, the networks were trained until convergence and their quality was measured on the validation set of the ImageNet ILSVRC 2012 classification benchmark. The results can be seen in Table 2. Although the lower-resolution networks take longer to train, the quality of the final result is quite close to that of their higher resolution counterparts.
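
One way to see why the three settings cost about the same is to track the spatial grid that reaches the rest of the network. The sketch below is a deliberately simplified reading of the three experiments, not the exact configuration used in the paper: it assumes a single unpadded 3 × 3 convolution with the stated stride, optionally followed by a 3 × 3 max pooling with stride 2. The kernel sizes and function names are assumptions.

```python
def out_size(size, kernel, stride):
    """Spatial size after an unpadded ('valid') convolution or pooling window."""
    return (size - kernel) // stride + 1


def grid_after_first_stage(resolution, conv_stride, use_pool):
    """Grid size after the first conv and, optionally, the first pooling layer."""
    size = out_size(resolution, kernel=3, stride=conv_stride)  # 3x3 conv (assumed)
    if use_pool:
        size = out_size(size, kernel=3, stride=2)              # 3x3 max pool, stride 2 (assumed)
    return size


# (input resolution, first-layer stride, pooling after the first layer)
experiments = [(299, 2, True), (151, 1, True), (79, 1, False)]
for resolution, stride, pool in experiments:
    print(resolution, grid_after_first_stage(resolution, stride, pool))
```

Under these assumptions the three inputs reach grids of 74, 74 and 77 positions per side, so the bulk of the network processes feature maps of nearly the same size in all three cases.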

However, if one would just naively reduce the network size according to the input resolution, then the network would perform much more poorly. However, this would be an unfair comparison, as we would be comparing a 16 times cheaper model on a more difficult task.
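
The "16 times" figure is consistent with a rough rule of thumb: for a fixed architecture, convolutional cost scales with the number of spatial positions, that is, with the square of the input side length. The sketch below applies that approximation (it ignores border effects and any non-convolutional layers); the function name is hypothetical.

```python
def naive_cost_reduction(low_res, high_res=299):
    """Approximate factor by which cost drops when the input side shrinks.

    Assumes cost is proportional to the number of pixels (side length squared).
    """
    return (high_res / low_res) ** 2


print(naive_cost_reduction(151))   # roughly 4x cheaper
print(naive_cost_reduction(79))    # roughly 14x cheaper, i.e. on the order of 16x
```

Halving the side length twice gives exactly $4^2 = 16$; going from 299 to 79 is slightly less than a fourfold reduction per side, hence a factor somewhat below 16.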

The results of Table 2 also suggest that one might consider using dedicated high-cost low-resolution networks for smaller objects in the R-CNN [5] context.


10. Experimental Results and Comparisons

Table 3 shows the experimental results on the recognition performance of our proposed architecture (Inception-v2) as described in Section 6. Each Inception-v2 line shows the result of the cumulative changes, including the highlighted new modification plus all the earlier ones. Label Smoothing refers to the method described in Section 7. Factorized 7 × 7 includes a change that factorizes the first 7 × 7 convolutional layer into a sequence of 3 × 3 convolutional layers. BN-auxiliary refers to the version in which the fully connected layer of the auxiliary classifier is also batch-normalized, not just the convolutions. We refer to the model in the last row of Table 3 as Inception-v3 and evaluate its performance in the multi-crop and ensemble settings.
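
For readers unfamiliar with the metrics, the following sketch shows how top-k error can be computed and how predictions from several crops or several models might be combined. Averaging the class probabilities is a common choice and is assumed here; the paper does not spell out its combination rule in this section. The function names and toy numbers are made up.

```python
def average_predictions(prob_lists):
    """Element-wise mean of several probability vectors (one per crop or model)."""
    n = len(prob_lists)
    return [sum(column) / n for column in zip(*prob_lists)]


def top_k_error(predictions, labels, k):
    """Fraction of examples whose true label is not among the k highest scores."""
    misses = 0
    for probs, label in zip(predictions, labels):
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Two crops of the same image disagree; averaging favors class 1.
crops = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]]
print(average_predictions(crops))

# Three images with three classes each.
predictions = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
print(top_k_error(predictions, labels, k=1), top_k_error(predictions, labels, k=2))
```

Single-crop evaluation corresponds to passing one probability vector per image directly to `top_k_error`; multi-crop and ensemble evaluation first reduce the vectors for each image with `average_predictions`.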

All our evaluations are done on the 48238 non-blacklisted examples of the ILSVRC-2012 validation set, as suggested by [16]. We have evaluated all the 50000 examples as well, and the results were roughly 0.1% worse in top-5 error and around 0.2% worse in top-1 error. In the upcoming version of this paper, we will verify our ensemble result on the test set, but our last evaluation of BN-Inception in spring [7] indicates that the test and validation set errors tend to correlate very well.

Table 3. Single-crop experimental results comparing the cumulative effects of the various contributing factors. We compare our numbers with the best published single-crop inference results of Ioffe et al. [7]. For the “Inception-v2” lines, the changes are cumulative and each subsequent line includes the new change in addition to the previous ones. The last line, which includes all the changes, is what we refer to as “Inception-v3” below. Unfortunately, He et al. [6] report only 10-crop evaluation results, not single-crop results; their results are reported in Table 4 below.


Table 4. Single-model, multi-crop experimental results comparing the cumulative effects of the various contributing factors. We compare our numbers with the best published single-model inference results on the ILSVRC 2012 classification benchmark.


Table 5. Ensemble evaluation results comparing multi-model, multi-crop reported results. Our numbers are compared with the best published ensemble inference results on the ILSVRC 2012 classification benchmark. *All results except the reported top-5 ensemble result are on the validation set. The ensemble yielded 3.46% top-5 error on the validation set.


11. Conclusions

We have provided several design principles to scale up convolutional networks and studied them in the context of the Inception architecture. This guidance can lead to high performance vision networks that have a relatively modest computation cost compared to simpler, more monolithic architectures. Our highest quality version of Inception-v3 reaches 21.2% top-1 and 5.6% top-5 error for single-crop evaluation on the ILSVRC 2012 classification benchmark, setting a new state of the art. This is achieved with a relatively modest (2.5×) increase in computational cost compared to the network described in Ioffe et al. [7]. Still, our solution uses much less computation than the best published results based on denser networks: our model outperforms the results of He et al. [6], cutting the top-5 (top-1) error by 25% (14%) relative, respectively, while being six times cheaper computationally and using at least five times fewer parameters (estimated). Our ensemble of four Inception-v3 models reaches 3.5% top-5 error with multi-crop evaluation, which represents an over 25% reduction relative to the best published results and is almost half the error of the ILSVRC 2014 winning GoogLeNet ensemble.

We have also demonstrated that high quality results can be reached with receptive field resolution as low as 79 × 79. This might prove to be helpful in systems for detecting relatively small objects. We have studied how factorizing convolutions and aggressive dimension reductions inside neural networks can result in networks with relatively low computational cost while maintaining high quality. The combination of lower parameter count and additional regularization with batch-normalized auxiliary classifiers and label-smoothing allows for training high quality networks on relatively modest sized training sets.
