You Only Look Once (YOLO) Paper Translation

Author: Tyan
Blog: noahsnail.com  |  CSDN  |  Jianshu

Note: The author translated this paper for learning purposes only. If there is any infringement, please contact the author to delete this post. Thank you!

Collection of translated papers: https://github.com/SnailTyan/deep-learning-papers-translation

You Only Look Once: Unified, Real-Time Object Detection

Abstract

We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

1. Introduction

Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.

Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [10].

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13]. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.

Figure 1: The YOLO Detection System. Processing images with YOLO is simple and straightforward. Our system (1) resizes the input image to 448 × 448, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model’s confidence.

First, YOLO is extremely fast. Since we frame detection as a regression problem we don’t need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real-time on a webcam please see our project webpage: http://pjreddie.com/yolo/.

Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can’t see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.

Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.

YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones. We examine these tradeoffs further in our experiments.

All of our training and testing code is open source. A variety of pretrained models are also available to download.

2. Unified Detection

We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.

Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.

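As a quick illustration of the grid assignment (a minimal sketch in Python; the helper name and the clamping are my additions, not from the paper):

```python
# Minimal sketch: map a normalized object center in [0, 1] to the grid cell
# responsible for detecting it.
def responsible_cell(x_center, y_center, S=7):
    """Return the (row, col) of the grid cell containing the object center."""
    col = min(int(x_center * S), S - 1)  # clamp so x_center == 1.0 stays in grid
    row = min(int(y_center * S), S - 1)
    return row, col

print(responsible_cell(0.5, 0.7))  # (4, 3): row from y, column from x
```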

Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts. Formally we define confidence as Pr(Object) * IOU_pred^truth. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.

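A minimal sketch of this definition, assuming corner-format boxes (x1, y1, x2, y2); the function names are illustrative, not the paper's code:

```python
# Intersection over union (IOU) of two axis-aligned boxes in corner format.
def iou(box_a, box_b):
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Pr(Object) is 1 when an object is present and 0 otherwise, so the
# confidence target is either the IOU with the ground truth or zero.
def confidence_target(pred_box, truth_box, object_present):
    return iou(pred_box, truth_box) if object_present else 0.0
```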

Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.

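As a sketch of this parametrization (the 448 × 448 input size and the decoding helper are illustrative assumptions):

```python
# Convert one cell-relative prediction (x, y, w, h) to pixel units.
def decode_box(row, col, x, y, w, h, S=7, img_w=448, img_h=448):
    center_x = (col + x) / S * img_w  # (x, y) are offsets within the cell
    center_y = (row + y) / S * img_h
    return center_x, center_y, w * img_w, h * img_h  # (w, h) span the image
```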

Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object). These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.

At test time we multiply the conditional class probabilities and the individual box confidence predictions,

Pr(Class_i | Object) * Pr(Object) * IOU_pred^truth = Pr(Class_i) * IOU_pred^truth

which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.

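In array form, the product can be sketched as follows (the (S, S, C) and (S, S, B) shapes are an assumed layout for illustration, not the paper's code):

```python
import numpy as np

def class_specific_scores(class_probs, box_conf):
    """Pr(Class_i | Object) * Pr(Object) * IOU = Pr(Class_i) * IOU, per box.

    class_probs: (S, S, C) conditional class probabilities per cell
    box_conf:    (S, S, B) confidence scores per predicted box
    """
    # Broadcast so every box in a cell shares that cell's class probabilities.
    return box_conf[..., :, None] * class_probs[..., None, :]  # (S, S, B, C)

scores = class_specific_scores(np.random.rand(7, 7, 20), np.random.rand(7, 7, 2))
print(scores.shape)  # (7, 7, 2, 20)
```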

For evaluating YOLO on Pascal VOC, we use S = 7, B = 2. Pascal VOC has 20 labelled classes so C = 20. Our final prediction is a 7 × 7 × 30 tensor.

The Model. Our system models detection as a regression problem. It divides the image into an S × S grid and for each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities. These predictions are encoded as an S × S × (B * 5 + C) tensor.


2.1. Network Design

We implement this model as a convolutional neural network and evaluate it on the Pascal VOC detection dataset [9]. The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.

Our network architecture is inspired by the GoogLeNet model for image classification [34]. Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers, similar to Lin et al [22]. The full network is shown in Figure 3.

Figure 3: The Architecture. Our detection network has 24 convolutional layers followed by 2 fully connected layers. Alternating 1 × 1 convolutional layers reduce the feature space from preceding layers. We pretrain the convolutional layers on the ImageNet classification task at half the resolution (224 × 224 input image) and then double the resolution for detection.

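A sketch of the 1 × 1 reduction followed by 3 × 3 convolution pattern, written with PyTorch for illustration; the channel counts are assumptions, not the exact layer sizes from Figure 3:

```python
import torch.nn as nn

def reduction_block(in_ch, mid_ch, out_ch):
    """1x1 layer to reduce feature depth, then a 3x3 convolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1),              # 1x1 reduction
        nn.LeakyReLU(0.1),                                    # activation from Section 2.2
        nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=1),  # 3x3 convolution
        nn.LeakyReLU(0.1),
    )

block = reduction_block(512, 256, 512)  # illustrative channel counts
```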

We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.

The final output of our network is the 7 × 7 × 30 tensor of predictions.

2.2. Training

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [30]. For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer. We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe's Model Zoo [24]. We use the Darknet framework for all training and inference [26].

We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [29]. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.

Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.

We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:

φ(x) = x,    if x > 0
φ(x) = 0.1x, otherwise

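Written out in numpy, the activation is:

```python
import numpy as np

def leaky_relu(x):
    return np.where(x > 0, x, 0.1 * x)

print(leaky_relu(np.array([-2.0, 3.0])))  # [-0.2  3. ]
```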

We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.

To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don't contain objects. We use two parameters, λ_coord and λ_noobj, to accomplish this. We set λ_coord = 5 and λ_noobj = 0.5.

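A simplified sketch of how these weights enter the loss; the decomposition into pre-summed squared-error terms is mine, not the paper's exact formula:

```python
LAMBDA_COORD = 5.0   # boost the gradient from box coordinate errors
LAMBDA_NOOBJ = 0.5   # damp confidence errors from the many empty cells

def weighted_loss(coord_err, conf_err_obj, conf_err_noobj, class_err):
    """Each argument is a sum of squared errors over the relevant cells/boxes."""
    return (LAMBDA_COORD * coord_err
            + conf_err_obj
            + LAMBDA_NOOBJ * conf_err_noobj
            + class_err)
```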

Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.

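A small worked example of why the square root helps: a deviation on a small box should cost more than a comparable deviation on a large box, and the square-root parameterization moves the error in that direction.

```python
import math

# Compare squared error on raw widths vs. square-rooted widths.
for w_true, w_pred in [(0.90, 0.80), (0.10, 0.02)]:
    plain = (w_true - w_pred) ** 2
    rooted = (math.sqrt(w_true) - math.sqrt(w_pred)) ** 2
    print(f"w={w_true}: plain={plain:.4f}, sqrt={rooted:.4f}")
# w=0.9: plain=0.0100, sqrt=0.0029  -> big-box error dominates on raw widths
# w=0.1: plain=0.0064, sqrt=0.0306  -> small-box error dominates after sqrt
```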

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.

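As a sketch, the assignment picks the argmax-IOU box among a cell's B candidates (using iou() from the earlier sketch; the function name is mine):

```python
def responsible_predictor(pred_boxes, truth_box):
    """Index of the predicted box with the highest IOU against the truth."""
    ious = [iou(p, truth_box) for p in pred_boxes]
    return max(range(len(pred_boxes)), key=lambda i: ious[i])
```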