Semantic Relation Reasoning for Shot-Stable Few-Shot Object Detection
Paper: https://arxiv.org/pdf/2103.01903.pdf
Code: not released o(╥﹏╥)o
Tips:
Long-tail distribution: a small number of (head) classes account for most of the data, while the majority of (tail) classes have only a few samples each.
Word embedding: a mapping (or function) that produces a representation of each word in a new vector space.
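As a quick illustration of word embeddings, pretrained Word2Vec vectors can be looked up with gensim (a sketch; the model file name below is a placeholder):

```python
from gensim.models import KeyedVectors

# Load pretrained Word2Vec vectors (the file path is a placeholder).
kv = KeyedVectors.load_word2vec_format('word2vec-300d.bin', binary=True)
vec = kv['bicycle']                        # 300-d vector for "bicycle"
print(kv.most_similar('bicycle', topn=3))  # semantically related words
```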
Abstract
Few-shot object detection is an imperative and long-lasting problem due to the inherent long-tail distribution of real-world data. Its performance is largely affected by the data scarcity of novel classes. But the semantic relation between the novel classes and the base classes is constant regardless of the data availability. In this work, we investigate utilizing this semantic relation together with the visual information and introduce explicit relation reasoning into the learning of novel object detection. Specifically, we represent each class concept by a semantic embedding learned from a large corpus of text. The detector is trained to project the image representations of objects into this embedding space. We also identify the problems of trivially using the raw embeddings with a heuristic knowledge graph and propose to augment the embeddings with a dynamic relation graph. As a result, our few-shot detector, termed SRR-FSD, is robust and stable to the variation of shots of novel objects. Experiments show that SRR-FSD can achieve competitive results at higher shots, and more importantly, a significantly better performance given both lower explicit and implicit shots. The benchmark protocol with implicit shots removed from the pretrained classification dataset can serve as a more realistic setting for future research.
1. Introduction
Deep learning algorithms usually require a large amount of annotated data to achieve superior performance. To acquire enough annotated data, one common way is to collect abundant samples from the real world and pay annotators to generate ground-truth labels. However, even if all the data samples are well annotated based on our requirements, we still face the problem of few-shot learning. Because the long-tail distribution is an inherent characteristic of the real world, there always exist some rare cases that have just a few samples available, such as rare animals and uncommon road conditions. In other words, we are unable to alleviate the situation of scarce cases by simply spending more money on annotation, even if big data is accessible. Therefore, the study of few-shot learning is an imperative and long-lasting task.
Recently, efforts have been put into the study of few-shot object detection (FSOD) [5, 20, 11, 19, 44, 41, 14, 46, 39, 42, 43]. In FSOD, there are base classes in which sufficient objects are annotated with bounding boxes and novel classes in which very few labeled objects are available. The novel class set does not share common classes with the base class set. The few-shot detectors are expected to learn from limited data in novel classes with the aid of abundant data in base classes and to be able to detect all novel objects in a held-out testing set. To achieve this, most recent few-shot detection methods adopt the ideas from meta-learning and metric learning for few-shot recognition and apply them to conventional detection frameworks, e.g. Faster R-CNN [35], YOLO [34].
Although recent FSOD methods have improved the baseline considerably, data scarcity is still a bottleneck that hurts the detector’s generalization from a few samples. In other words, the performance is very sensitive to the number of both explicit and implicit shots and drops drastically as data becomes limited. The explicit shots refer to the available labeled objects from the novel classes. For example, the 1-shot performance of some FSOD methods is less than half of the 5-shot or 10-shot performance, as shown in Figure 1. In terms of implicit shots, initializing the backbone network with a model pretrained on a large-scale image classification dataset is a common practice for training an object detector. However, the classification dataset contains many implicit shots of object classes overlapped with the novel classes. So the detector can have early access to novel classes and encode their knowledge in the parameters of the backbone. Removing those implicit shots from the pretrained dataset also has a negative impact on the performance as shown in Figure 1. The variation of explicit and implicit shots could potentially lead to system failure when dealing with extreme cases in the real world.
Figure 1. FSOD performance (mAP50) on VOC [13] Novel Set 1 at different shot numbers. Solid line (original) means the pretrained model used for initializing the detector backbone is trained on the original ImageNet [10]. Dashed line (rm-nov) means classes in Novel Set 1 are removed from the ImageNet for the pretrained backbone model. Our SRR-FSD is more stable to the variation of explicit shots (x-axis) and implicit shots (original vs. rm-nov).
We believe the reason for shot sensitivity is due to exclusive dependence on the visual information. Novel objects are learned through images only and the learning is independent between classes. As a result, visual information becomes limited as image data becomes scarce. However, one thing remains constant regardless of the availability of visual information, i.e. the semantic relation between base and novel classes. For example in Figure 2, if we have the prior knowledge that the novel class “bicycle” looks similar to “motorbike”, can have interaction with “person”, and can carry a “bottle”, it would be easier to learn the concept “bicycle” than solely using a few images. Such explicit relation reasoning is even more crucial when visual information is hard to access [40].
Figure 2. Key insight: the semantic relation between base and novel classes is constant regardless of the data availability of novel classes, which can aid the learning together with visual information.
So how can we introduce semantic relation to few-shot detection? In natural language processing, semantic concepts are represented by word embeddings [27, 31] from language models, which have been used in zero-shot learning methods [40, 1]. And explicit relationships are represented by knowledge graphs [28, 4], which are adopted by some zero-shot or few-shot recognition algorithms [40, 30]. However, these techniques are rarely explored in the FSOD task. Also, directly applying them to few-shot detectors leads to non-trivial practical problems, i.e. the domain gap between vision and language, and the heuristic definition of the knowledge graph for classes in FSOD datasets (see Sections 3.2 and 3.3 for details).
In this work, we explore the semantic relation for FSOD. We propose a Semantic Relation Reasoning Few-Shot Detector (SRR-FSD), which learns novel objects from both the visual information and the semantic relation in an end-to-end style. Specifically, we construct a semantic space using the word embeddings. Guided by the word embeddings of the classes, the detector is trained to project the objects from the visual space to the semantic space and to align their image representations with the corresponding class embeddings. To address the aforementioned problems, we propose to learn a dynamic relation graph driven by the image data instead of pre-defining one based on heuristics. Then the learned graph is used to perform relation reasoning and augment the raw embeddings for a reduced domain gap.
With the help of the semantic relation reasoning, our SRR-FSD demonstrates the shot-stable property in two aspects, see the red solid and dashed lines in Figure 1. In the common few-shot settings (solid lines), SRR-FSD achieves competitive performance at higher shots and significantly better performance at lower shots compared to state-of-the-art few-shot detectors. In a more realistic setting (dashed lines) where implicit shots of novel concepts are removed from the classification dataset for the pretrained model, SRR-FSD steadily maintains the performance while some previous methods have results degraded by a large margin due to the loss of implicit shots. We hope the suggested realistic setting can serve as a new benchmark protocol for future research.
We summarize our contributions as follows:
• To our knowledge, our work is the first to investigate semantic relation reasoning for the few-shot detection task and show its potential to improve a strong baseline.
• Our SRR-FSD achieves stable performance w.r.t. the shot variation, outperforming state-of-the-art FSOD methods under several existing settings, especially when the novel class data is extremely limited.
• We suggest a more realistic FSOD setting in which implicit shots of novel classes are removed from the classification dataset for the pretrained model, and show that our SRR-FSD can maintain a more steady performance compared to previous methods if using the new pretrained model.
2. Related Work
Object Detection Object detection is a fundamental computer vision task, serving as a necessary step for various downstream instance-based understanding. Modern CNN-based detectors can be roughly divided into two categories. One is single-stage detectors such as YOLO [34], SSD [26], RetinaNet [24], and FreeAnchor [47], which directly predict the class confidence scores and the bounding box coordinates over a dense grid. The other is multi-stage detectors such as Faster R-CNN [35], R-FCN [9], FPN [23], Cascade R-CNN [2], and Libra R-CNN [29], which predict class-agnostic regions of interest and refine those region proposals one or multiple times. All these methods rely on pre-defined anchor boxes to have an initial estimation of the size and aspect ratio of the objects. Recently, anchor-free detectors eliminate the performance-sensitive hyperparameters of the anchor design. Some of them detect the key points of bounding boxes [22, 48, 12]. Others encode and decode the bounding boxes as anchor points and point-to-boundary distances [38, 50, 36, 45, 49]. DETR [3] reformulates object detection as a direct set prediction problem and solves it with transformers. However, these detectors are trained with full supervision where each class has abundant annotated object instances.
Few-Shot Detection Recently, there have been works focusing on solving the detection problem in the limited data scenario. LSTD [5] proposes the transfer knowledge regularization and background depression regularization to promote the knowledge transfer from the source domain to the target domain. [11] proposes to iterate between model training and high-confidence sample selection. RepMet [20] adopts a distance metric learning classifier into the RoI classification head. FSRW [19] and Meta R-CNN [44] predict per-class attentive vectors to reweight the feature maps of the corresponding classes. MetaDet [41] leverages meta-level knowledge about model parameter generation for category-specific components of novel classes. In [14], the similarity between the few-shot support set and query set is explored to detect novel objects. Context-Transformer [46] relies on discriminative context clues to reduce object confusion. TFA [39] only fine-tunes the last few layers of the detector. Two very recent papers are MPSR [42] and FSDetView [43]. MPSR develops an auxiliary branch to generate multi-scale positive samples as object pyramids and to refine the prediction at various scales. FSDetView proposes a joint feature embedding module to share the features from base classes. However, all these methods depend purely on visual information and suffer from shot variation.
Semantic Reasoning in Vision Tasks Semantic word embeddings have been used in zero-shot learning tasks to learn a mapping from the visual feature space to the semantic space, such as zero-shot recognition [40] and zero-shot object detection [1, 32]. In [7], semantic embeddings are used as the ground-truth of the encoder TriNet to guide the feature augmentation. In [15], semantic embeddings guide the feature synthesis for unseen classes by perturbing the seen feature with the projected difference between a seen class embedding and an unseen class embedding. In zero-shot or few-shot recognition [40, 30], word embeddings are often combined with knowledge graphs to perform relation reasoning via the graph convolution operation [21]. Knowledge graphs are usually defined based on heuristics from databases of common sense knowledge rules [28, 4]. [8] proposed a knowledge graph based on object co-occurrence for the multi-label recognition task. To our knowledge, the use of word embeddings and knowledge graphs is rarely explored in the FSOD task. Any-Shot Detector (ASD) [33] is the only work that uses word embeddings for the FSOD task. But ASD focuses more on zero-shot detection and it does not consider the explicit relation reasoning between classes because each word embedding is treated independently.
3. Semantic Relation Reasoning Few-Shot Detector
In this section, we first briefly introduce the preliminaries for few-shot object detection including the problem setup and the general training pipelines. Then based on Faster R-CNN [35], we build our SRR-FSD by integrating semantic relation with the visual information and allowing it to perform relation reasoning in the semantic space. We also discuss the problems of trivially using the raw word embeddings and the predefined knowledge graphs. Finally, we introduce the two-phase training processes. An overview of our SRR-FSD is illustrated in Figure 3.
Figure 3. Overview of the SRR-FSD. A semantic space is built from the word embeddings of all corresponding classes in the dataset and is augmented through a relation reasoning module. Visual features are learned to be projected into the augmented space. “N”: dot product. “FC”: fully-connected layer. “P”: learnable projection matrix.
3.1. FSOD Preliminaries
The conventional object detection problem has a base class set Cb in which there are many instances, and a base dataset Db with abundant images. Db consists of a set of annotated images {(xi, yi)} where xi is the image and yi is the annotation of labels from Cb and bounding boxes for objects in xi. For the few-shot object detection (FSOD) problem, in addition to Cb and Db it also has a novel class set Cn and a novel dataset Dn, with Cb ∩ Cn = ∅. In Dn, objects have labels belonging to Cn and the number of objects for each class is k for k-shot detection. A few-shot detector is expected to learn from Db and to quickly generalize to Dn with a small k such that it can detect all objects in a held-out testing set with object classes in Cb ∪ Cn. We assume all classes in Cb ∪ Cn have semantically meaningful names so the corresponding semantic embeddings can be retrieved.
A typical few-shot detector has two training phases. The first one is the base training phase where the detector is trained on Db similarly to conventional object detectors. Then in the second phase, it is further fine-tuned on the union of Db and Dn. To avoid the dominance of objects from Db, a small subset is sampled from Db such that the training set is balanced with respect to the number of objects per class. As the total number of classes is increased by the size of Cn in the second phase, more class-specific parameters are inserted into the detector and trained to be responsible for the detection of novel objects. The class-specific parameters are usually in the box classification and localization layers at the very end of the network.
3.2. Semantic Space Projection
Our few-shot detector is built on top of Faster R-CNN [35], a popular two-stage general object detector. In the second stage of Faster R-CNN, a feature vector is extracted for each region proposal and forwarded to a classification subnet and a regression subnet. In the classification subnet, the feature vector is transformed into a d-dimensional vector v ∈ R^d through fully-connected layers. Then v is multiplied by a learnable weight matrix W ∈ R^{N×d} to output a probability distribution as in Eq. (1).
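Eq. (1) itself is not reproduced here; from the definitions above it is presumably the standard softmax classifier:

$$p = \mathrm{softmax}(W v + b) \qquad (1)$$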
where N is the number of classes and b ∈ R^N is a learnable bias vector. Cross-entropy loss is used during training.
To learn objects from both the visual information and the semantic relation, we first construct a semantic space and project the visual feature v into this semantic space. Specifically, we represent the semantic space using a set of d_e-dimensional word embeddings [27] corresponding to the N object classes (including the background class). And the detector is trained to learn a linear projection P in the classification subnet (see Figure 3) such that v is expected to align with its class’s word embedding after projection. Mathematically, the prediction of the probability distribution turns into Eq. (2) from Eq. (1).
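Based on that description, Eq. (2) presumably replaces the learned weights W with the fixed embedding matrix W_e ∈ R^{N×d_e} and a learnable projection P ∈ R^{d_e×d}:

$$p = \mathrm{softmax}(W_e P v + b) \qquad (2)$$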
During training, We is fixed and the learnable variable is P. A benefit is that generalization to novel objects involves no new parameters in P: we can simply expand We with embeddings of the novel classes. We still keep the bias b to model the category imbalance in the detection dataset.
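A minimal PyTorch sketch of this projection classifier (the feature dimension and names are assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class SemanticProjectionHead(nn.Module):
    """Sketch of Eq. (2): project RoI features into the semantic space.
    W_e is frozen; only the projection P and the bias b are learned."""
    def __init__(self, word_embeddings, feat_dim=1024):
        super().__init__()
        n, d_e = word_embeddings.shape
        self.register_buffer('we', word_embeddings)    # fixed W_e, (N, d_e)
        self.P = nn.Linear(feat_dim, d_e, bias=False)  # learnable projection
        self.b = nn.Parameter(torch.zeros(n))          # category-imbalance bias

    def forward(self, v):                         # v: (B, feat_dim) RoI features
        return self.P(v) @ self.we.t() + self.b   # (B, N) logits

# Generalizing to novel classes adds no parameters to P: just pass an
# expanded word_embeddings matrix that appends the novel class embeddings.
```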
Domain gap between vision and language. We encodes the knowledge of semantic concepts from natural language. While this is applicable in zero-shot learning, it introduces the bias of the domain gap between vision and language to the FSOD task. Unlike zero-shot learning, where unseen classes have no support from images, the few-shot detector can rely on both the images and the embeddings to learn the concept of novel objects. When there are very few images to rely on, the knowledge from embeddings can guide the detector towards a decent solution. But when more images are available, the knowledge from embeddings may be misleading due to the domain gap, resulting in a suboptimal solution. Therefore, we need to augment the semantic embeddings to reduce the domain gap. Some previous works like ASD [33] apply a trainable transformation to each word embedding independently. But we find that leveraging the explicit relationships between classes is more effective for embedding augmentation, leading to the dynamic relation graph proposed in Section 3.3.
3.3. Relation Reasoning
The semantic space projection learns to align the concepts from the visual space with the semantic space. But it still treats each class independently and there is no knowledge propagation among classes. Therefore, we further introduce a knowledge graph to model their relationships. The knowledge graph G is an N×N adjacency matrix representing the connection strength for every pair of neighboring classes. G is involved in classification via the graph convolution operation [21]. Mathematically, the updated probability prediction is shown in Eq. (3).
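Following the pattern of Eq. (2), Eq. (3) presumably propagates knowledge among classes by left-multiplying the embeddings with the graph:

$$p = \mathrm{softmax}(G W_e P v + b) \qquad (3)$$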
The heuristic definition of the knowledge graph. In zero-shot or few-shot recognition algorithms, the knowledge graph G is predefined based on heuristics. It is usually constructed from a database of common sense knowledge rules by sampling a sub-graph through the rule paths such that semantically related classes have strong connections. For example, classes from the ImageNet dataset [10] have a knowledge graph sampled from the WordNet [28]. However, classes in FSOD datasets are not highly semantically related, nor do they form a hierarchical structure like the ImageNet classes. The only applicable heuristics we found are based on object co-occurrence from [8]. Although the statistics of the co-occurrence are straightforward to compute, co-occurrence is not necessarily equivalent to semantic relation.
Instead of predefining a knowledge graph based on heuristics, we propose to learn a dynamic relation graph driven by the data to model the relation reasoning between classes. The data-driven graph is also responsible for reducing the domain gap between vision and language because it is trained with image inputs. Inspired by the concept of the transformer, we implement the dynamic graph with the self-attention architecture [37] as shown in Figure 4. The original word embeddings We are transformed by three linear layers f, g, h, and a self-attention matrix is computed from the outputs of f and g. The self-attention matrix is multiplied with the output of h, followed by another linear layer l. A residual connection [16] adds the output of l with the original We. Another advantage of learning a dynamic graph is that it can easily adapt to new coming classes, because the graph is not fixed and is generated on the fly from the word embeddings. We do not need to redefine a new graph and retrain the detector from the beginning. We can simply insert corresponding embeddings of new classes and fine-tune the detector.
Figure 4. Network architecture of the relation reasoning module for learning the relation graph. “N”: dot product. “L”: element-wise addition.
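A minimal PyTorch sketch of this module, using the 300-to-32 dimension reduction mentioned in Section 4.1 (shapes and defaults are assumptions):

```python
import torch
import torch.nn as nn

class RelationReasoning(nn.Module):
    """Sketch of Figure 4: a dynamic relation graph over class embeddings,
    implemented as self-attention with a residual connection."""
    def __init__(self, embed_dim=300, hidden_dim=32):
        super().__init__()
        self.f = nn.Linear(embed_dim, hidden_dim)  # query branch
        self.g = nn.Linear(embed_dim, hidden_dim)  # key branch
        self.h = nn.Linear(embed_dim, hidden_dim)  # value branch
        self.l = nn.Linear(hidden_dim, embed_dim)  # project back to embed_dim

    def forward(self, we):                   # we: (N, embed_dim)
        graph = self.f(we) @ self.g(we).t()  # (N, N) dynamic relation graph
        graph = graph.softmax(dim=-1)
        out = self.l(graph @ self.h(we))     # propagate knowledge, map back
        return we + out                      # residual: augmented embeddings

# Usage: augment the embeddings of all N classes (N=21 is illustrative).
we = torch.randn(21, 300)
we_aug = RelationReasoning()(we)             # (21, 300)
```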
3.4. Decoupled Fine-tuning
In the second fine-tuning phase, we only unfreeze the last few layers of our SRR-FSD similar to TFA [39]. For the classification subnet, we fine-tune the parameters in the relation reasoning module and the projection matrix P. For the localization subnet, it is not dependent on the word embeddings but it shares features with the classification subnet. We find that the learning of localization on novel objects can interfere with the classification subnet via the shared features, leading to many false positives. Decoupling the shared fully-connected layers between the two subnets can effectively make each subnet learn better features for its task. In other words, the classification subnet and the localization subnet have individual fully-connected layers and they are fine-tuned independently.
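One possible realization of the decoupling, sketched below: each subnet gets its own FC stack instead of one shared stack (layer sizes are assumptions):

```python
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Sketch of decoupled fine-tuning: classification and localization use
    separate fully-connected layers, so localization of novel objects cannot
    interfere with classification through shared features."""
    def __init__(self, in_dim=12544, fc_dim=1024, num_classes=21):
        super().__init__()
        self.cls_fcs = nn.Sequential(nn.Linear(in_dim, fc_dim), nn.ReLU(),
                                     nn.Linear(fc_dim, fc_dim), nn.ReLU())
        self.reg_fcs = nn.Sequential(nn.Linear(in_dim, fc_dim), nn.ReLU(),
                                     nn.Linear(fc_dim, fc_dim), nn.ReLU())
        self.bbox_pred = nn.Linear(fc_dim, num_classes * 4)

    def forward(self, roi_feat):             # roi_feat: (B, in_dim), flattened
        cls_feat = self.cls_fcs(roi_feat)    # feeds the semantic projection
        reg_feat = self.reg_fcs(roi_feat)    # feeds box regression only
        return cls_feat, self.bbox_pred(reg_feat)
```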
4. Experiments
4.1. Implementation Details
Our SRR-FSD is implemented based on Faster R-CNN [35] with ResNet-101 [16] and Feature Pyramid Network [23] as the backbone using the MMDetection [6] framework. All models are trained with Stochastic Gradient Descent (SGD) and a batch size of 16. For the word embeddings, we use the L2-normalized 300-dimensional Word2Vec [27] vectors from the language model trained on large unannotated texts like Wikipedia. In the relation reasoning module, we reduce the dimension of word embeddings to 32 which is empirically selected. In the first base training phase, we set the learning rate, the momentum, and the weight decay to 0.02, 0.9, and 0.0001, respectively. In the second fine-tuning phase, we reduce the learning rate to 0.001 unless otherwise mentioned. The input image is sampled by first randomly choosing between the base set and the novel set with a 50% probability and then randomly selecting an image from the chosen set.
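A sketch of the described image sampling scheme (the dataset handles are placeholders):

```python
import random

def sample_image(base_set, novel_set):
    """First choose the base or novel set with 50% probability each,
    then draw an image uniformly from the chosen set."""
    chosen = base_set if random.random() < 0.5 else novel_set
    return random.choice(chosen)
```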
4.2. Existing Settings
We follow the existing settings in previous FSOD methods [19, 41, 44, 39] to evaluate our SRR-FSD on the VOC [13] and COCO [25] datasets. For fair comparison and reduced randomness, we use the same data splits and a fixed list of novel samples provided by [19].
VOC The 07 and 12 train/val sets are used for training and the 07 test set is for testing. Out of its 20 object classes, 5 classes are selected as novel and the remaining 15 are base classes, with 3 different base/novel splits. The novel classes each have k annotated objects, where k equals 1, 2, 3, 5, 10. In the first base training phase, our SRR-FSD is trained for 18 epochs with the learning rate multiplied by 0.1 at the 12th and 15th epochs. In the second fine-tuning phase, we train for 500 × |Dn| steps, where |Dn| is the number of images in the k-shot novel dataset.
We report the mAP50 of the novel classes on VOC with 3 splits in Table 1. In all the different base/novel splits, our SRR-FSD achieves a more shot-stable performance. At higher shots like 5-shot and 10-shot, our performance is competitive compared to previous state-of-the-art methods. Under more challenging conditions with fewer than 5 shots, our approach can outperform the second-best by a large margin (up to 10+ mAP). Compared to ASD [33], which only reports results of 3-shot and 5-shot in the Novel Set 1, ours is 24.2 and 6.0 better respectively in mAP. We do not include ASD in Table 1 because its paper does not provide the complete results on VOC.
Learning without forgetting is another merit of our SRR-FSD. After generalization to novel objects, the performance on the base objects does not drop at all, as shown in Table 2. Both the base AP and novel AP of our SRR-FSD compare favorably to previous methods based on the same Faster R-CNN with ResNet-101. The base AP even increases slightly, probably due to the semantic relation reasoning from limited novel objects to base objects.
COCO The minival set with 5000 images is used for testing and the remaining images in the train/val sets are for training. Out of the 80 classes, the 20 overlapping with VOC are the novel classes with k = 10, 30 shots per class, and the remaining 60 classes are base classes. We train the SRR-FSD on the base dataset for 12 epochs using the same setting as MMDetection [6] and fine-tune it for a fixed number of 10 × |Db| steps, where |Db| is the number of images in the base dataset. Unlike VOC, the base dataset in COCO contains unlabeled novel objects, so the region proposal network (RPN) treats them as the background. To avoid omitting novel objects in the fine-tuning phase, we unfreeze the RPN and the following layers. Table 3 presents the COCO-style averaged AP. Again we consistently outperform previous methods including FSRW [19], MetaDet [41], Meta R-CNN [44], TFA [39], and MPSR [42].
COCO to VOC For the cross-domain FSOD setting, we follow [19, 41] to use the same base dataset with 60 classes as in the previous COCO within-domain setting. The novel dataset consists of 10 samples for each of the 20 classes from the VOC dataset. The learning schedule is the same as the previous COCO within-domain setting except the learning rate is 0.005. Figure 5 shows that our SRR-FSD achieves the best performance with a healthy 44.5 mAP, indicating better generalization ability in cross-domain situations.
4.3. A More Realistic Setting
The training of the few-shot detector usually involves initializing the backbone network with a model pretrained on large-scale object classification datasets such as ImageNet [10]. The set of object classes in ImageNet, i.e. C0, is highly overlapped with the novel class set Cn in the existing settings. This means that the pretrained model can get early access to large amounts of object samples, i.e. implicit shots, from novel classes and encode their knowledge in the parameters before it is further trained for the detection task. Even though the pretrained model is optimized for the recognition task, the extracted features still have a big impact on the detection of novel objects (see Figure 1). However, some rare classes may have data so limited or so valuable in the real world that pretraining a classification network on them is not realistic.
Therefore, we suggest a more realistic setting for FSOD, which extends the existing settings. In addition to Cb ∩ Cn = ∅, we also require that C0 ∩ Cn = ∅. To achieve this, we systematically and hierarchically remove novel classes from C0. For each class in Cn, we find its corresponding synset in ImageNet and obtain its full hyponym (the synsets of the whole subtree starting from that synset) using the ImageNet API. The images of this synset and its full hyponym are removed from the pretrained dataset. And the classification model is trained on a dataset with no novel objects. We provide the list of WordNet IDs for each novel class to be removed in Appendix A.
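The removal can be scripted, for example with NLTK's WordNet interface instead of the ImageNet API (a sketch; 'bicycle.n.01' is illustrative and the NLTK WordNet corpus must be installed):

```python
from nltk.corpus import wordnet as wn

def wnids_to_remove(synset_name):
    """Collect the WordNet IDs of a novel class and its full hyponym
    subtree, mirroring the removal procedure described above."""
    root = wn.synset(synset_name)
    subtree = [root] + list(root.closure(lambda s: s.hyponyms()))
    return {'n%08d' % s.offset() for s in subtree}  # ImageNet-style wnids

ids = wnids_to_remove('bicycle.n.01')  # synsets to drop from pretraining
```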
We notice that CoAE [18] also proposed to remove all COCO-related ImageNet classes to ensure the model does not “foresee” the unseen classes. As a result, a total of 275 classes are removed from ImageNet including both the base and novel classes in VOC [13], which correspond to more than 300k images. We think the loss of this much data may lead to a worse pretrained model in general, so the pretrained model may not be able to extract features strong enough for downstream vision tasks compared with the model trained on the full ImageNet. Our setting, on the other hand, tries to alleviate this effect as much as possible by only removing the novel classes in VOC Novel Set 1, 2, and 3 respectively, which correspond to an average of 50 classes from ImageNet.
Under the new realistic setting, we re-evaluate previous methods using their official source code and report the performance on the VOC dataset in Table 4. Our SRR-FSD demonstrates superior performance to other methods under most conditions, especially in the challenging lower-shot scenarios. More importantly, our SRR-FSD is less affected by the loss of implicit shots. Compared with the results in Table 1, our performance is maintained more stably when novel objects are only available in the novel dataset.
4.4. Ablation Study
In this section, we study the contribution of each component. Experiments are conducted on the VOC dataset. Our baseline is the Faster R-CNN [35] with ResNet-101 [16] and FPN [23]. We gradually apply the Semantic Space Projection (SSP 3.2), Relation Reasoning (RR 3.3) and Decoupled Fine-tuning (DF 3.4) to the baseline and report the performance in Table 5. We also compare three different ways of augmenting the raw word embeddings in Table 6, including the trainable transformation from ASD [33], the heuristic knowledge graph from [8], and the dynamic graph from our proposed relation reasoning module.
Semantic space projection guides shot-stable learning. The baseline Faster R-CNN can already achieve satisfying results at 5-shot and 10-shot. But at 1-shot and 2-shot, performance starts to fall apart due to the exclusive dependence on images. The semantic space projection, on the other hand, makes the learning more stable to the variation of shot numbers (see the 1st and 2nd entries in Table 5). The space projection guided by the semantic embeddings is learned well enough in the base training phase so it can be quickly adapted to novel classes with a few instances. We observe a major boost at lower shot conditions compared to the baseline, i.e. 7.9 mAP and 2.4 mAP gains at 1-shot and 2-shot respectively. However, the raw semantic embeddings limit the performance at higher shot conditions: the performance at 5-shot and 10-shot drops below the baseline. This verifies our argument about the domain gap between vision and language. At lower shots, there is not much visual information to rely on, so the language information can guide the detector to a decent solution. But when more images are available, the visual information becomes more precise and the language information starts to be misleading. Therefore, we propose to refine the word embeddings for a reduced domain gap.
Relation reasoning promotes adaptive knowledge propagation. The relation reasoning module explicitly learns a relation graph that builds direct connections between base classes and novel classes, so the detector can learn the novel objects using the knowledge of base objects besides the visual information. Additionally, the relation reasoning module also functions as a refinement to the raw word embeddings with a data-driven relation graph. Since the relation graph is updated with image inputs, the refinement tends to adapt the word embeddings to the vision domain. Results in Table 5 (2nd and 3rd entries) confirm that applying relation reasoning improves the detection accuracy of novel objects under different shot conditions. We also compare it with two other ways of refining the raw word embeddings in Table 6. One is the trainable transformation (TT) from ASD [33] where word embeddings are updated with a trainable metric and a word vocabulary. Note that this transformation is applied to each embedding independently and does not consider the explicit relationships between them. The other is the heuristic knowledge graph (HKG) defined based on the co-occurrence of objects from [8]. It turns out that both the trainable transformation and the predefined heuristic knowledge graph are not as effective as the dynamic relation graph in the relation reasoning module. The effect of the trainable transformation is similar to unfreezing more parameters of the last few layers during fine-tuning as shown in Appendix E, which leads to overfitting when the shot is low. And the predefined knowledge graph is fixed during training and thus cannot adapt to the inputs. In other words, the dynamic relation graph is better because it can not only perform explicit relation reasoning but also augment the raw embeddings for a reduced domain gap between vision and language.
Decoupled fine-tuning reduces false positives. We analyze the false positives generated by our SRR-FSD with and without decoupled fine-tuning (DF) using the detector diagnosing tool [17]. The effect of DF on reducing the false positives in novel classes is visualized in Figure 6. It shows that most of the false positives are due to misclassification into similar categories. With DF, the classification subnet can be trained independently from the localization subnet to learn better features specifically for classification.
5. Conclusion
In this work, we propose semantic relation reasoning for few-shot object detection. The key insight is to explicitly integrate the semantic relation between base and novel classes with the available visual information, which can help to learn the novel concepts better especially when the novel class data is extremely limited. We apply the semantic relation reasoning to the standard two-stage Faster R-CNN and demonstrate robust few-shot performance against the variation of shot numbers. Compared to previous methods, our approach achieves state-of-the-art results on several few-shot detection settings, as well as on a more realistic setting where novel concepts encoded in the pretrained backbone model are eliminated. We hope this realistic setting can be a better evaluation protocol for future few-shot detectors. Last but not least, the key components of our approach, i.e. semantic space projection and relation reasoning, can be directly applied to the classification subnet of other few-shot detectors.
Appendix
A. Removing Novel Classes from ImageNet
We propose a realistic setting for evaluating few-shot object detection methods, where novel classes are completely removed from the classification dataset used for training a model to initialize the backbone network in the detector. This guarantees that the object concepts of novel classes will not be encoded in the pretrained model before training the few-shot detector, because the novel class data is so rare in the real world that pretraining a classifier on it is not realistic.
ImageNet [10] is widely used for pretraining the classification model. It has 1000 classes organized according to the WordNet hierarchy, and each class has over 1000 images for training. We systematically and hierarchically remove novel classes by finding each synset and its corresponding full hyponym (the synsets of the whole sub-tree starting from that synset) using the ImageNet API. So each novel class may contain multiple ImageNet classes.
For the novel classes in the COCO dataset [25], they are very common in the real world. Removing them from ImageNet does not make as much sense as removing data-scarce classes. So we suggest that for large-scale datasets like COCO, one should follow the long-tail distribution of the class frequency and select the data-scarce classes on the distribution tail as the novel classes.
B. Visualization of Relation Reasoning
Figure 7 visualizes the correlation maps between the semantic embeddings of novel and base classes before and after the relation reasoning, as well as the difference between the two maps. Nearly all the correlations are increased slightly, indicating better knowledge propagation between the two groups of classes. Additionally, it is interesting to see that some novel classes get more correlated than others, e.g. “sofa” with “bottle” and “sofa” with “table”, probably because “sofa” can often be seen together with “bottle” and “table” in the living room but the original semantic embeddings cannot capture these relationship.
C. Using Other Word Embeddings
In the semantic space projection, we represent the semantic space using word embeddings from the Word2Vec [27]. We could simply set the We to be random vectors. Additionally, there are other language models for obtaining vector representations of words, such as the GloVe [31]. The GloVe is trained with aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. We also explored using word embeddings of different dimensions from the GloVe in the semantic space projection step and compared with the results by the Word2Vec. Performance on the VOC Novel Set 1 is reported in Table 7. The Word2Vec can provide better representations than the GloVe of both 300 dimensions and 200 dimensions. The performance of random embeddings is significantly worse than the meaningful Word2Vec and GloVe, which again verifies the importance of semantic information for shot-stable FSOD.
D. Reduced Dimension in Relation Reasoning
In the relation reasoning module, the dimension of word embeddings is reduced by linear layers before computing the attention map, which saves computational time. We empirically test different dimensions and select the one with the best performance, i.e. when the dimension is 32. But other choices are just slightly worse. Table 8 reports the results on VOC dataset under different dimensions. All the experiments are following the same setting as in the main paper. The only exception is that we use ResNet-50 [16] to reduce the computational cost of tuning hyperparameters.
E. Finetuning More Parameters
Similar to TFA [39], we have a finetuning stage to make the detector generalize to novel classes. For the classification subnet, we finetune the parameters in the relation reasoning module and the projection matrix while all the parameters in the previous layers are frozen. Some may argue that the improvement of our SRR-FSD over the baseline is due to more parameters finetuned in the relation reasoning module compared to the Faster R-CNN [35] baseline. But we show in Table 9 that finetuning more parameters does not necessarily lead to better results. We take the TFA model, which is essentially a Faster R-CNN finetuned with only the last layer trainable, and gradually unfreeze the previous layers. It turns out that more parameters involved in finetuning do not change the results substantially and that too many parameters lead to severe overfitting.
F. Complete Results on VOC
In Table 10, we present the complete results on the VOC [13] dataset as in FSRW [19] and Meta R-CNN [44]. We also include the very recent MPSR [42] for comparison. MPSR develops an auxiliary branch to generate multi-scale positive samples as object pyramids and to refine the prediction at various scales. Note that MPSR improves its baseline by a considerable margin, but its research direction is orthogonal and complementary to ours because it is still exclusively dependent on visual information. Therefore, our approach combining visual information and semantic relation reasoning can achieve superior performance under extremely low-shot (e.g. 1-shot, 2-shot) conditions.
G. Interpretation of the Dynamic Relation Graph
In the relation reasoning module, we propose to learn a dynamic relation graph driven by the data, which is conceptually different from the predefined fixed knowledge graphs used in [40, 8, 30]. We implement the dynamic graph with the self-attention architecture [37]. Although it is in the form of a feedforward network, it can also be interpreted as a computation related to the knowledge graph. If we denote the transformations in the linear layers f, g, h, l as Tf, Tg, Th, Tl respectively, we can formulate the relation reasoning in Eq. (4)
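Based on the interpretation that follows, Eq. (4) presumably takes the form

$$W'_e = \delta\big(T_f(W_e)\, T_g(W_e)^\top\big)\, T_h(W_e)\, T_l + W_e \qquad (4)$$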
where W'e is the matrix of augmented word embeddings after the relation reasoning, which will be used as the weights to compute classification scores, and δ is the softmax function operated on the last dimension of the input matrix. The item δ(Tf(We)Tg(We)^T) can be interpreted as an N × N dynamic knowledge graph in which the learnable parameters are Tf and Tg. And it is involved in the computation of the classification scores via the graph convolution operation [21], which connects the N word embeddings in We to allow knowledge propagation among them. The item Th(We)Tl can be viewed as a learnable transformation applied to each embedding independently.