Paper link: https://arxiv.org/pdf/1710.09829.pdf
Abstract
A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or an object part. We use the length of the activity vector to represent the probability that the entity exists and its orientation to represent the instantiation parameters. Active capsules at one level make predictions, via transformation matrices, for the instantiation parameters of higher-level capsules. When multiple predictions agree, a higher level capsule becomes active. We show that a discriminatively trained, multi-layer capsule system achieves state-of-the-art performance on MNIST and is considerably better than a convolutional net at recognizing highly overlapping digits. To achieve these results we use an iterative routing-by-agreement mechanism: A lower-level capsule prefers to send its output to higher level capsules whose activity vectors have a big scalar product with the prediction coming from the lower-level capsule.
Abstract (translation)
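A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity, such as an object or a part of an object. The length of the activity vector represents the probability that the entity exists, and its orientation represents the instantiation parameters. Active capsules at one level use transformation matrices to predict the instantiation parameters of higher-level capsules. When several predictions agree, the higher-level capsule becomes active. We show that a discriminatively trained, multi-layer capsule system achieves state-of-the-art performance on MNIST and is considerably better than a convolutional network at recognizing highly overlapping digits. These results are obtained with an iterative routing-by-agreement mechanism: a lower-level capsule prefers to send its output to those higher-level capsules whose activity vectors have a large scalar product with the prediction coming from that lower-level capsule.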
1 Introduction
Human vision ignores irrelevant details by using a carefully determined sequence of fixation points to ensure that only a tiny fraction of the optic array is ever processed at the highest resolution. Introspection is a poor guide to understanding how much of our knowledge of a scene comes from the sequence of fixations and how much we glean from a single fixation, but in this paper we will assume that a single fixation gives us much more than just a single identified object and its properties. We assume that our multi-layer visual system creates a parse tree-like structure on each fixation, and we ignore the issue of how these single-fixation parse trees are coordinated over multiple fixations.
1 Introduction (translation)
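Human vision ignores irrelevant details by using a carefully chosen sequence of fixation points, so that only a tiny fraction of the optic array is ever processed at the highest resolution. Introspection is a poor guide to how much of our knowledge of a scene comes from the sequence of fixations and how much from a single fixation, but this paper assumes that a single fixation yields much more than one identified object and its properties. We assume that the multi-layer visual system builds a parse-tree-like structure on each fixation, and we set aside the question of how these single-fixation parse trees are coordinated across multiple fixations.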
Parse trees are generally constructed on the fly by dynamically allocating memory. Following Hinton et al. [2000], however, we shall assume that, for a single fixation, a parse tree is carved out of a fixed multilayer neural network like a sculpture is carved from a rock. Each layer will be divided into many small groups of neurons called “capsules” (Hinton et al. [2011]) and each node in the parse tree will correspond to an active capsule. Using an iterative routing process, each active capsule will choose a capsule in the layer above to be its parent in the tree. For the higher levels of a visual system, this iterative process will be solving the problem of assigning parts to wholes.
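Parse trees are usually built on the fly by dynamically allocating memory. Following Hinton et al. [2000], however, we assume that for a single fixation the parse tree is carved out of a fixed multi-layer neural network, much as a sculpture is carved from a rock. Each layer is divided into many small groups of neurons called “capsules” (Hinton et al. [2011]), and each node in the parse tree corresponds to an active capsule. Through an iterative routing process, each active capsule chooses a capsule in the layer above as its parent in the tree. At the higher levels of a visual system, this iterative process solves the problem of assigning parts to wholes.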
The activities of the neurons within an active capsule represent the various properties of a particular entity that is present in the image. These properties can include many different types of instantiation parameter such as pose (position, size, orientation), deformation, velocity, albedo, hue, texture, etc. One very special property is the existence of the instantiated entity in the image. An obvious way to represent existence is by using a separate logistic unit whose output is the probability that the entity exists. In this paper we explore an interesting alternative which is to use the overall length of the vector of instantiation parameters to represent the existence of the entity and to force the orientation of the vector to represent the properties of the entity¹. We ensure that the length of the vector output of a capsule cannot exceed 1 by applying a non-linearity that leaves the orientation of the vector unchanged but scales down its magnitude.
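The activities of the neurons within an active capsule represent the various properties of a particular entity present in the image. These properties can include many types of instantiation parameter, such as pose (position, size, orientation), deformation, velocity, albedo, hue, texture, and so on. One very special property is whether the instantiated entity exists in the image. An obvious way to represent existence is a separate logistic unit whose output is the probability that the entity exists. This paper explores an interesting alternative: the overall length of the vector of instantiation parameters represents the existence of the entity, and the orientation of the vector is forced to represent the entity's properties. A non-linearity ensures that the length of a capsule's output vector cannot exceed 1; it scales down the vector's magnitude and leaves its orientation unchanged.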
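The length-bounding non-linearity described above can be sketched in a few lines of Python. This passage does not state a formula, so the specific form used here (scaling the vector to length |s|^2 / (1 + |s|^2)) is an assumption chosen to satisfy the two stated requirements: the output length stays below 1 and the orientation is unchanged. The names `squash`, `length`, and `eps` are illustrative only.

```python
import math


def length(v):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def squash(s, eps=1e-9):
    """Rescale s so that its length lies in [0, 1) and its direction is kept.

    Assumed form: the new length is |s|^2 / (1 + |s|^2). `eps` only guards
    against division by zero for an all-zero input.
    """
    sq_norm = sum(x * x for x in s)
    scale = sq_norm / (1.0 + sq_norm) / (math.sqrt(sq_norm) + eps)
    return [scale * x for x in s]


# A short vector is shrunk to a length close to 0 (entity probably absent).
short = squash([0.1, 0.0])

# A long vector is shrunk to a length just under 1 (entity probably present),
# and the ratio between its components, i.e. its orientation, is preserved.
long_ = squash([30.0, 40.0])

print(length(short), length(long_), long_[0] / long_[1])
```

Under this assumed form, every component is multiplied by the same positive factor, which is why the orientation survives while the length is compressed into the range of a probability.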
The fact that the output of a capsule is a vector makes it possible to use a powerful dynamic routing mechanism to ensure that the output of the capsule gets sent to an appropriate parent in the layer above. Initially, the output is routed to all possible parents but is scaled down by coupling coefficients that sum to 1. For each possible parent, the capsule computes a “prediction vector” by multiplying its own output by a weight matrix. If this prediction vector has a large scalar product with the output of a possible parent, there is top-down feedback which increases the coupling coefficient for that parent and decreases it for other parents. This increases the contribution that the capsule makes to that parent thus further increasing the scalar product of the capsule’s prediction with the parent’s output. This type of “routing-by-agreement” should be far more effective than the very primitive form of routing implemented by max-pooling, which allows neurons in one layer to ignore all but the most active feature detector in a local pool in the layer below. We demonstrate that our dynamic routing mechanism is an effective way to implement the “explaining away” that is needed for segmenting highly overlapping objects.
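Because the output of a capsule is a vector, a powerful dynamic routing mechanism can be used to ensure that the output is sent to an appropriate parent in the layer above. Initially, the output is routed to all possible parents but is scaled down by coupling coefficients that sum to 1. For each possible parent, the capsule computes a “prediction vector” by multiplying its own output by a weight matrix. If this prediction vector has a large scalar product with the output of a possible parent, top-down feedback increases the coupling coefficient for that parent and decreases it for the other parents.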
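The routing-by-agreement loop described above can be illustrated with a small, dependency-free Python sketch. Several details are assumptions rather than statements from this passage: the coupling coefficients are obtained with a softmax over per-pair logits that start at zero, the parent output is a squashed weighted sum of the prediction vectors, and three iterations are used. The prediction vectors are supplied directly instead of being computed from weight matrices, and all names (`route`, `predictions`, `num_iterations`) are hypothetical.

```python
import math


def squash(s, eps=1e-9):
    """Assumed length-bounding non-linearity: keeps direction, length < 1."""
    sq_norm = sum(x * x for x in s)
    scale = sq_norm / (1.0 + sq_norm) / (math.sqrt(sq_norm) + eps)
    return [scale * x for x in s]


def softmax(logits):
    """Turn a list of logits into positive coefficients that sum to 1."""
    m = max(logits)
    exps = [math.exp(b - m) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(predictions, num_iterations=3):
    """predictions[i][j] is the prediction vector of lower capsule i for parent j.

    Returns the parent output vectors and the final coupling coefficients.
    """
    num_lower = len(predictions)
    num_parents = len(predictions[0])
    dim = len(predictions[0][0])
    # Zero logits mean each output is initially spread evenly over all parents.
    logits = [[0.0] * num_parents for _ in range(num_lower)]
    outputs = []
    for _ in range(num_iterations):
        coupling = [softmax(row) for row in logits]
        outputs = []
        for j in range(num_parents):
            total = [0.0] * dim
            for i in range(num_lower):
                for d in range(dim):
                    total[d] += coupling[i][j] * predictions[i][j][d]
            outputs.append(squash(total))
        # Agreement step: a large scalar product raises the logit for that parent.
        for i in range(num_lower):
            for j in range(num_parents):
                agreement = sum(p * v for p, v in zip(predictions[i][j], outputs[j]))
                logits[i][j] += agreement
    coupling = [softmax(row) for row in logits]
    return outputs, coupling


# Three lower capsules, two parents. All predictions for parent 0 agree;
# the predictions for parent 1 point in different directions.
predictions = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
outputs, coupling = route(predictions)
print(outputs)
print(coupling)
```

In this toy input, the agreeing predictions reinforce one another, so parent 0 ends up with the longer output vector and every lower capsule sends most of its output there, while the coefficients of each lower capsule still sum to 1.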