How does Hinton's "capsule theory" work?


Geoffrey Hinton has been researching something he calls "capsule theory" in neural networks. What is it, and how does it work?

This paper can now be viewed at arxiv.org/abs/1710.09829: "Dynamic Routing Between Capsules" by Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton.
Danke Xie

There is a related question with newer information (November 2017): What's the main concept behind Capsule Networks?



It appears not to be published yet; the best material available online is these slides for this talk. (Several people reference an earlier talk with this link, but sadly it is broken at the time of writing this answer.)

My impression is that it's an attempt to formalize and abstract the creation of subnetworks inside a neural network. If you look at a standard neural network, layers are fully connected: every neuron in layer 1 has access to every neuron in layer 0, and is itself accessed by every neuron in layer 2. But this isn't obviously useful; one might instead have, say, n parallel stacks of layers (the 'capsules'), each of which specializes in some separate task (which may itself require more than one layer to complete successfully).

If I'm imagining its results correctly, this more sophisticated graph topology seems like something that could easily increase both the effectiveness and the interpretability of the resulting network.

The paper is now (Oct 2017) published: arxiv.org/pdf/1710.09829.pdf


To supplement the previous answer: there is a paper on this that is mostly about learning low-level capsules from raw data, but explains Hinton's conception of a capsule in its introductory section: http://www.cs.toronto.edu/~fritz/absps/transauto6.pdf

It's also worth noting that the link to the MIT talk in the answer above seems to be working again.

According to Hinton, a "capsule" is a subset of neurons within a layer that outputs both an "instantiation parameter" indicating whether an entity is present within a limited domain and a vector of "pose parameters" specifying the pose of the entity relative to a canonical version.

The parameters output by low-level capsules are converted into predictions for the pose of the entities represented by higher-level capsules, which are activated if the predictions agree and output their own parameters (the higher-level pose parameters being averages of the predictions received).
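The agreement step described above can be sketched in a few lines. This is only a toy illustration, not code from the paper: the function names, the `(x, y, angle)` pose layout and the `tolerance` threshold are all hypothetical, and agreement is measured here simply as the largest distance between any prediction and the average.

```python
import math

def mean_vector(vectors):
    """Component-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def higher_level_capsule(predictions, tolerance=0.5):
    """Return (active, pose) for a toy higher-level capsule.

    predictions: pose vectors predicted by the lower-level capsules.
    tolerance:   hypothetical threshold on how far any prediction may
                 lie from the average before we call it disagreement.
    """
    pose = mean_vector(predictions)  # higher-level pose = average of predictions
    spread = max(math.dist(p, pose) for p in predictions)
    return spread <= tolerance, pose

# Three part capsules each predict the pose (x, y, angle) of a whole face.
agreeing = [[10.0, 20.0, 0.1], [10.2, 19.9, 0.1], [9.8, 20.1, 0.1]]
scattered = [[10.0, 20.0, 0.1], [40.0, 5.0, 1.5], [-3.0, 33.0, -0.7]]

active_a, pose_a = higher_level_capsule(agreeing)
active_s, pose_s = higher_level_capsule(scattered)
print(active_a, pose_a)  # the capsule fires and reports the averaged pose
print(active_s)          # the predictions disagree, so the capsule stays off
```

The point of the sketch is that agreement in a multi-dimensional pose space is unlikely to happen by chance, which is why a tight cluster of predictions is good evidence that the higher-level entity is really there.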

Hinton speculates that this high-dimensional coincidence detection is what mini-column organization in the brain is for. His main goal seems to be replacing the max pooling used in convolutional networks, in which deeper layers lose information about pose.


Capsule networks try to reproduce, in a machine, what Hinton has observed about the human brain. The motivation is that neural networks need a better model of the spatial relationships between parts. Instead of modeling only the co-existence of parts while disregarding their relative positioning, capsule nets try to model the global relative transformations of different sub-parts along a hierarchy. This is the equivariance vs. invariance trade-off, as explained above by others.

These networks therefore have a degree of viewpoint/orientation awareness and respond differently to different orientations. This property makes them more discriminative, and it potentially lets them perform pose estimation, since the latent-space features contain interpretable, pose-specific details.

All of this is achieved by nesting a layer of units, called capsules, inside a layer, instead of stacking yet another layer onto the network. These capsules can output a vector rather than a single scalar per node.

The key contribution of the work is dynamic routing, which replaces standard max pooling with a smarter strategy. This algorithm applies mean-shift clustering on the capsule outputs to ensure that the output gets sent only to the appropriate parent in the layer above.

The authors also combine these contributions with a margin loss and a reconstruction loss, which together help the network learn the task better, and they report state-of-the-art results on MNIST.
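For readers who want to see what the margin loss looks like, here is a minimal sketch. It follows the form given in the paper as I understand it, L_k = T_k max(0, m+ - ||v_k||)^2 + lambda (1 - T_k) max(0, ||v_k|| - m-)^2, with m+ = 0.9, m- = 0.1 and lambda = 0.5; treat those constants as assumptions and check them against the paper. The function name and the example capsule lengths are made up, and the reconstruction loss is left out.

```python
def margin_loss(lengths, target, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Sum the per-class margin loss over all class capsules.

    lengths: length of each class capsule's output vector, in [0, 1].
    target:  index of the correct class.
    """
    total = 0.0
    for k, length in enumerate(lengths):
        if k == target:
            # The correct class is penalized when its capsule is shorter than m_plus.
            total += max(0.0, m_plus - length) ** 2
        else:
            # Other classes are penalized (down-weighted by lam) when longer than m_minus.
            total += lam * max(0.0, length - m_minus) ** 2
    return total

confident = margin_loss([0.05, 0.95, 0.02], target=1)  # long correct capsule, short others
wrong = margin_loss([0.80, 0.30, 0.02], target=1)      # a wrong capsule is long instead
print(confident, wrong)
```

Because each class has its own term, several class capsules can be long at once, which fits the paper's use of capsules on overlapping digits.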

The recent paper is titled Dynamic Routing Between Capsules and is available on arXiv: https://arxiv.org/pdf/1710.09829.pdf .


Based on their paper Dynamic Routing between Capsules

A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or object part. We use the length of the activity vector to represent the probability that the entity exists and its orientation to represent the instantiation parameters. Active capsules at one level make predictions, via transformation matrices, for the instantiation parameters of higher-level capsules. When multiple predictions agree, a higher-level capsule becomes active. We show that a discriminatively trained, multi-layer capsule system achieves state-of-the-art performance on MNIST and is considerably better than a convolutional net at recognizing highly overlapping digits. To achieve these results we use an iterative routing-by-agreement mechanism: A lower-level capsule prefers to send its output to higher-level capsules whose activity vectors have a big scalar product with the prediction coming from the lower-level capsule. The final version of the paper is under revision to incorporate reviewers' comments.
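To make the routing-by-agreement mechanism in the quote more concrete, here is a small pure-Python sketch. It assumes the prediction vectors (called `u_hat` here) have already been produced by the transformation matrices, and it follows my reading of the paper's procedure: softmax over routing logits, weighted sum, a "squash" nonlinearity that keeps vector length below 1, then a scalar-product update. The variable names and example numbers are my own, not the authors' code.

```python
import math

def squash(s):
    """Rescale s so its length lies in [0, 1) while its direction is kept."""
    sq = sum(x * x for x in s)
    if sq == 0.0:
        return [0.0] * len(s)
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in s]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(b - m) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def route(u_hat, iterations=3):
    """u_hat[i][j] is lower capsule i's prediction vector for higher capsule j."""
    n_low, n_high, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_high for _ in range(n_low)]  # routing logits, start neutral
    for _ in range(iterations):
        c = [softmax(row) for row in b]  # coupling coefficients, each row sums to 1
        v = []
        for j in range(n_high):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_low))
                   for d in range(dim)]
            v.append(squash(s_j))
        for i in range(n_low):
            for j in range(n_high):
                # Agreement = scalar product of prediction and current output.
                b[i][j] += dot(u_hat[i][j], v[j])
    return v, c

# Three lower capsules, two higher capsules, 2-D vectors (made-up numbers).
# Predictions for higher capsule 0 agree; those for higher capsule 1 conflict.
u_hat = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.1], [-1.0, 0.0]],
    [[0.9, 0.0], [0.0, 1.0]],
]
v, c = route(u_hat)
lengths = [math.sqrt(dot(vec, vec)) for vec in v]
print(lengths)  # higher capsule 0 ends up with the longer output vector
print(c)        # every lower capsule routes most of its output to capsule 0
```

The length of each output vector plays the role of the "probability that the entity exists" from the quote, so the higher capsule whose incoming predictions agree ends up both more active and more strongly coupled to its children.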

A good answer is usually more than just a quote. You can usually restate it in a clearer way or go into further depth. Very rarely is a quote alone all that it takes to make a good answer. Do you think you could improve this a bit by editing?


One of the major advantages of convolutional neural networks is their invariance to translation. However, this invariance comes at a price: the network does not consider how different features are related to each other. For example, given a picture of a face, a CNN will have difficulty capturing the relationship between the mouth features and the nose features. Max pooling layers are the main reason for this effect: when we use them, we lose the precise locations of the mouth and nose, so we cannot say how they are related to each other.
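A tiny example shows how max pooling throws away position. This is a hand-made 1-D illustration, not part of any CNN library; the helper name and the "mouth detector" activations are invented for the demonstration.

```python
def max_pool_1d(signal, window):
    """Non-overlapping max pooling over a 1-D list."""
    return [max(signal[i:i + window]) for i in range(0, len(signal), window)]

# A "mouth" detector fires at index 0 or at index 3 of the same pooling window.
mouth_left = [9, 0, 0, 0, 0, 0, 0, 0]
mouth_right = [0, 0, 0, 9, 0, 0, 0, 0]

pooled_left = max_pool_1d(mouth_left, window=4)
pooled_right = max_pool_1d(mouth_right, window=4)
print(pooled_left, pooled_right)  # [9, 0] [9, 0]
```

After pooling, the two inputs are indistinguishable, so a later layer cannot tell where inside the window the mouth was, let alone where it sits relative to the nose.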

Capsules try to keep the advantage of CNNs and fix this drawback in two ways:

  1. Invariance: quoting from this paper

When the capsule is working properly, the probability of the visual entity being present is locally invariant – it does not change as the entity moves over the manifold of possible appearances within the limited domain covered by the capsule.

In other words, a capsule takes into account the existence of the specific feature we are looking for, such as a mouth or a nose. This property ensures that capsules are translation invariant in the same way that CNNs are.

  2. Equivariance: instead of making the feature translation-invariant, a capsule makes it translation-equivariant or viewpoint-equivariant. In other words, as the feature moves and changes its position in the image, its feature vector representation changes in the same way, which makes it equivariant. This property of capsules tries to solve the drawback of max pooling layers that I mentioned at the beginning.
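The two properties above can be seen side by side in a toy "capsule" that reports both a presence score and a position. This is a hand-written template matcher on a 1-D signal, purely for illustration; a real capsule learns its parameters, and the function name, template and signals below are hypothetical.

```python
def toy_capsule(signal, template):
    """Return (presence, position) of the best template match in a 1-D signal."""
    width = len(template)
    norm = sum(t * t for t in template)
    best_score, best_pos = -1.0, 0
    for pos in range(len(signal) - width + 1):
        patch = signal[pos:pos + width]
        score = sum(s * t for s, t in zip(patch, template)) / norm
        if score > best_score:
            best_score, best_pos = score, pos
    # Presence is clipped to 1.0 so it reads like a probability.
    return min(best_score, 1.0), best_pos

template = [1, 2, 1]
original = [0, 0, 1, 2, 1, 0, 0, 0]
shifted = [0, 0, 0, 0, 0, 1, 2, 1]  # the same feature moved 3 steps to the right

presence_a, position_a = toy_capsule(original, template)
presence_b, position_b = toy_capsule(shifted, template)
print(presence_a, position_a)
print(presence_b, position_b)
```

The presence score is the same for both inputs (invariance), while the reported position moves by exactly the amount the feature moved (equivariance).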
Licensed under cc by-sa 3.0 with attribution required.