We’ve just uploaded a spin-off research paper to arXiv titled “Sparse Unsupervised Capsules Generalize Better”. So what’s it all about?
You may have heard of Capsules networks already – if not, have a read of one of these blog articles (here, here, here, or here (EM routing)), watch this video, or consult one of the two recent key papers:
- Dynamic Routing Between Capsules (Sabour et al. 2017)
- Matrix Capsules with EM-Routing (Hinton et al. 2018)
Briefly, a Capsule outputs a vector of parameters that describe the state of the entity it represents – for example, the pose of an object, or some information about the shape of a specific object instance. After training, Capsules are shown to discover “equivariances”: ways in which the parameters can describe continuous changes in an entity. For MNIST digits, these include stroke width and digit skew. In our work, which is unsupervised, the equivariances also include digit transformations (e.g. from a 3 to a 5, as shown in the figure above).
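The standard way to visualize an equivariance is to take the pose vector of an active capsule, sweep one dimension over a small range, and decode each perturbed vector back to image space. The sketch below shows only that probing procedure – the pose size, the swept dimension, and the linear “decoder” are random stand-ins, not the paper’s actual model:

```python
import numpy as np

# Hypothetical setup: a trained capsule emits a 16-D pose vector for a digit,
# and a decoder maps that vector back to a 28x28 image. Both are stand-ins
# here (random weights), purely to illustrate the perturbation procedure.
rng = np.random.default_rng(0)
pose = rng.normal(size=16)            # pose vector of one active capsule
decoder = rng.normal(size=(16, 784))  # stand-in for a learned decoder

# Sweep one dimension while holding the rest fixed. With a trained model,
# each row of reconstructions traces out one equivariance (stroke width,
# skew, or -- in the unsupervised case -- digit morphing).
sweeps = []
for delta in np.linspace(-0.25, 0.25, 11):
    perturbed = pose.copy()
    perturbed[5] += delta               # which dimension is swept is arbitrary
    sweeps.append(perturbed @ decoder)  # 784-D "reconstruction"

sweeps = np.stack(sweeps)               # (11, 784): one image per perturbation
```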
In addition, Capsules networks have a mechanism called Dynamic Routing that builds a “parse-tree” from the Capsules. The parse tree is a subset of Capsules across many layers that collectively agree on what is being observed in the input. As a result, after routing, the Capsules network describes the configuration of a set of entities it has found in the input, using a subset of all the available Capsules.
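The routing-by-agreement procedure from Sabour et al. (2017) can be sketched in a few lines of NumPy – the capsule counts and dimensions below are illustrative, not taken from our network:

```python
import numpy as np

def squash(v, axis=-1):
    # Squash nonlinearity: shrinks short vectors toward 0 and long vectors
    # toward unit length, so vector length can act as an existence probability.
    n2 = np.sum(v**2, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * v / np.sqrt(n2 + 1e-9)

def route(u_hat, iters=3):
    # u_hat: (num_in, num_out, dim) prediction vectors from child capsules.
    # Coupling coefficients shift toward the parents each child agrees with,
    # which is what carves out the "parse tree" described above.
    b = np.zeros(u_hat.shape[:2])                             # routing logits
    for _ in range(iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over parents
        s = np.einsum('ij,ijd->jd', c, u_hat)                 # weighted sum per parent
        v = squash(s)                                         # parent outputs
        b += np.einsum('ijd,jd->ij', u_hat, v)                # agreement update
    return v, c

rng = np.random.default_rng(1)
u_hat = rng.normal(size=(8, 4, 16))  # 8 children, 4 parents, 16-D poses
v, c = route(u_hat)
```

After a few iterations, each child concentrates its coupling coefficients on the parents whose outputs its predictions agree with – the subset of strongly coupled capsules is the parse tree.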
Attention & Selective Memory
We are interested in Capsules for several reasons. First, there may be important representational gains from the Capsules approach. Second, we are impressed with the way dynamic routing integrates feedback in a stable manner, and with the fact that routing can also act as a selection mechanism, driving memory towards particular perceptions – the Multi-MNIST classification task from Sabour et al. already shows an example of this (see figure). Finally, routing provides an attention mechanism, because routing weights can be targeted at particular areas of the hierarchy. So with a Capsules network, we get stable feedback integration, selective memory and an attention mechanism straight out of the box!
Given our focus, we built an unsupervised Capsules network derived from the Dynamic Routing Between Capsules (Sabour et al. 2017) implementation. As our paper explains, simply making the network unsupervised didn’t work; we also had to add a form of sparse training.
Sabour et al. trained their network on MNIST images and then tested classification accuracy on affNIST images (affine-transformed MNIST), achieving 79% accuracy; they also report 66% accuracy for a conventional convolutional network with a similar number of parameters. This suggests that the Capsules representation was able to generalize from MNIST to affNIST.
In our work, we trained our sparse unsupervised capsules network on MNIST and then used an SVM to classify affNIST digit labels given the activity of the deepest unsupervised capsules layer. We managed to improve the affNIST score to 90%! Hence, we conclude that sparse unsupervised capsules do generalize better than supervised capsules, at least in their current form.
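The readout protocol looks roughly like the sketch below. The features here are random stand-ins for the deepest capsules layer’s activities (so the resulting accuracy is meaningless); only the shape of the pipeline – fit a nonlinear SVM on latent features from one dataset, predict labels for latent features from another – reflects what we did:

```python
import numpy as np
from sklearn.svm import SVC

# Stand-in features: in the real pipeline these would be activities of the
# deepest unsupervised capsules layer, computed on MNIST (train) and affNIST
# (test). Sizes and labels here are synthetic, purely for illustration.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 64))       # latent features, "MNIST" side
y_train = rng.integers(0, 10, size=300)    # digit labels
X_test = rng.normal(size=(50, 64))         # latent features, "affNIST" side

clf = SVC(kernel="rbf")  # nonlinear SVM readout
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
```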
We compared our score to all the affNIST results we could find, and noticed that our result is similar to or better than most conventional networks, even when those networks are trained on affNIST directly. This looks promising for capsules in general.
Our result also has another implication. Supervised training of latent capsules layers enforces sparseness in shallower layers too, but this effect only works to a limited depth. Our investigation of the properties of dense unsupervised capsules suggests that, without sparse training, you can’t have deep capsules networks. Sparse training might therefore be a key enabler of deep capsules networks.
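As a concrete illustration of one simple sparsening rule – a hypothetical top-k mask over capsule activities, which may differ from the exact sparse-training scheme in our paper – consider:

```python
import numpy as np

def topk_mask(acts, k):
    # acts: (batch, num_capsules) scalar activities, e.g. pose-vector lengths.
    # Returns a 0/1 mask keeping the k most active capsules in each sample.
    idx = np.argpartition(-acts, k - 1, axis=1)[:, :k]
    mask = np.zeros_like(acts)
    np.put_along_axis(mask, idx, 1.0, axis=1)
    return mask

rng = np.random.default_rng(0)
acts = rng.random((4, 10))            # 4 samples, 10 capsules
sparse = acts * topk_mask(acts, k=3)  # all but the top 3 capsules are zeroed
```

Applied during training, a constraint like this forces each sample to be explained by a small subset of capsules, which is the property the dense unsupervised variant lacked.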
You can see some of the equivariances produced by our network in the headline image at the top of this page. Since the network is unsupervised, the equivariances produced include digit morphing.
My questions as a reviewer would be: how much does the SVM add, i.e. what accuracy would the SVM achieve on its own? How much data is it trained on? What would the accuracy be with a simpler final layer, such as linear or logistic regression? And of course, if you come close to the state of the art even against networks trained on affNIST, what accuracy does your network achieve if it is trained entirely on affNIST?
Hi Philip! Good questions. We trained the SVM on 60,000 samples (disjoint from the test set). A simpler final layer doesn’t work very well: in exactly the same conditions as our 90.12% nonlinear SVM result (with the same sparse-caps pretrained model), using logistic regression instead, the affNIST test accuracy is 48.32%. Clearly the latent capsules state needs a lot of interpretation, and logistic regression isn’t able to do a good job of combining these features. Probably a 2-layer fully connected feed-forward network (ReLU, then softmax) would be comparable to the SVM.
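For the record, the 2-layer readout I have in mind is just the following (forward pass only, with random weights and illustrative sizes – we haven’t run this variant):

```python
import numpy as np

# Sketch of a 2-layer readout: ReLU hidden layer, then softmax over digits.
# Weights are random stand-ins; in practice they'd be trained on the latent
# capsule activations, like the SVM was.
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 64))                # batch of latent capsule activations
W1, b1 = rng.normal(size=(64, 128)) * 0.05, np.zeros(128)
W2, b2 = rng.normal(size=(128, 10)) * 0.05, np.zeros(10)

h = np.maximum(x @ W1 + b1, 0.0)             # ReLU hidden layer
logits = h @ W2 + b2
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)    # softmax over the 10 digit classes
```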
I would worry about this more but for the fact that the final result is so close to, or better than, the state of the art for any algorithm on the MNIST→affNIST generalization task. Given that a 40-layer ANN with a lot of task-specific customization achieved 91.6%, and all other results from conv-nets were in the 80–86% range, I expect an SVM by itself would do a terrible job of this, although I haven’t tested it (you’d have to do some dimensionality reduction *somehow*).
We haven’t actually tried training the whole network on affNIST. This algorithm is a temporary one, since we think we have a replacement for the capsule consensus mechanism that we expect to be much better. We will have results for this in a month or two and will put out another paper on the new algorithm.
OK, that sounds exciting. I’m looking forward to reading about the new algorithm, then.