
New approaches to Deep Networks – Capsules (Hinton), HTM (Numenta), Sparsey (Neurithmic Systems) and RCN (Vicarious)


Reproduced left to right from [8,10,1]

Within a 5 day span in October, 4 papers came out that take a significantly different approach to AI hierarchical networks. They are all inspired by biological principles to varying degrees. It’s exciting to see different ways of thinking, particularly at a time when there is growing skepticism about how far standard convolutional neural networks can take us towards Artificial General Intelligence (AGI). For some examples, see MIT Tech Review, blog on video from Hinton, or presentation from LeCun.

The common theme is the use of iterative feedforward (bottom-up) and feedback (top-down) passes, and integrating this data in a manner similar to Belief Propagation. The feedback pass exploits the ‘generative’ properties of the network and can be thought of as utilising priors. It is sometimes referred to as a top-down attentional mechanism, or as the network ‘imagining’ the possibilities. Nodes accumulate evidence from neighbouring sensory areas via feedback, and in some cases lateral connections are explicitly utilised, so that the network finds a broader consensus on the best hypothesis. Another feature common to some of these studies, and inspired by neuroscience, is the use of sparse coding. It provides a more powerful and robust representation by distributing the state amongst many attributes in combination. See our article for more information.

I believe these approaches are a significant step toward more general purpose AI. I was particularly excited to see them and was keen to summarise the approaches and results below. They don’t explicitly address the characteristics of Continuous Learning so I haven’t gone into detailed analysis along those dimensions in this article.

Dynamic Routing Between Capsules – Sabour et al., Google Brain

Hinton (the last author) needs no introduction. He’s one of the most significant researchers to bring convolutional neural nets (CNNs) to prominence, but has been critical of their limitations in recent years. He’s been developing a new Capsule Network theory [1] and the ML community was eagerly anticipating this paper.


The Capsule Network is a multilayer neural network where neurons are grouped into ‘capsules’. Each neuron in a capsule represents a property, such as pose (position, size, orientation), deformation, texture, colour and movement. Each capsule represents an ‘entity’. Its vector length is the probability of existence, and the orientation represents the component properties. A convolutional network with multiple layers is constructed using these capsules.


Iterative message passing is used for inference in a process they call Dynamic Routing. A feedforward pass activates capsules which then feed back with proportionate strength updating an ‘agreement’ value between feedforward and feedback. This finds agreement between bottom up activations and top down concepts ‘generated’ by the available evidence. This occurs iteratively for a fixed number of iterations until a classification is achieved. Dynamic Routing is “an effective way to implement the ‘explaining away’ that is needed for segmenting highly overlapping objects [1].”
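To make the routing loop concrete, here is a minimal NumPy sketch of routing-by-agreement between two capsule layers. It is a toy illustration under my own assumptions, not the authors' code: the names `squash`, `route` and `u_hat`, and the toy prediction vectors, are hypothetical, and the learned transformation matrices that would normally produce the predictions are left out.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Shrink each vector so its length lies in [0, 1) while keeping its direction."""
    sq_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def route(u_hat, n_iters=3):
    """u_hat[i, j, :] is lower capsule i's prediction for upper capsule j."""
    n_lower, n_upper, _ = u_hat.shape
    b = np.zeros((n_lower, n_upper))               # 'agreement' logits
    for _ in range(n_iters):
        c = softmax(b, axis=1)                     # each lower capsule splits its vote
        s = np.einsum("ij,ijd->jd", c, u_hat)      # weighted sum of predictions
        v = squash(s)                              # upper capsule outputs
        b = b + np.einsum("ijd,jd->ij", u_hat, v)  # reward predictions that agree with v
    return v, c

# Toy input (assumed values): four lower capsules agree about upper capsule 0
# and contradict each other about upper capsule 1.
u_hat = np.zeros((4, 2, 3))
u_hat[:, 0, :] = [2.0, 0.0, 0.0]
u_hat[:, 1, :] = [[2, 0, 0], [-2, 0, 0], [0, 2, 0], [0, -2, 0]]
v, c = route(u_hat)
print(np.linalg.norm(v, axis=1))  # capsule lengths, read as existence probabilities
print(c)                          # final coupling coefficients
```

In this toy case the coupling coefficients drift towards the upper capsule whose incoming predictions agree, which is the ‘consensus’ behaviour described above.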

Position information is not lost as with max pooling, and the learning of capsule properties allows the network to learn “viewpoint invariant” knowledge that generalizes to novel viewpoints more effectively than conventional neural networks. That is demonstrated by the high accuracy of recognising MNIST characters subject to affine transformations (beyond simple translation).


Supervised learning is used to train the connection weights as well as the priors for the ‘agreement’ values between capsules. The objective function is based on both classification error and reconstruction error. The latter is often seen in unsupervised generative models, and would assist in creating meaningful ‘generative’ top-down messages used in Dynamic Routing.
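As a rough illustration of a two-part objective of this kind, the sketch below combines a margin-style classification loss on capsule lengths with a small reconstruction term. The function names are hypothetical, and the constants (`m_plus`, `m_minus`, `lam`, `recon_weight`) are the values as I recall them from the paper, so treat them as assumptions rather than a definitive specification.

```python
import numpy as np

def margin_loss(lengths, target, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Classification term: penalise short capsules for present classes
    and long capsules for absent classes. Constants are assumed values."""
    lengths = np.asarray(lengths, dtype=float)
    target = np.asarray(target, dtype=float)
    present = target * np.maximum(0.0, m_plus - lengths) ** 2
    absent = lam * (1.0 - target) * np.maximum(0.0, lengths - m_minus) ** 2
    return float(np.sum(present + absent))

def total_loss(lengths, target, image, reconstruction, recon_weight=0.0005):
    """Classification term plus a down-weighted squared reconstruction error."""
    diff = np.asarray(image, dtype=float) - np.asarray(reconstruction, dtype=float)
    return margin_loss(lengths, target) + recon_weight * float(np.sum(diff ** 2))

# Two classes, the first one is the true class.
print(margin_loss([0.95, 0.05], [1, 0]))
print(total_loss([0.5, 0.5], [1, 0], np.zeros(4), np.ones(4)))
```

The small weight on the reconstruction term keeps it from dominating the classification term during training.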


This capsule network was tested on a range of well known datasets, from the simple digits of MNIST, through multiple overlapping digits (MultiMNIST) and digits with affine transformations, to object recognition in images using CIFAR10, smallNORB and SVHN. The network performed better than a baseline CNN, and better than or competitive with state of the art on equivalent tests. It also appears to perform better than state of the art in some respects, i.e. robustness to affine transformations and the ability to segment overlapping images.

A Theory of How Columns in the Neocortex Enable Learning the Structure of the World – Hawkins et al., Numenta

This study [2] offers a computational model for microcortical circuits and builds on recent work and earlier publications dating back to 2005 [3,4,5,6]. The model has evolved and is significantly different to previous versions. This work is focussed on understanding the functioning of the neocortex rather than machine learning and AI explicitly. This is one of few neocortical computational models at this level of abstraction. Models such as this are necessary to understand the holistic functionality of a cortical column, and how it is able to achieve highly effective sensorimotor signal processing.

The approach is motivated by the fact that for recognition tasks accomplished by a human (or other animal), sequences of sensory input are received and combined to make a ‘classification’.

The test problem is recognition of objects by grasping them. This is a practical robotic application, more so than many other ML projects. However, the problem construction is hand crafted and much simpler than a physical model. It’s refreshing to see something quite different from standard image recognition datasets, but it makes it impossible to compare against established benchmarks and therefore to judge general effectiveness.


The model is a neural network utilising HTM neurons [7] which resemble biological pyramidal neurons. These neurons are more complex than conventional artificial neural network neurons, with multiple groups and types of input connections (dendrites) with different functions. Some dendrites stimulate the cell directly, while others are modulatory and predict activations. There is an input and output layer (resembling two of the cortical pyramidal cell layers) with feedback and lateral input connections. Neurons are arranged into columns that cover a subset of the input space. Lateral connections are made between and within columns in the output layer. They use configurations of single and multiple columns. Hierarchies have not yet been explored. Intra-column inhibition, comprising competitive learning and inference, ensures sparse coding in the input layer.

The input to the network represents fingers grasping an object, consisting of sensory input and a location on the object. Each is represented by a sparse binary array. Calculation of location is not in scope, but there is evidence for and a model for how the brain may calculate such a signal.


The model learns unsupervised through Hebbian adaptation “when cells fire, previously active synapses are strengthened and inactive ones are weakened.” The input layer learns to represent the input patterns (locations are modulatory and ‘predict’ the corresponding sensory features), the output layer represents the object being presented to the network.
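The quoted Hebbian rule can be sketched as a simple update on synapse strengths. This is only a schematic reading of the quote: the function name, the dictionary representation and the `increment` and `decrement` values are my own assumptions, not parameters taken from the paper.

```python
def hebbian_update(permanences, active_inputs, increment=0.1, decrement=0.02):
    """Apply one Hebbian step for a cell that has just fired.

    permanences maps input index -> synapse strength in [0, 1].
    Synapses from inputs that were active are strengthened, the rest weakened.
    The increment and decrement sizes are illustrative assumptions.
    """
    updated = {}
    for index, strength in permanences.items():
        if index in active_inputs:
            updated[index] = min(1.0, strength + increment)
        else:
            updated[index] = max(0.0, strength - decrement)
    return updated

synapses = {0: 0.5, 1: 0.5, 2: 0.0}
synapses = hebbian_update(synapses, active_inputs={0})
print(synapses)
```

Repeated over many presentations, a rule like this leaves strong synapses only from inputs that reliably precede the cell's activity.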

The input and output layers are reciprocally connected – input feeds forward to the output layer, and the output layer feeds back to the input layer. The output layer’s feedback is stable over the different sensory/location sensations for an object. Therefore, the set of cells that correspond to the input patterns for a given object become connected to the output cells that represent that object. The feedback is connected to the modulatory dendrites of the input layer, thereby learning to predict activation of the input sensations for a given object.


For each sensation, the input layer activates a set of possible objects in the output layer. Via feedback, the output then ‘predicts’ the activations for the set of possible objects. This acts like a mask for subsequent sensations, which only activate a subset of predicted cells narrowing the classification until it is unambiguous (if possible). Evidence is accumulated over time with different sensory inputs for the same object as the network reaches a unique classification, but not by explicit representation of sequences.
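The narrowing process can be illustrated with plain sets, setting aside the neural detail. In this sketch, which is an abstraction of my own rather than the paper's model, each object is a set of (location, feature) pairs and the candidate set plays the role of the output layer's union of possible objects. The object names and features are made up.

```python
def learn(objects):
    """Map each (location, feature) pair to the set of objects that contain it."""
    index = {}
    for name, pairs in objects.items():
        for pair in pairs:
            index.setdefault(pair, set()).add(name)
    return index

def recognise(index, sensations):
    """Intersect the candidate objects over successive sensations.

    Returns the remaining candidates and the number of sensations used.
    """
    candidates = set().union(*index.values()) if index else set()
    steps = 0
    for pair in sensations:
        steps += 1
        candidates = candidates & index.get(pair, set())
        if len(candidates) <= 1:
            break
    return candidates, steps

# Hypothetical objects, described by what a finger senses at each location.
objects = {
    "mug": {("top", "rim"), ("side", "handle"), ("bottom", "flat")},
    "can": {("top", "rim"), ("side", "curved"), ("bottom", "flat")},
    "ball": {("top", "curved"), ("side", "curved"), ("bottom", "curved")},
}
index = learn(objects)
print(recognise(index, [("top", "rim"), ("side", "handle")]))
```

As in the paper's description, no sequence is stored: the order of the sensations does not matter, only which candidates remain consistent with all of them.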


The network was able to recognise several hundred objects. It appears that the model was trained and tested on the same data, likely to be because of the nature and size of the dataset. Therefore, it is hard to draw conclusions about accuracy and generalisability. This is disappointing given the purported properties of SDRs, one of which is that it is possible to compare similarity i.e. if the network is shown two very similar objects, the outputs should be the same, and if somewhat similar, then similar outputs but not necessarily the same.
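The similarity property of SDRs mentioned here comes down to counting shared active bits. A small sketch follows, with an assumed representation (a set of active bit indices) and illustrative sizes of 2048 bits with 40 active; neither is taken from the paper's experiments.

```python
import random

def random_sdr(size, active, rng):
    """A sparse binary vector, stored as the set of its active bit indices."""
    return frozenset(rng.sample(range(size), active))

def perturb(sdr, size, n_changed, rng):
    """Return a similar SDR: swap n_changed active bits for inactive ones."""
    kept = set(rng.sample(sorted(sdr), len(sdr) - n_changed))
    unused = [i for i in range(size) if i not in sdr]
    return frozenset(kept | set(rng.sample(unused, n_changed)))

def overlap(a, b):
    """Number of active bits the two SDRs share."""
    return len(a & b)

rng = random.Random(0)
a = random_sdr(2048, 40, rng)
similar = perturb(a, 2048, 5, rng)
unrelated = random_sdr(2048, 40, rng)
print(overlap(a, a), overlap(a, similar), overlap(a, unrelated))
```

Because the vectors are so sparse, two unrelated codes share almost no bits by chance, so a high overlap is strong evidence of genuine similarity.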

The properties of convergence (recognition speed), capacity and robustness were analysed. This uncovered attractive and intuitive characteristics. They showed that using multiple columns increases recognition speed, as does the number of unique features. Storing more objects results in a longer time to recognize an object.

In terms of capacity, the ability to recognize correctly declines as the number of stored objects increases. More input minicolumns or output cells result in greater capacity, but the effect drops away quickly. The capacity is limited by the number of connections between input and output layers. Effectively, “the capacity of the network is limited by the pooling capacity of the output layer” and increasing the number of columns does not help significantly.

The system was relatively robust to noise. When noise reached significant levels, up to 20% in the sensory input and 40% in the location input, the convergence speed did slow down.

The architecture is a biological computational model that is supported by neuroscience research. However, there are many assumptions about the function of other parts of the brain, specifically grid cells to provide the location signal. Also, there is empirical evidence that parts of the model are not supported by biological knowledge, related to timing of signals in the output cortical layers. Nevertheless, it’s good to see new approaches to artificial intelligence and sensorimotor processing inspired by neocortex. The simulation results help to bolster the model as well as make testable predictions.

A generative vision model that trains with high data efficiency and breaks text-based CAPTCHAs – George et al., Vicarious

George et al. from Vicarious have published their Recursive Cortical Network, RCN [8]. George and Hawkins worked together on the first versions of HTM [3,4,5]. That was based on a biologically compatible version of belief propagation. It’s interesting to see how they’ve developed their respective approaches from a common beginning, each preserving key components of it but developing quite differently in other ways.

RCN draws on the insight that human vision treats shape and appearance separately, so that a known object with an unexpected colour is still easily recognisable. They define Shape as comprising contours represented by edges, and Appearance as comprising surface texture represented by colours. They tested the system on challenging text recognition tasks.

Additionally, RCN incorporates many neuroscientific principles: hierarchy, top-down and bottom-up signal transmission, separate mechanisms for contour and appearance, and lateral connections for contour consistency.


A hierarchy is used to recognize Shape. It is a graph of binary random variables representing edge features with alternating feature and pooling layers. Features are pooled over invariances such as translation i.e. a corner translated across multiple positions is the same corner. Lateral connections between pooling nodes are also important, providing ‘evidence’ that for example, a continuous line between areas of an image exists. Appearance is represented by a conditional random field (CRF), conditioned on the features. Shape and Appearance are then combined for recognition.

In the published version, pooling is implemented as hardcoded translational pooling. Their intention is for it to be more general and learnt from temporal proximity, as envisaged in the early work with Hawkins.
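Hardcoded translational pooling can be pictured as an OR over a small neighbourhood of feature detections. The sketch below is a deterministic simplification of my own (RCN's nodes are binary random variables in a probabilistic graph, which this ignores), and the function name and `radius` parameter are hypothetical.

```python
def pool_translation(feature_map, radius):
    """OR-pool a binary feature map over a (2*radius+1) square window.

    feature_map is a 2D list of 0/1 detections of a single feature,
    e.g. 'corner present here'. A pooled cell is 1 if the feature
    occurs anywhere within the window centred on it.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            pooled[r][c] = int(any(
                feature_map[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            ))
    return pooled

# The same corner detected at two neighbouring positions.
corner_a = [[0] * 5 for _ in range(5)]
corner_b = [[0] * 5 for _ in range(5)]
corner_a[1][1] = 1
corner_b[1][2] = 1
pooled_a = pool_translation(corner_a, radius=1)
pooled_b = pool_translation(corner_b, radius=1)
print(pooled_a[1][1], pooled_b[1][1])
```

Both inputs activate the same pooled cell, which is the sense in which a corner translated across positions is treated as the same corner.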


Inference is carried out by finding the maximum a posteriori (MAP) estimate for the image with message passing with one bottom-up forward pass and one top-down backward pass. The forward pass identifies hypotheses for the location and categories of objects, and the backward pass (with lateral propagation) identifies segmentation masks for the objects. For multiple objects, there is an outer loop that visits the hypotheses and selects the best subset of objects.


Training is essentially unsupervised. If the image cannot be explained by features at a given level n, then the active features from the level below (n-1) are grouped into a new feature at n. Features are then pruned with a cost function that incorporates reconstruction and compression error. The final layer is trained with supervising labels. A new class is created for every image, but due to the unsupervised training and abstractions, only a subset of the dataset is used. Training occurs as a batch process, level by level.


The study is focussed on CAPTCHAs and the system was tested on many variants, but RCN was also tested on a variety of other image types, including MNIST with occlusion and ICDAR-13.

The results show that RCN can segment characters with occlusion, overlap and complex textures, and is robust to changes in character spacing. A major advantage is that it is unsupervised and able to learn from very few examples. In addition, it showed an ability for one-shot and few-shot learning (on handwritten digits).

Recognition across the range of CAPTCHA variants was impressive at about 60% or higher. They didn’t include comparison to other published studies. They did include control experiments with CNNs, but it is hard to make a fair comparison as there are outer mechanisms that are required for good performance.

Superposed Episodic and Semantic Memory via Sparse Distributed Representation – Rinkus and Leveille, Neurithmic Systems

Another paper to come out in the same period reports on a system called Sparsey by Rinkus. It looks at the relationship between semantic and episodic memory and displays a capability for both [9].


The core memory component is an unsupervised, Hebbian, hierarchical, associative memory. A single layer is similar to the input layer of Hawkins’ paper above. They actually published the core algorithm many years ago in 2010 [10] in a paper that was also focussed on building a computational model of cortical circuits. It describes a layer of pyramidal cells as associative memory functioning with storage and retrieval of sparse distributed representations (referred to as codes). It is the clearest explanation I’ve come across of microcolumns, macrocolumn and their functional relationship at a level of abstraction that is useful for building signal processing for AI (in fact we base our working definition on this work). I won’t go into more details of the core component in this article.

In this recent work, a three level hierarchy was used. It included a supervised SVM final layer to extract recognisable classifications as is commonly done to test unsupervised learning algorithms.

The algorithm does not involve bidirectional message passing like the others discussed here. A resulting advantage is that retrieval is constant time. A notable feature is a one-shot learning ability (regarded as providing episodic memory). It only needs to see an input once to store the associated code.
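One-shot storage with fixed-cost retrieval can be illustrated with a binary Hebbian associative memory. This is a generic sketch in the spirit of the description above, not Sparsey's actual code selection algorithm: the class name, the top-k readout and the sizes are all assumptions.

```python
class BinaryAssociativeMemory:
    """Binary weights, set to 1 the first time an input and code unit are co-active."""

    def __init__(self, n_inputs, n_units):
        self.n_units = n_units
        self.weights = [[0] * n_units for _ in range(n_inputs)]

    def store(self, input_bits, code_bits):
        """A single presentation is enough to store the association."""
        for i in input_bits:
            for j in code_bits:
                self.weights[i][j] = 1

    def retrieve(self, input_bits, code_size):
        """Sum the weights from active inputs and keep the code_size best units.

        The cost depends on the layer sizes only, not on how many
        associations have been stored.
        """
        sums = [sum(self.weights[i][j] for i in input_bits) for j in range(self.n_units)]
        ranked = sorted(range(self.n_units), key=lambda j: (-sums[j], j))
        return set(ranked[:code_size])

memory = BinaryAssociativeMemory(n_inputs=8, n_units=10)
memory.store({0, 1, 2}, {5, 6})
memory.store({3, 4, 7}, {1, 9})
print(memory.retrieve({0, 1, 2}, code_size=2))
print(memory.retrieve({0, 1}, code_size=2))  # a partial cue
```

Because retrieval is a single pass over a fixed weight matrix, its running time does not grow with the number of stored items, which is the constant-time property noted above.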


The system was tested on MNIST and the Weizmann video event recognition datasets. The results were not as good as state of the art, but they were competitive, with high accuracy, and required orders of magnitude less training data.

I intend to return to this paper in future articles that delve into the properties and potential implementations of episodic memory.

Closing remarks

We’ve seen great progress in AI over the last few years with deep learning. The reviewed crop of papers offers an exciting glimpse into what could be a new approach to deep networks, with stronger analogies to biological systems, an ability to find meaningful invariances, and both bottom-up and top-down information flow to find a more globally optimal hypothesis.


  1. Sabour, S., Frosst, N. & Hinton, G. Dynamic Routing between Capsules. (2017).
  2. Hawkins, J., Ahmad, S. & Cui, Y. Why Does the Neocortex Have Layers and Columns, A Theory of Learning the 3D Structure of the World. bioRxiv July 12, 0–15 (2017).
  3. George, D. & Hawkins, J. A hierarchical Bayesian model of invariant pattern recognition in the visual cortex. in 3, 1812–1817 vol. 3 (2005).
  4. Hawkins, J. & George, D. Hierarchical temporal memory: Concepts, theory and terminology. Whitepaper, Numenta Inc 20 (2006).
  5. George, D. & Hawkins, J. Towards a mathematical theory of cortical micro-circuits. PLoS Comput. Biol. 5, (2009).
  6. Hawkins, J., Ahmad, S. & Dubinsky, D. HTM Cortical Learning Algorithms. 1–68 (2011).
  7. Hawkins, J. & Ahmad, S. Why neurons have thousands of synapses, a theory of sequence memory in neocortex. Front. Neural Circuits 10, 1–13 (2016).
  8. George, D. et al. A Generative Vision Model that Trains with High Data Efficiency and Breaks Text-based CAPTCHAs. Science 1–19 (2017).
  9. Rinkus, R. & Leveille, J. Superposed Episodic and Semantic Memory via Sparse Distributed Representation. (2017).
  10. Rinkus, G. J. A cortical sparse distributed coding model linking mini- and macrocolumn-scale functionality. Front. Neuroanat. 4, 17 (2010).

2 thoughts on “New approaches to Deep Networks – Capsules (Hinton), HTM (Numenta), Sparsey (Neurithmic Systems) and RCN (Vicarious)”

  1. Thank you for the Rinkus stuff, I wasn’t aware of that. It’s very similar to what I have been thinking about, so highly interesting to me.

    Could you point me to the actual “learning” algorithm? I went through the CSA in some detail (in “A cortical sparse distributed…”), but I’m still missing the step where the weights are actually adjusted.

    Ah, found it: “These BU weights (synapses) are binary, initially 0, and are permanently set to a weight of 1 the first time the pre- and postsynaptic units are co-active (i.e., Hebbian learning).”

    Ok, another stupid question: Why do they vary the test sample number for MNIST (on page four of “Superposed Episodic and Semantic Memory…”)? Wouldn’t you just use all samples not used in training to get the most accurate idea of your accuracy? Maybe they keep on learning during training?

    It’s also weird that they don’t try to use an ensemble (which in their model would just be a scale up) to boost accuracy.
