This is the second part of our comparison between convolutional competitive learning and convolutional or fully-connected sparse autoencoders. To understand our motivation for this comparison, have a look at the first article.
We decided to compare two specific algorithms that tick most of the features we require: K-Sparse autoencoders, and Growing-Neural-Gas-with-Utility (GNG-U) (Fritzke1997). The latter has proven ability to learn statistically optimal representations even given nonstationary input, under certain conditions.
We found the K-Sparse Autoencoder scheme of Makhzani and Frey (Makhzani2013) particularly appealing due to the simple manner of achieving the desired sparsity: They simply find k cells with the highest hidden layer activity, and then mask to zero the activity of the remaining hidden cells. It is also possible to directly target any desired sparsity. One downside of the first paper is that the only mechanism to ensure high cell utilization is to train very slowly, with good weight initialization. In their follow-up paper, Winner-Take-All Convolutional Sparse Autoencoders (Makhzani2015), they introduced the concept of lifetime sparsity: Cells that aren’t used often are trained on the most fitting batch samples to ensure high cell utilization over time.
In fact, the realization that you can pretty much use any method you want to include or exclude cells from the active population in the hidden layer means that the k-sparse scheme is a very flexible algorithm and could potentially exploit this flexibility to deal with nonstationary data.
A nice feature of the k-sparse autoencoder is that the completeness of the models learned by the algorithm depend on the sparsity; with greater sparsity, cells represent more complete patterns. With less sparsity, cells learn to represent stable parts of patterns (see figure).
Convolutional Growing Neural Gas with Utility (GNG-U)
This section describes how we can make GNG or GNG-U a convolutional algorithm. For readers who are unfamiliar with GNG, the first part of this article gives an introduction and links to relevant sources.
Receptive Field Geometry
We can use conventional convolutional geometry for a GNG network layer. The number of cells available to the GNG is Z, the depth of the layer. This is also known as the number of filters in a normal convolutional network. We can define a fixed receptive field and field tiling / stride as normal. The GNG is applied to each convolutional x,y location and the content of the receptive field at these locations is computed as normal. The output of the GNG at each location is two vectors of Z elements each:
- e the sum-square-error of each GNG cell and
- m a ternary mask indicating the winner (m[winner]=2) and runner-up cell (m[runnerup]=1), with other cells having value m[i]=0.
The GNG is trained at each convolutional location, albeit with a proportionally reduced learning rate to account for the number of training steps per image (note: Batch training with a higher learning rate would likely be superior but was not implemented). A winner and runner up are selected at each convolutional location, producing a 3-D ternary matrix with fixed sparsity.
We can build a stack of convolutional GNG layers using interleaved convolutional and pooling layers. In our experiments, we tested 1, 2 and 3 convolutional + pooling layers. Note that for simultaneous online training, this assumes that the GNG(U) can handle nonstationary input.
GNG is inherently a winner-take-all algorithm, or perhaps, a k=2 -sparse algorithm (it selects a “winner” and a “runner up”, which are modelled as neighbours for training). Normally, the hidden layer activity in convolutional networks is used as an input and output to pooling; we take the max value for each z in the x,y locations being pooled.
However, with GNG we only have the sum-square-error values e and the ternary mask indicating the winners. Neither is obviously equivalent to hidden layer activity in autoencoders or convolutional networks; sum-square-error becomes smaller when the cell is “better”, and is not sparse. The winner and runner-up ternary mask is very sparse, and it is not clear how we should represent the difference in error between the two (sometimes they have similar error, but not always).
Choosing a good output function is a balance between signal loss and noise removal. It is important to produce an output that is stable under reasonable conditions, enabling additional layers to learn. We investigated sparse binary outputs, sparse ternary outputs, functions derived from sum square error and functions, functions derived from cell ranking at each location, and nonlinear functions of sum square error (see table below).
The headline image at the top of this article shows convolutional GNGU receptive fields as yellow crosses, with the red matrix underneath displaying the output of the pooling input function for each GNGU cell. Winning cells have blue outlines.
In all cases, we define a function f( e, m ) that returns a pooling input value v[z] for each cell z at each receptive field position x,y. These values are then pooled using a max operator over all spatial locations for a given z index.
|1. Winner-Take-All Mask||The best-matching cell is set to 1 and other cells to zero. |
We output the max for each cell z in the pooling window. The output is therefore binary and of fixed sparsity.
if( z is the winner ) v[z] = 1
else v[z] = 0
|2. Ternary Mask||As above, but runner-up cells are set to 0.5. We apply a max over the pooled cells as above. The output is ternary and of approximately fixed sparsity.|
if( z is the winner ) v[z] = 1
if( z is the runner up ) v[z] = 0.5
else v[z] = 0
|3. Inverse Error||Assuming that inputs are in the interval [0, 1], cell error can be normalized to a unit value:|
v[z] = 1- ( sqrt( sumSqError[z] ) / numInputs )
|4. Relative Rank||v[z] = exp( -Q * rank( z ) / maxRank )|
Q = 10, 12, 14
maxRank = Z (the depth of the convolutional network in cells)
|5. Negative Log-Likelihood||v[z] = min( 1, Q * -log( sqrt( sumSqError[z] ) / numInputs ) )|
Q = 0.1
These parameters were developed from a series of ad-hoc tests and with reference to successful configurations used in conventional convolutional networks applied to MNIST.
To ensure the results are comparable to the autoencoder algorithms, we tried to conserve the number of “cells” at approximately 1000 and the number of learned weights at approximately:
- Layer 1: Z=64 with 6x6x1 receptive field = 2,304 parameters
- Layer 2: Z=256 with 3x3x64 receptive field = 147,456 parameters
- Total: 149,760 learned parameters
This is a small number of parameters compared to modern deep networks, but the MNIST test problem also has small images.
Note that although the 2nd layer has a larger number of parameters, the inputs are very sparse given most of the pooling functions evaluated. We did not reduce the width and height of the top unsupervised layer to 1×1 because this function could be performed by the supervised classifier. Although a range of values were tested, the best configuration is shown.
|Parameter name||Best value[s] tested (layers separated by commas)|
|Layer Depths (Z)||64, 256|
|Receptive field size (W,H)||6×6, 3×3|
|Receptive field stride (X and Y)||2, 1|
|Input padding||0, 0|
|Layer convolutional W and H||12, 4|
|Pooling window size||2, 2|
One of the nice qualities of the GNG algorithm is that almost all the hyperparameters are defined in terms of time rather than qualities of the data. In theory, this should make it especially easy to adapt to novel problems. Although a range of values were tested, the best configuration is shown.
|Parameter name||Best value[s] tested|
|Edge Max Age||600|
|Weight Learning Rate||0.015|
|Neighbour Weight Learning Rate||Weight Learning Rate * 0.2|
|Utility Learning Rate||0.01|
|Stress Learning Rate||0.005|
We chose to test the convolutional GNG-U algorithm on the MNIST dataset with a variety of pooling functions. For comparison, we implemented and tested a K-Sparse Autoencoder with a Batch-Sparsity constraint (each cell must be trained on at least p samples from each batch, which are selected retrospectively).
The algorithms were tested in 2 phases.
Phase 1 – Unsupervised network training
For both algorithms we performed unsupervised pre-training of the GNG-U and K-Sparse representations for N epochs. Next, we disabled learning and exposed the network to 1 epoch of the MNIST training set (60,000 images) and 1 epoch of the MNIST test set (10,000 images). The output of the deepest layer was recorded as a matrix of features x images, in addition to the vector of image labels.
All layers were trained simultaneously in online fashion.
Phase 2 – Supervised Classifier
Finally, in all cases, a logistic-regression supervised classifier was trained on the unsupervised training-set output and then tested on the unsupervised test-set output. This provides a classification accuracy score. We considered using a more powerful classifier (e.g. nonlinear SVM) but this transfers more of the responsibility to the classifier and reduces our ability to measure the contribution of unsupervised classification, especially on the MNIST dataset where scores are already very high.
The convolutional GNG-U algorithm in conjunction with a logistic regression classifier had a tendency towards consistently high training accuracy (often close to 100%) with more difficulty generalizing to high test accuracy.
|Algorithm||Pooling Function||Best Training Accuracy (%)||Best Testing Accuracy (%)||Unsupervised training Epochs|
|Conv-GNG-U||1. Winner-Take-All Mask||55.1%||50%||1|
|Conv-GNG-U||2. Ternary Mask||61.9%||55.5%||1|
|Conv-GNG-U||3. Inverse Error||88.4%||80%||1|
|Conv-GNG-U||4. Relative Rank||94.1%||88.2%||2|
|Conv-GNG-U||5. Negative Log-Likelihood||99.4%||97.0%||8|
|K-Sparse Autoencoder with Batch-Sparsity constraint*|
(non-convolutional, can be stacked)
Batch size: 32
Batch-lifetime sparsity: 2
Sparsity output factor: 3x
|K-Sparse Autoencoder |
w. Logistic Regression layer
|Conv-WTA Autoencoder (Makhzani2015)w. Logistic Linear SVM layer||Max hidden layer values within pooled area||n/a||99.09%||n/a|
|Stacked Conv-WTA Autoencoder (Makhzani2015)w. Logistic Linear SVM layer||Max hidden layer values within pooled area||n/a||99.52%||n/a|
* Results from our Java re-implementation of the K-Sparse autoencoder with batch-lifetime-sparsity constraint from the later Conv-WTA paper. Since our implementation is written from scratch in Java without use of thoroughly tested third-party libraries, small numerical issues could account for the slight drop in performance.
The MNIST dataset is very well studied, making it good for comparative results; however, it is also “too easy” which results in good algorithms being clustered in the 98-99% accuracy interval (at some point, higher accuracy just means overfitting to the dataset as maybe 0.1-1% of the digits are extremely distorted and arguably unrecognizable).
Despite a large hyperparameter search, we were unable to get state-of-the-art performance from a convolutional GNG-U algorithm on the MNIST dataset (best result: 97%). Under the same unsupervised training regime with a supervised logistic regression classifier, we were able to reproduce approximately state-of-the-art performance (98.6%) with a k-sparse autoencoder despite a complete re-implementation of the entire backpropagation framework in Java, without optimizations.
The choice of GNG pooling function is critical. The simpler binary or ternary pooling functions probably do not contain enough information to make accurate classifications in ambiguous examples. We attempted to use more sophisticated encodings without introducing data-dependent hyperparameters, with reasonable but inadequate results.
Interestingly, the GNG algorithm family train very rapidly to a good performance level, but do not benefit from slower training over a larger number of epochs. In addition, the training set classification accuracy is very high in many configurations (over 99%), but does not generalize well to test accuracy. We tried dropout and greater regression regularization to reduce overfitting to training data, without improvement.
It is unsatisfying that we were not able to explore why a convolutional GNG-U is unable to benefit from more & slower training, like a convolutional autoencoder using stochastic gradient descent does. However, without promising results we can’t justify further investigation.
One possibility is that the lack of batch-training makes it harder for GNG-U to learn without a strong recency bias. Learning rate is normally balanced against batch size to enable effective training. With our GNG-U implementation we only have one of the two levers to control (learning rate, vs learning rate and batch size).
Another likely possibility is that the pooling functions tested did not effectively capture the available information in the GNG. However, we tried a variety of obvious functions given the available data, attempting to balance signal and noise, without success.
Most likely the root cause of the representational limit is the nature of the GNG-U algorithm which requires a fixed (winner-take-all) sparsity and does not variably distribute representational responsibility between cells like a k-sparse autoencoder does. On the face of it, this seems like a fundamental constraint that limits the flexibility of the approach and might explain the reduced classification performance.
We performed ad-hoc tests of the nonstationary capabilities of GNG-U by initially withholding half the digit classes and confirmed it works as advertised; the algorithm returns to optimal classification performance after all digit classes are restored.
So, can convolutional Growing-Neural-Gas (representing a family of competitive learning algorithms) compete with modern sparse autoencoders? No.
Source code for the implementations described here can be found in our Java repository on Github:
David Rawlinson and Gideon Kowadlo wrote the Java code for the algorithm implementations, executed and supervised experiments. Abdelrahman Ahmed set up and executed many of the experiments, and analyzed results.
Fritzke1997 – “A self-organizing network that can follow non-stationary distributions”. Bernd Fritzke. Proceedings of International Conference on Artificial Neural Networks, p. 613-618 (1997)
Makhzani2013 – “k-Sparse Autoencoders”. Makhzani, Alireza and Frey, Brendan. arXiv preprint arXiv:1312.5663 (2013)
Makhzani2015 – “Winner-Take-All Autoencoders”. Makhzani, Alireza and Frey, Brendan J. Advances in Neural Information Processing Systems (2015)
This was an interesting investigation, but I don’t think you should be too disappointed with the result. As far as I can see, you are comparing an algorithm that you have chosen with certain constraints in mind, against an algorithm that was found to perform well without regard to these constraints.
Trying to be competitive with specialised algorithms on the very lowest rank of a cognitive architecture is not the way to make progress. Find something that works well enough and build on that.
Hi Phille – thanks for the encouragement! I think after some reflection our key takeaway is this insight: ‘[Conv-GNGU]… does not variably distribute representational responsibility between cells like a k-sparse autoencoder does’.
We recently tested a STDP/Oja learning rule that learns complete digit forms due to its Hebbian-like update rule. Like GNGU, it cannot variably distribute representational responsibility between cells. It worked pretty good but topped out at 97.something% – just like GNGU. This further supports my hunch about the cause.
We’re looking to add some online learning capabilities to the k-sparse AE as a way to move forward with more sophisticated architectures.
Two thoughts: Have you considered adding a lateral consensus layer? Maybe that would better distribute representational responsibility. And there is also the possibility that you actually want more specialised cells, that might be advantageous if you add some kind of sequence memory. Anyway, I’m looking forward to your move towards more sophisticated architectures.
To expound on that second thought: The interesting thing about SDRs is that they can be semantically decomposable. But if you optimise for information content you don’t get SDRs and you don’t get semantically decomposable representations. Instead every cell will contribute what it can to every semantic piece of information that might be of interest. So you get dense reps where single cells don’t “mean” anything. The k-sparse AE might just create a truncated version of these un-decomposable reps. The reason why I think that this might be suboptimal for sequence memory is that you need a lot of synapses to get a piece of information out of a rep, if that piece of information is distributed among all active cells.
Phille I think we should have this discussion over at reddit.com/r/BioAGI if you’re up for it? We’ve decided to move our group from Slack (which is private) to Reddit – and you’re welcome to join in!
My intuitition is that the way that backprop (even locally, within a single layer) attributes error and therefore weight changes to individual cells and synapses means that it IS decomposable. This is quite unlike most other local, unsupervised training regimes and I’m just starting to appreciate it. There isn’t much discussion on this in the literature (that I’ve seen)
If you start posting on reddit whenever a new blogpost is up, I’d be happy to comment there in the future. I’m a little annoyed with the non-existent line break functionality on this website anyway … 😉
Yep, the theme is a bit limited! I’ve posted a link to the articles here:
Feel free to continue, and we’ll post future articles on Reddit too.