Video Prediction using Recurrent Sparse Memory

We recently presented two papers at the International Joint Conference on Neural Networks (IJCNN). The first is about one-shot learning for the long term (with an artificial hippocampal algorithm), blog here. In this blog article, we are excited to share the other paper: Learning distant cause and effect using only local and immediate credit assignment.

The paper first introduced the Recurrent Sparse Memory (RSM) architecture back in May 2019, and we previously blogged on how RSM learned to predict higher-order partially-observable sequences of handwritten digits. The unique feature of RSM is that it uses only local and immediate learning rules: it doesn’t require backpropagation through time or layers. Despite this, it is competitive with deep learning neural networks on several benchmarks.

Since then, we have worked on an improved variant of RSM in collaboration with Numenta, which significantly improved performance on the language modelling task and introduced a more complex higher-order sequence benchmark. This work was published in Artificial Neural Networks in Pattern Recognition (ANNPR) and From Neuroscience to Artificially Intelligent Systems (NAISys) in 2020.

One of the most exciting aspects of this project was utilising the generative capability of RSM and expanding it to spatiotemporal problems, such as multi-step video prediction and sequence generation. This is an important characteristic for mental simulation, which is a long-term goal for an agent-based model with such capabilities.

In this article, we’ll shed some light on the multi-step video prediction problem in the paper, specifically on the “bouncing balls” benchmark [1], and outline future directions for this research. Those directions include increased spatial complexity, multi-frame (self-looped) sequence generation and attentional mechanisms.

Frame-by-frame results on the bouncing-balls task. The input is a simulation of 3 billiard balls bouncing in a box. The red channel shows RSM predictions, and the blue channel shows the ground-truth simulation. Magenta (= red + blue) indicates a correct prediction. Top row: RSM in predictive mode from the start of a sequence; prediction errors are imperceptible after the 5th step. Bottom row: RSM+GAN in self-looped mode; although the dynamics are believable, the trajectories do diverge from the simulation.

Background – “Bouncing Balls” task

The bouncing balls task was first introduced by Sutskever et al. [1] and it consists of a physics simulation of three billiard balls bouncing around in a box, interacting with each other and the surrounding walls. The standard simulation lasts for 100 frames, and each frame is a 30×30 grayscale image (i.e. 900-dimensional input). The balls vary in direction and speed, transferring momentum to each other. No prior information about the physics model is provided to the algorithm – it must learn the dynamics simply by observing the video. 
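To make the input format concrete, here is a minimal sketch of a simulator that produces sequences of this shape. It is not the simulator from [1]: the function name, ball radius, speed range and starting positions are all assumptions chosen for illustration, and only the frame size, sequence length and number of balls come from the description above.

```python
import numpy as np

def simulate_bouncing_balls(n_frames=100, size=30, n_balls=3, radius=3.0, seed=0):
    """Return frames of shape (n_frames, size, size) with pixel values 0 or 1.

    Illustrative only: radius, speeds and start positions are assumed values.
    """
    rng = np.random.default_rng(seed)
    # Start the balls along the diagonal so that they never begin overlapping.
    centres = np.linspace(radius + 1, size - radius - 1, n_balls)
    pos = np.stack([centres, centres], axis=1)        # (n_balls, 2) as (x, y)
    vel = rng.uniform(-1.0, 1.0, size=(n_balls, 2))   # random direction and speed
    ys, xs = np.mgrid[0:size, 0:size] + 0.5           # pixel-centre coordinates
    frames = np.zeros((n_frames, size, size))
    for t in range(n_frames):
        pos += vel
        # Reflect off the walls by pointing the velocity back into the box.
        for d in range(2):
            low = pos[:, d] < radius
            high = pos[:, d] > size - radius
            vel[low, d] = np.abs(vel[low, d])
            vel[high, d] = -np.abs(vel[high, d])
        # Equal-mass elastic collisions: swap the velocity components that
        # lie along the line joining the two ball centres.
        for i in range(n_balls):
            for j in range(i + 1, n_balls):
                delta = pos[i] - pos[j]
                dist = np.linalg.norm(delta)
                if 0 < dist < 2 * radius:
                    normal = delta / dist
                    closing = np.dot(vel[i] - vel[j], normal)
                    if closing < 0:  # only when the balls are approaching
                        vel[i] -= closing * normal
                        vel[j] += closing * normal
        # Render each ball as a filled disc.
        for p in pos:
            disc = (xs - p[0]) ** 2 + (ys - p[1]) ** 2 <= radius ** 2
            frames[t] = np.maximum(frames[t], disc.astype(float))
    return frames

frames = simulate_bouncing_balls()
print(frames.shape)                       # (100, 30, 30)
print(frames.reshape(100, -1).shape)      # (100, 900): one 900-dimensional input per frame
```

The learning algorithm only ever sees the rendered frames; positions and velocities stay hidden inside the simulator.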

The use of this task as a benchmark was motivated by the surprisingly poor results achieved by artificial neural networks to date on video prediction tasks. Lotter et al. [2] created a model called PGN (Predictive Generative Network) and applied it to the bouncing balls task. They contrast this with the success of natural language processing: “Generating realistic samples for high dimensional images, particularly predicting the next frames in videos, has proven to be much more difficult.”

But predicting one frame ahead is not sufficient for many of our goals, such as long-distance planning conducted entirely in mental simulations of the world, without the need for constant adjustment by contact with reality. To satisfy these requirements, algorithms must learn to produce convincing predictions many steps ahead in self-looping mode. Cenzato et al. [5] comment:

“… We show that most of the models indeed obtain high accuracy on the standard benchmark of predicting the next frame of a sequence, and one of them even achieves state-of-the-art performance. However, all models fall short when probed with the more challenging task of generating multiple successive frames. Our results show that the ability to perform short-term predictions does not imply that the model has captured the underlying structure and dynamics of the visual environment, thereby calling for a careful rethinking of the metrics commonly adopted for evaluating temporal models.”

Cenzato et al. apply a number of convolutional LSTM architectures to the bouncing balls task.


We trained a convolutional, stacked RSM on the bouncing balls video-prediction task. The video shown above compares the learned dynamics in generative, self-looped mode against videos already available from Sutskever et al. [1]. RSM predictions are shown throughout the sequence, but RSM is primed for 50 frames with the real simulation before swapping to self-looped mode for 150 frames. The dynamics generated by RSM are clearly more “Newtonian”, with better conservation of momentum and more correct departure angles after interactions between balls and walls. In the video generated by the earlier work, the “balls” tend to stick to each other like cells under a microscope, and sometimes wander or start and stop.
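The switch from priming to self-looped mode can be sketched as a simple loop. The `predict_next(frame, state)` interface below is hypothetical and stands in for whatever trained model is used; it is not the API of the RSM code. The toy "model" in the usage lines just counts upwards, so that the bookkeeping is easy to follow.

```python
def rollout(predict_next, frames, n_prime, n_generate):
    """Prime a predictor on real frames, then let it run on its own output.

    predict_next(frame, state) -> (predicted_next_frame, new_state) is an
    assumed interface, used here only for illustration.
    """
    if n_prime < 1:
        raise ValueError("at least one priming frame is needed")
    state = None
    prediction = None
    outputs = []
    # Priming: the model is driven by ground-truth frames.
    for frame in frames[:n_prime]:
        prediction, state = predict_next(frame, state)
        outputs.append(prediction)
    # Self-looped: each prediction becomes the next input.
    for _ in range(n_generate):
        prediction, state = predict_next(prediction, state)
        outputs.append(prediction)
    return outputs

# Toy stand-in: "frames" are integers and the model predicts frame + 1.
toy_model = lambda frame, state: (frame + 1, state)
outputs = rollout(toy_model, list(range(50)), n_prime=50, n_generate=150)
print(len(outputs))   # 200
```

In self-looped mode any error in a prediction is fed straight back in, which is why this condition is so much harder than next-frame prediction.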

Quantitatively, we report the mean squared error (MSE) in next-frame prediction mode, a test condition for which a large number of results are available. We followed a training regime similar to that of the original paper, where the sequences are generated on the fly. The test scores are reported on a fixed set of 200 sequences to allow a like-for-like comparison with other methods.
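As a rough illustration of the metric, the sketch below computes a next-frame error and the trivial "Frame t-1" baseline, which simply predicts that nothing changes. One assumption to flag: errors are summed over pixels and averaged over frames. Several reported values exceed 1, which a per-pixel mean on images in [0, 1] could not do, but the exact convention used in each paper is not stated here. The function names are our own.

```python
import numpy as np

def next_frame_error(predictions, targets):
    """Squared error summed over pixels, then averaged over frames (assumed convention)."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    per_frame = ((predictions - targets) ** 2).reshape(len(targets), -1).sum(axis=1)
    return per_frame.mean()

def previous_frame_baseline(sequence):
    """Error of the 'Frame t-1' predictor: use frame t-1 as the prediction for frame t."""
    sequence = np.asarray(sequence, dtype=float)
    return next_frame_error(sequence[:-1], sequence[1:])

# A single lit pixel that switches on and then off: each transition changes one pixel.
sequence = np.zeros((3, 2, 2))
sequence[1, 0, 0] = 1.0
print(previous_frame_baseline(sequence))   # 1.0
```

A model is only useful if it beats this baseline, which is why the "Frame t-1" row appears alongside the learned models.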

Model                                   Next-frame MSE
PGN [2]                                 0.65 ± 0.11
DTSBN [4]                               2.79 ± 0.39
SRTRBM [3]                              3.31 ± 0.33
RTRBM [1, 3]                            3.88 ± 0.33
LSTM [5]                                111.09 ± 0.68
ConvLSTM [5]                            0.58 ± 0.16
Seq2seq ConvLSTM [5]                    1.34 ± 0.19
Seq2seq ConvLSTM multi-decoder [5]      4.55 ± 0.40
Frame t-1 [2]                           11.86 ± 0.27
Our Results
RSM 2L                                  3.72 ± 0.30
RSM 1L + GAN                            0.80 ± 0.07
RSM 2L + GAN                            0.41 ± 0.05
Frame t-1                               11.82 ± 0.27
RSM next-frame prediction error compared to previously reported results on the bouncing-balls video prediction task. The two-layer RSM with GAN rectifier improves on all prior results, despite RSM


Given that ball appearance and interaction dynamics are the same wherever they occur, we used RSM in convolutional mode. We found that two layers yielded better dynamics than one, especially in self-looped mode. The prediction of the next frame is output by the lower RSM layer. The upper RSM layer only influences the lower RSM layer, and does not directly drive the output. All RSM layers use local learning rules.

We added a Generative Adversarial Network (GAN) to “improve” the predictions output by RSM. The reason this is necessary is that RSM predicts “average” pixel values for the next frame, covering all the uncertainty. When fed back in, these “smoothed” predictions become increasingly uncertain (i.e. blurred appearance) and then break down – the training regime is simply not well suited to self-looped operation.

In natural language processing, a stream of generated text may be produced from a prediction by choosing the next word randomly, in proportion to words’ probability in the prediction distribution. We don’t feed in interpolated words that are a mix of words! We pick just one. 
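A minimal sketch of that sampling step, using only the Python standard library. The vocabulary and probabilities are made up for illustration.

```python
import random

def sample_next_word(distribution, rng=random):
    """Pick exactly one word, with probability proportional to its predicted weight."""
    words = list(distribution)
    weights = [distribution[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]

# Hypothetical prediction distribution over three candidate next words.
prediction = {"cat": 0.6, "dog": 0.3, "car": 0.1}
rng = random.Random(0)   # seeded only so that runs are repeatable
word = sample_next_word(prediction, rng)
print(word in prediction)   # True: the output is always one real word, never a blend
```

The analogue for video is to commit to one sharp frame rather than feeding back the pixel-wise average of all plausible frames.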

We initially tried using image processing filters to “sharpen” the RSM predictions and observed a dramatic improvement in the self-looped condition. But it’s difficult to achieve the desired results with simple filters. This inspired us to use a GAN to “sharpen” the RSM predictions, since producing high quality samples is what GANs are good at! Since we feed in the RSM prediction as the GAN “conditioning” input, and the GAN has no history or other context, the prediction is entirely determined by RSM. The GAN acts as a rectifier of the prediction, transforming it from pixels’ expected values over all futures into a single, sharp sample.


The architecture consists of two parts: a recurrent RSM component and a GAN component. Training happens in two phases. The RSM component is first trained on the sequences, and then used in inference mode during the training of the GAN component.

RSM + GAN architecture. See paper for notation and details. The GAN is used as a “rectifier” to generate a single specific sample from the “average” next-frame prediction generated by RSM. This combination of RSM + GAN seems particularly effective for generating long video sequences. 

Recurrent Component

The recurrent component consists of two convolutional RSM layers with bi-directional and recurrent connectivity. The RSM layers are trained independently and simultaneously. No gradients flow between layers, with the second layer’s hidden state integrated as feedback in the first layer. The objective for the RSM is to predict the next frame in the sequence, given the current frame and the recurrent state. The predicted frame is then provided to the Generative Component as the input to the generator. Gradients do not flow between the recurrent and generative components.

Generative Component

The generative component consists of a Generative Adversarial Network (GAN) which is trained independently from the recurrent component. Unlike the RSM, gradients flow between the layers of the generative component, following the standard GAN training framework.

The generator is a convolutional autoencoder consisting of 3 encoder layers and 3 decoder layers. The encoder layers use the LeakyReLU activation function, a kernel size of 5×5, filter counts of (64, 128, 256) and strides of (1, 2, 1) respectively. The decoder layers use a kernel size of 5×5, filter counts of (128, 64, 1) and strides of (2, 1, 1) respectively. The decoder layers also use LeakyReLU, except for the output layer, which uses a sigmoid activation function. The discriminator consists of a fully connected layer (128 units) with a LeakyReLU activation function, followed by a fully connected output layer with a sigmoid activation function.
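To check that these layer settings map a 30×30 frame back to a 30×30 frame, the sketch below walks the feature-map shapes through the generator. Two assumptions are not stated in the text: that every layer uses "same" padding, and that the decoder layers are transposed convolutions. Under other padding or upsampling choices the intermediate sizes would differ.

```python
import math

def conv_out(size, stride):
    """Output width of a 'same'-padded convolution (assumed padding)."""
    return math.ceil(size / stride)

def deconv_out(size, stride):
    """Output width of a 'same'-padded transposed convolution (assumed layer type)."""
    return size * stride

def generator_shapes(input_size=30):
    """Return the (height, width, channels) shape after each generator layer."""
    encoder = [(64, 1), (128, 2), (256, 1)]   # (filters, stride), from the text
    decoder = [(128, 2), (64, 1), (1, 1)]     # (filters, stride), from the text
    size, shapes = input_size, []
    for filters, stride in encoder:
        size = conv_out(size, stride)
        shapes.append((size, size, filters))
    for filters, stride in decoder:
        size = deconv_out(size, stride)
        shapes.append((size, size, filters))
    return shapes

for shape in generator_shapes():
    print(shape)
# (30, 30, 64)
# (15, 15, 128)
# (15, 15, 256)
# (30, 30, 128)
# (30, 30, 64)
# (30, 30, 1)
```

The single stride-2 encoder layer halves the resolution and the single stride-2 decoder layer restores it, so the output has the same shape as the RSM prediction it rectifies.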

We follow a training regime and parameterization similar to those of the PGN paper [2]. The generator is given the prediction from the RSM and trained using combined mean-square-error and adversarial losses. The hyperparameter lambda controls the influence of the adversarial loss.

total_gen_loss = gen_mse_loss + lambda * gen_adv_loss
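The sketch below evaluates this combined loss for a single example in plain Python. The form of the adversarial term is an assumption: it uses the common non-saturating generator loss, -log D(G(x)), where `disc_prob` is the discriminator's probability that the generated frame is real. The paper may define it differently, and no particular value of lambda is implied.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length sequences of pixel values."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def generator_loss(generated, target, disc_prob, lam, eps=1e-12):
    """total_gen_loss = gen_mse_loss + lambda * gen_adv_loss.

    gen_adv_loss is assumed to be -log(disc_prob), the non-saturating GAN
    generator loss; eps guards against log(0).
    """
    gen_mse_loss = mse(generated, target)
    gen_adv_loss = -math.log(max(disc_prob, eps))
    return gen_mse_loss + lam * gen_adv_loss

# A perfect reconstruction that the discriminator is unsure about (D = 0.5):
# only the adversarial term contributes, scaled by lambda.
loss = generator_loss([0.0, 1.0], [0.0, 1.0], disc_prob=0.5, lam=1.0)
print(abs(loss - math.log(2)) < 1e-12)   # True
```

With lambda at 0 the generator is trained as a plain autoencoder; increasing it pushes the output towards samples the discriminator accepts as real.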


Although this is a relatively simplistic benchmark, RSM’s success here is exciting. While we demonstrated state-of-the-art performance in predictive mode, the self-looped dynamics are particularly satisfying and convincing.

Cenzato et al. [5] suggest an extension to the bouncing balls task involving a 30-step prediction. We considered this, but ultimately felt that it is not an appropriate measure because ball positions diverge rapidly given only tiny trajectory errors. It would be better to measure the plausibility of the generated physical model – for example, conservation of energy – rather than the precision of alignment to a particular sequence.
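One hypothetical way to score physical plausibility is sketched below. The paper does not define such a metric. The sketch assumes ball centres have already been extracted from the generated frames (itself a non-trivial step), that all balls have unit mass, and that frames are one time unit apart; it then measures how far the total kinetic energy drifts from its mean.

```python
def kinetic_energies(positions):
    """Total kinetic energy per time step from finite-difference velocities.

    positions: list over time of lists of (x, y) ball centres.
    Unit mass and unit time step are assumed.
    """
    energies = []
    for prev, curr in zip(positions, positions[1:]):
        energy = 0.0
        for (x0, y0), (x1, y1) in zip(prev, curr):
            energy += 0.5 * ((x1 - x0) ** 2 + (y1 - y0) ** 2)
        energies.append(energy)
    return energies

def energy_drift(positions):
    """Largest relative deviation of the kinetic energy from its mean (0 = conserved)."""
    energies = kinetic_energies(positions)
    mean = sum(energies) / len(energies)
    if mean == 0:
        return 0.0
    return max(abs(e - mean) for e in energies) / mean

# One ball moving at constant speed conserves energy; one that speeds up does not.
steady = [[(0.0, 0.0)], [(1.0, 0.0)], [(2.0, 0.0)]]
speeding = [[(0.0, 0.0)], [(1.0, 0.0)], [(3.0, 0.0)]]
print(energy_drift(steady))            # 0.0
print(energy_drift(speeding) > 0.0)    # True
```

A score like this rewards believable dynamics without penalising a generated sequence for drifting away from one particular ground-truth trajectory.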

From here, we aim to scale up this video-prediction system to simulations of arbitrary, high-dimensional data streams. We also hope to replace the GAN with a more “biologically plausible” component – the key seems to be collapsing a distribution over all possible futures, to a single sample from that distribution.

Longer term, we aim to add attentional filtering to this architecture. This will allow self-looped predictions to focus on particular events and features in isolation. Our eventual goal is planning in hierarchical mental simulations learned from high-dimensional data.


  1. I. Sutskever, G. E. Hinton, and G. W. Taylor, The recurrent temporal restricted Boltzmann machine, in Advances in Neural Information Processing Systems, 2009, pp. 1601-1608.
  2. W. Lotter, G. Kreiman, and D. Cox, Unsupervised learning of visual structure using predictive generative networks, arXiv preprint arXiv:1511.06380, 2015.
  3. R. Mittelman, B. Kuipers, S. Savarese, and H. Lee, Structured recurrent temporal restricted Boltzmann machines, in International Conference on Machine Learning, 2014, pp. 1647-1655.
  4. Z. Gan, C. Li, R. Henao, D. E. Carlson, and L. Carin, Deep temporal sigmoid belief networks for sequence modeling, in Advances in Neural Information Processing Systems, 2015, pp. 2467-2475.
  5. A. Cenzato, A. Testolin, and M. Zorzi, On the difficulty of learning and predicting the long-term dynamics of bouncing objects, arXiv preprint arXiv:1907.13494, 2019.