Executive Control is core to what most people recognise as true intelligence: for example, the ability to attend to relevant cues and maintain task-dependent information whilst ignoring distracting details and taking appropriate actions.
Working Memory (WM) is a core component of Executive Control. “Working memory is a short-term repository for task-relevant information that is critical for the successful completion of complex tasks” (Baddeley 2003).
Capabilities are often thought of as distinct modules, e.g. WM, executive control, semantic memory, and episodic memory. I’m in favour of the idea that they are emergent properties of the system, arising from interactions between differentially specialised regions (explored in Mansouri et al. 2015). Good models for WM are likely to hold important clues for Executive Control itself, and therefore provide a blueprint for building it into artificial agents.
PBWM – Biological Model
A popular and comprehensive model for WM is PBWM – Prefrontal cortex basal ganglia working memory. It accounts for biological details at a range of scales from the neuron to behaviour at the level of standard cognitive tasks on working memory. The tasks come in various forms. The common element of these tasks is a need to selectively remember previous perceptions and take actions based on them. Some examples include 1-2-AX, Stroop test, and a ‘Store Ignore Recall’ test.
- Prefrontal Cortex (PFC) consists of units that are gated.
- When a gate is open, information can flow in and out, and when the gate is shut, the information is maintained.
- Basal Ganglia/Thalamus (BG-TH) controls the gates in order to complete the task.
- BG learns via reinforcement learning.
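The gating behaviour in the list above can be sketched in a few lines. This is a minimal illustration, not the PBWM implementation: the `Stripe` class and the `store_ignore_recall` function are hypothetical names, and the gate signals are set by hand rather than learned by a BG model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Stripe:
    """Hypothetical PFC stripe: holds a vector until its gate is opened again."""
    size: int
    state: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.state:
            self.state = [0.0] * self.size

    def step(self, inputs: List[float], gate_open: bool) -> List[float]:
        # Gate open: new information flows in and replaces the contents.
        # Gate shut: the previous contents are maintained unchanged.
        if gate_open:
            self.state = list(inputs)
        return self.state


def store_ignore_recall(sequence, gate_signals):
    """Run one stripe over a sequence; gate_signals stands in for the BG/TH output."""
    stripe = Stripe(size=len(sequence[0]))
    history = []
    for x, gate_open in zip(sequence, gate_signals):
        history.append(list(stripe.step(x, gate_open)))
    return history


sequence = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
gates = [True, False, False]  # store the first item, ignore the rest
history = store_ignore_recall(sequence, gates)
print(history[-1])  # [1.0, 0.0]
```

The first item survives both distractors because the gate stays shut after it is stored. In PBWM the equivalent of `gates` would be produced by the BG, which has to learn when to open and shut.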
PBWM and LSTM from applied ML
LSTM is a standard ML algorithm for time series modelling. It’s a recurrent neural network, but the units have input, output and forget gates. See this blog for a good primer. Gates in LSTM allow selective remembering of parts of the history.
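As a reminder of how the gates work, here is a minimal sketch of one time step of a single LSTM unit, using the standard LSTM update equations with scalar input and state. The function name `lstm_unit_step` and the layout of the weight dictionary are choices made for this illustration, not any library's API.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_unit_step(x, h_prev, c_prev, w):
    """One step of a single LSTM unit with scalar input, hidden and cell state.

    w maps a gate name to (input weight, recurrent weight, bias).
    """
    def net(name):
        wx, wh, b = w[name]
        return wx * x + wh * h_prev + b  # weighted sum, giving a scalar

    i = sigmoid(net("input"))        # input gate: how much new content to admit
    f = sigmoid(net("forget"))       # forget gate: how much old content to keep
    o = sigmoid(net("output"))       # output gate: how much content to expose
    g = math.tanh(net("candidate"))  # candidate content
    c = f * c_prev + i * g           # updated cell state
    h = o * math.tanh(c)             # unit output
    return h, c


# Biases chosen so that the input gate is almost shut and the forget gate
# almost fully open: the cell should hold on to its previous value.
weights = {
    "input": (0.0, 0.0, -10.0),
    "forget": (0.0, 0.0, 10.0),
    "output": (0.0, 0.0, 10.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = lstm_unit_step(x=1.0, h_prev=0.0, c_prev=0.8, w=weights)
print(round(c, 3))  # 0.8
```

With the input gate shut and the forget gate open, the cell state is carried forward almost unchanged, which is the "selective remembering" described above.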
PBWM is considered to be analogous to LSTM, “but more biologically explainable”. The first author of PBWM, O’Reilly, stated that LSTM was the closest comparable algorithm, and Graves (an early contributor to LSTM research) considered LSTM to be ‘biological’ due to similarities in function. I’ll expand on their equivalence shortly.
The relationship is only ever discussed as a loose analogy, but it’s interesting to delve a little deeper, compare the two side by side, and see where it takes us.
In LSTM, gating occurs at the level of the neuron, so we call this the LSTM ‘unit’. It’s assumed that the reader is familiar with LSTM neurons.
In PBWM, the gating occurs at the level of the ‘stripe’, otherwise referred to as a neocortical macrocolumn (an old blog article of ours here gives an overview, and it’s expanded upon later in this article). Furthermore, there are two types: maintenance stripes and output stripes. In PBWM, they are paired. Together, they form the PBWM ‘unit’ with an input and output gate, analogous to the input and output gate of the LSTM ‘unit’.
The input gate allows information into the unit to be maintained, and the output gate routes the information out to have an effect on other areas: back to the BG to influence learning and goal setting, and to motor actions (i.e. to make external choices in the task).
Maintenance occurs due to cortico-cortical recurrence and thalamo-cortical recurrence. Activity is maintained collectively in all the neurons of the stripe.
Figure: LSTM and PBWM units side by side, showing the equivalence and differences. In the LSTM unit, the net input is a weighted sum that results in a scalar value. In the PBWM unit, the input is a vector that is passed into the unit if the input gate is open.
Comparing units directly
The PBWM unit is more distributed compared to the LSTM unit. The gates are located in the BG/Thalamus, and input/output occurs across the pair of stripes. There’s no separate forget gate in PBWM. Another difference is that the LSTM unit corresponds to one scalar value, whereas the PBWM unit holds some world ‘input’ state represented as a sparse vector. This is what it received as input, as opposed to LSTM, where the net input is a weighted sum.
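To make the contrast with a scalar LSTM unit concrete, here is a rough sketch of a PBWM-style unit as described above: a maintenance stripe paired with an output stripe, holding a whole sparse vector and with no forget gate. The class `PBWMUnit` and its method names are hypothetical, and the gate signals are passed in by hand where PBWM would have the BG/Thalamus supply them.

```python
from typing import List


class PBWMUnit:
    """Hypothetical PBWM 'unit': a maintenance stripe paired with an output stripe."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.maintained: List[float] = [0.0] * size  # maintenance stripe contents

    def step(self, inputs: List[float], input_gate: bool, output_gate: bool) -> List[float]:
        # Input gate open: the whole input vector is copied in, not a weighted sum.
        if input_gate:
            self.maintained = list(inputs)
        # Output gate open: the output stripe exposes the maintained vector.
        # Output gate shut: nothing is routed out (modelled here as zeros).
        if output_gate:
            return list(self.maintained)
        return [0.0] * self.size


unit = PBWMUnit(size=4)
cue = [0.0, 1.0, 0.0, 0.0]         # sparse vector to remember
distractor = [1.0, 0.0, 0.0, 1.0]  # should be ignored

out_store = unit.step(cue, input_gate=True, output_gate=False)
out_ignore = unit.step(distractor, input_gate=False, output_gate=False)
out_recall = unit.step(distractor, input_gate=False, output_gate=True)
print(out_recall)  # [0.0, 1.0, 0.0, 0.0]
```

Note that there is no forget gate: the only way to change the contents is to open the input gate and overwrite them.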
Comparing network of units
Both LSTM and PBWM are made up of layers of units. The LSTM units directly connect to all other units in the layer. In contrast, the PBWM units do not directly interconnect. They do connect via the BG/TH, but only to determine gating.
The comparison is summarised below.
- Both consist of multiple units that have input and output gates and can maintain information.
- There is no forget gate in PBWM.
- In LSTM, the gates are switched by recurrent inputs, and in PBWM by the BG/TH.
- LSTM learns through ‘backpropagation through time’ (BPTT), whereas PBWM learns using Reinforcement Learning (RL) and only requires a scalar (one-dimensional) reward.
- LSTM units input and output a single scalar value, PBWM a vector.
- LSTM units are fully interconnected in terms of gating and information flow, whereas PBWM units interconnect via the BG for gating only.
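To illustrate what learning a gating policy from a scalar reward can look like, here is a toy sketch. It is a simple epsilon-greedy value-learning stand-in, not PBWM's actual learning rule, and the task (a target followed by a distractor, with a reward if the target is still in memory at the end) and all names are assumptions made for this example.

```python
import random


def train_gating_policy(trials=500, lr=0.1, epsilon=0.2, seed=0):
    """Learn when to open a single gate, using only a scalar end-of-trial reward."""
    rng = random.Random(seed)
    # q[kind][action]: estimated reward for shutting (0) or opening (1) the gate
    q = {"target": [0.0, 0.0], "distractor": [0.0, 0.0]}
    for _ in range(trials):
        memory = None
        taken = []
        for kind in ("target", "distractor"):
            if rng.random() < epsilon:
                action = rng.randrange(2)  # explore
            else:
                action = 0 if q[kind][0] >= q[kind][1] else 1  # exploit
            if action == 1:
                memory = kind  # gate open: the stimulus overwrites memory
            taken.append((kind, action))
        # A single scalar reward, delivered only at the end of the trial.
        reward = 1.0 if memory == "target" else 0.0
        for kind, action in taken:
            q[kind][action] += lr * (reward - q[kind][action])
    return q


q = train_gating_policy()
policy = {kind: ("open" if values[1] > values[0] else "shut") for kind, values in q.items()}
print(policy)
```

No gradients are propagated back through time and no labels are needed: the same scalar reward is credited to every gating decision taken in the trial.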
Cracks in the analogy?
These look pretty similar at a high level, which is a nice result. But hold on, look at the differences! The most striking difference is that the PBWM unit gates a vector, as opposed to a scalar value. That’s because the unit corresponds to a neocortical macrocolumn, which itself contains a network of neurons. In other words, each gate in PBWM controls far more representational complexity than a gate in LSTM.
Now consider some extensions to PBWM to make it more biologically detailed and call it PBWM 2.0. We’ll start by looking at a few important features of the macrocolumn:
- ~100 minicolumns, each possessing a group of neurons that respond to a common receptive field.
- There is a high degree of interconnectivity within the macrocolumn.
- There is a low degree of interconnectivity between macrocolumns.
- However, there are clusters of macrocolumns that have a higher degree of interconnectivity within the cluster (mentioned but not explicitly modelled in PBWM).
There is good evidence that the minicolumns collectively model sequences (see here from Numenta, and a recent article by Max Bennett), which goes beyond the simple PBWM definition of a macrocolumn. Therefore, the unit of functionality is sequence modelling.
PBWM 2.0 is gating whole sequences. Therefore, it’s a form of hierarchical time-series modelling, or put another way, a model of sequences of sequences, within a single cortical layer (together with BG-TH).
Let’s call the higher-level sequences the ‘outer model’ and the finer-grained sequences the ‘inner model’. It is even easier to see how PBWM 2.0 may build the outer model if you explicitly include the direct interconnectivity between stripes in a cluster.
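A toy sketch may help to picture the idea of gating whole sequences. Everything here is hypothetical: `SequenceStripe` uses a simple successor table as a stand-in for the inner model, and the gate signals are supplied by hand in place of an outer model trained with RL.

```python
class SequenceStripe:
    """Hypothetical stripe whose content is a learned sequence, not a static vector."""

    def __init__(self):
        self.transitions = {}  # inner model: symbol -> next symbol

    def learn(self, sequence):
        # Unsupervised: simply record which symbol follows which.
        self.transitions = {a: b for a, b in zip(sequence, sequence[1:])}

    def replay(self, start, steps):
        out = [start]
        for _ in range(steps):
            nxt = self.transitions.get(out[-1])
            if nxt is None:
                break
            out.append(nxt)
        return out


def run_outer(sequences, gate_signals):
    """The gate decides which whole sequences the stripe is allowed to take in."""
    stripe = SequenceStripe()
    for seq, gate_open in zip(sequences, gate_signals):
        if gate_open:
            stripe.learn(seq)
    return stripe


stripe = run_outer([["A", "B", "C"], ["X", "Y", "Z"]], [True, False])
print(stripe.replay("A", 2))  # ['A', 'B', 'C']
print(stripe.replay("X", 2))  # ['X']
```

The gated sequence can be replayed from its start, whereas the sequence presented while the gate was shut leaves no trace.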
We’ve done work on a biologically plausible time-series modelling algorithm RSM that is analogous to a macrocolumn here and here. And we have a Request for Research project (Using Reinforcement Learning to discover attentional strategies) that combines RSM with RL attentional filtering, which is beautifully congruent with the ideas of PBWM.
There are variations of hierarchical LSTMs. It’s common practice to stack LSTM layers, and there are nested LSTMs where each gated neuron contains another gated neuron (Moniz et al. 2018). A single nested LSTM neuron is depicted below, followed by a comparison of the topology of different hierarchical architectures, both reproduced from Moniz.
Isn’t this like one of these hierarchical LSTMs? Maybe, but there are material differences. The topology is different to Stacked or Nested recurrent networks like the LSTM versions above. In addition, this PBWM style architecture does not require BPTT or labels. In summary:
- The outer model is trained with RL, and its state can be maintained.
- The inner model is unsupervised, and does not have a known ability to hold state.
Nothing in the real world is static, so maybe it is just that the ‘states’ in WM are in effect sequences, and this machinery makes that possible. But it could be much more. The power of gated neurons is evident in the success of LSTM. Selectively attending to past sequences stands to be even more effective.
A lot more thinking needs to be done. Can we make a hypothesis for the emergent behaviour? How does that match behavioural observations? There are more questions besides, but it’s intriguing from both a PBWM and an ML perspective.
The added complexity of the stripes as sequence predictors may have behavioural implications not accounted for in the current PBWM. My intuition is that these differences are more important when you consider the model accounting for executive control in general, as opposed to a strict definition of WM as a task specific buffer.
Does this give us any new inspiration for ML architectures? I vote yes. It’s a novel architecture based on the prefrontal cortex, a central component of the most intelligent systems we know.
Other researchers have looked at the comparison of WM and LSTM. Pulver et al. 2017 identified these limitations of LSTM:
- “The memory-cell value decays exponentially due to the forget gate. This is important, since the network has a finite memory capacity and the most recent information is often more relevant than the older information. It may, however, be useful to decay that information in a more intelligent manner.
- Another point is that the information in the memory cells cannot be used without releasing it into the outer recurrence.”
They made modifications to LSTM which make it more like the gated WM of PFC, and closer to PBWM. So it may be a closer fit for the ‘outer model’.