
Literature Review: ‘A Distributional Perspective on Reinforcement Learning’

This article assesses the research paper ‘A Distributional Perspective on Reinforcement Learning’ by Marc G. Bellemare, Will Dabney and Rémi Munos, published in the proceedings of the 34th International Conference on Machine Learning (ICML) in 2017. The paper is assessed on several criteria. Firstly, its content is assessed by providing background information and describing the methods and findings of the paper. Secondly, the novelty and innovation of the paper are described. Thirdly, the technical quality is examined. Finally, possible applications of the research and suggested improvements are identified.


Bellemare et al.’s paper is well-researched and offers new insights in the field of reinforcement learning. It shows the approach the authors undertook to demonstrate that modelling the variation in the reward value, rather than the typical average value, results in improved accuracy and training performance in reinforcement learning systems. This section provides background information about reinforcement learning and the contents of the paper. It also examines the methods used by the authors, and their findings.


Reinforcement learning (RL) is an area of machine learning that is inspired by psychological and neuroscientific perspectives on animal behaviour [3]. It is the problem of getting an agent to act in an environment in such a way as to maximise its rewards, by predicting the long-term impact of its actions. Unlike supervised learning, the agent is not given labelled data to indicate whether its action is correct for a specific scenario [4]. In a typical reinforcement learning system, the algorithm predicts the average reward value it receives from multiple attempts at a task [2].

Bellemare et al. argue that ‘randomness is something we encounter everyday and has a profound effect on how we experience the world’, and that these variations should be accounted for when designing algorithms [2]. They show that it is possible to model these variations in the reward the agent receives, termed the value distribution. The authors demonstrate that there is a variant of Bellman’s equation (Figure 1) which can predict the possible outcomes without aggregating them into an average value, allowing the agent to ‘model its own randomness’ [2].

Figure 1: Bellman’s equation describes the value (Q) in terms of expected reward and the expected outcome of the random transition (x, a) → (X′, A′), and the discount factor (γ) determines the importance of future rewards (Bellemare et al. 2017A)
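Since the figure itself is not reproduced here, the two equations it refers to can be written out as follows. This uses the notation of the caption; the symbol Z for the random return (whose expectation is Q) follows the paper, and the second line holds as an equality in distribution rather than in value.

```latex
% Classical Bellman equation: the value is an expectation
Q(x, a) = \mathbb{E}\, R(x, a) + \gamma\, \mathbb{E}\, Q(X', A')

% Distributional Bellman equation: equality in distribution
Z(x, a) \overset{D}{=} R(x, a) + \gamma\, Z(X', A')
```

Taking expectations of both sides of the second equation recovers the first, which is why the distributional form can be seen as a strict generalisation.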


Bellemare et al. begin by providing background information on value distributions, how distributional perspectives have been applied in other works, and the benefits of applying them to reinforcement learning. The authors then evaluate the theoretical results of the distributional equations and how they can be applied. Based on the theoretical results and other works, they propose and implement a new algorithm based on the distributional variant of Bellman’s equation, where the average reward value output of a Deep Q-Network (DQN) agent is replaced with a distribution over a set of possible values, or atoms, the number of which can be adjusted. Bellemare et al. assess the performance of the algorithm on Atari 2600 games in the Arcade Learning Environment (ALE). By varying the number of atoms and evaluating the training performance of the algorithm, they find that 51 atoms gives the best results (Figure 2). The performance is also evaluated against a typical DQN agent, where it attained state-of-the-art performance [1].
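To make the idea of replacing a single value output with a distribution over atoms more concrete, here is a minimal NumPy sketch. It is an illustration rather than the authors' implementation: the support range of [-10, 10] is the setting I recall from the paper and should be treated as an assumption, and the function names are hypothetical. The "network output" is simply an array of logits with one row per action.

```python
import numpy as np

# Assumed settings: 51 atoms evenly spaced on [-10, 10].
N_ATOMS, V_MIN, V_MAX = 51, -10.0, 10.0
atoms = np.linspace(V_MIN, V_MAX, N_ATOMS)  # fixed support z_1..z_N


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def q_values(logits):
    """Turn per-action logits of shape (n_actions, N_ATOMS) into expected returns."""
    probs = softmax(logits)   # one categorical distribution per action
    return probs @ atoms      # expectation of each distribution


def greedy_action(logits):
    """Pick the action whose value distribution has the highest mean."""
    return int(np.argmax(q_values(logits)))
```

The agent still acts greedily with respect to the mean, as a DQN agent would; the difference is that the full distribution is learned and kept, not just its average.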

Figure 2: Varying the number of atoms in the distribution in a number of games, showing the average scores are improving (Bellemare et al. 2017A)

The figures below visualise the typical value distributions observed in Bellemare et al.’s experiments. The value distributions demonstrate how the agent distinguishes the safe actions from the losing actions, where safe actions have similar distributions and the losing actions are assigned low or zero probability. For example, Figure 3 shows that the agent assigned a probability of zero to the three actions that will lead to the agent losing the game by firing its laser too early.

Figure 3: Agent playing the Atari 2600 game, Space Invaders. It shows a typical learned value distribution where the different colours indicate the different actions used in the game. Noop means no operation (Bellemare et al. 2017A)

Figure 4: Agent playing the Atari 2600 game, Q*bert. Top, left and right: Predicting which actions are not recoverable. Bottom-Left: The distribution shows costly consequences for performing the wrong actions. Bottom-Right: The distribution shows the agent has made a big mistake (Bellemare et al. 2017A)


Bellemare et al. state that ‘the distributional update keeps separated the low-value, “losing” event from the high-value, “survival” event, rather than average them into one (unrealizable) expectation’, which shows why their approach is more successful [1]. Bellemare et al. achieved state-of-the-art results using the 51-atom agent (C51) across the suite of Atari 2600 games. They found that it significantly outperformed other algorithms, such as the DQN agent, and surpassed the current state-of-the-art results by a large margin. Figure 5 shows that the training performance of C51 surpasses the performance of a fully trained DQN and a human player by a wide margin. It achieved 75% of a trained Deep Q-Network performance in 25% of the time [1].
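The point about an ‘unrealizable’ expectation can be illustrated with a toy example. The numbers below are hypothetical and not taken from the paper: an agent that either loses (return 0) or survives (return 10) with equal probability has an average return that it never actually receives.

```python
import numpy as np

# Hypothetical two-outcome return: "losing" (0) versus "survival" (10).
outcomes = np.array([0.0, 10.0])
probs = np.array([0.5, 0.5])

# The expectation collapses both events into one number.
expected = float(probs @ outcomes)

# The mean is not one of the returns the agent can actually observe.
is_realizable = bool(np.any(outcomes == expected))
```

A distributional agent keeps the two modes and their probabilities, so the information about the losing event is not lost in the averaging.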

Figure 5: Performance comparison between the new algorithm (C51), the original DQN and a human player on the set of Atari 2600 games (Bellemare et al. 2017A)

They also observed surprising randomness in the Atari 2600 games, although the underlying emulator, Stella, is completely predictable. This apparent randomness is attributed to partial observability, where the agents cannot accurately predict when their score will increase. These findings highlight the limitation of the agent’s understanding; however, this limitation does not directly affect the performance.

Bellemare et al.’s new algorithm also managed to exceed the performance of other state-of-the-art algorithms, such as Double DQN, by a wide margin and achieved remarkable results in a gamut of Atari 2600 games. The C51 agent obtained a mean score improvement of 126% and a median of 21.5% on a normalised score scale with respect to random and DQN agents, confirming the advantages and benefits of C51 and the value of the distributional perspective [1].

Figure 6: The per-game improvement percentage of the C51 agent compared to a Double Deep Q-Network agent (Bellemare et al. 2017A)

In addition to outperforming a standard Deep Q-Network (DQN) agent, the C51 agent showed significant improvement over a Double DQN agent as well. Double DQN is an improvement upon the original Deep Q-Network which uses Double Q-Learning to reduce the agent’s overestimation of values, achieving state-of-the-art results against the DQN [5]. Figure 6 shows a significant percentage improvement of the C51 agent against the Double DQN agent per Atari game.
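The difference between the DQN and Double DQN targets mentioned above can be sketched in a few lines. This is a simplified illustration that ignores terminal states; the function and argument names are hypothetical, and the two input arrays stand in for the next-state action values produced by the online and target networks.

```python
import numpy as np


def dqn_target(reward, gamma, q_target_next):
    """Standard DQN: the target network both selects and evaluates the action."""
    return reward + gamma * np.max(q_target_next)


def double_dqn_target(reward, gamma, q_online_next, q_target_next):
    """Double DQN: the online network selects, the target network evaluates."""
    best_action = int(np.argmax(q_online_next))
    return reward + gamma * q_target_next[best_action]
```

Because the action is chosen by one set of estimates and scored by another, a value that is overestimated by the target network is less likely to be picked up by the maximisation, which is the overestimation reduction described above.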

Bellemare et al.’s paper provides evidence that the distributional perspective leads to improved performance and more stable reinforcement learning. This shows that the distributional perspective is more powerful than expected on the tested range of problems, and thus it is likely that it will be beneficial to other areas in machine learning.


Although a distributional perspective is not in itself a novel idea, it has only been used for specific purposes in reinforcement learning. Bellemare et al. note previous works that used a distributional perspective: Dearden et al. (1998) modelled parametric uncertainty and Morimura et al. (2010) designed risk-sensitive algorithms. Bellemare et al. believe it has an important role to play in reinforcement learning algorithms. Their findings demonstrate that applying a distributional perspective leads to improved performance in reinforcement learning. They stated that ‘it might just be the beginning for this approach’ and that there’s a possibility that ‘every reinforcement learning concept could now want a distributional counterpart’ [2].

Bellemare et al.’s work offers significant contributions to the field of reinforcement learning. They showed that it is possible and favourable to predict the potential outcomes rather than simply average them, by using a variant of Bellman’s equation. The implementation of their ideas did not require substantial changes to the existing Deep Q-Network architecture: the average reward value output is substituted with a distribution over 51 possible values, and the learning rule is updated to reflect the transition to the distributional counterpart of Bellman’s equation. This new architecture is called Categorical DQN.
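The updated learning rule can be sketched as follows. After the atoms are shifted by the reward and scaled by the discount factor, they generally no longer line up with the fixed support, so their probability mass is redistributed onto the neighbouring atoms. The code below is a minimal single-transition NumPy sketch of that projection step under my reading of the paper's pseudocode; the function name is hypothetical, and the handling of atoms that land exactly on the support is an added guard so that no mass is lost.

```python
import numpy as np


def project_target(next_probs, reward, gamma, atoms, done=False):
    """Project the distribution of reward + gamma * Z onto the fixed atoms.

    next_probs: probabilities over `atoms` for the chosen next action.
    Returns target probabilities over the same atoms (summing to 1).
    """
    n = len(atoms)
    v_min, v_max = atoms[0], atoms[-1]
    delta_z = (v_max - v_min) / (n - 1)
    discount = 0.0 if done else gamma

    # Shift and scale every atom, then keep it inside the support.
    tz = np.clip(reward + discount * atoms, v_min, v_max)
    b = np.clip((tz - v_min) / delta_z, 0, n - 1)  # fractional atom index
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)

    target = np.zeros(n)
    # Split each atom's mass between its two neighbours by proximity.
    np.add.at(target, lower, next_probs * (upper - b))
    np.add.at(target, upper, next_probs * (b - lower))
    # If an atom lands exactly on the support, both weights are zero.
    exact = lower == upper
    np.add.at(target, lower[exact], next_probs[exact])
    return target
```

In a full agent, the projected target would be compared with the predicted distribution for the current state and action using a cross-entropy loss, in place of the squared error on a single value used by DQN.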

Technical quality

Bellemare et al. provided supplementary material about the algorithm design, and figures about the evaluation and results. They included additional details and proofs of the distributional variant of Bellman’s equation, and supplementary videos showcasing the change in reward distribution as the agent is training across different games such as Space Invaders, Pong and Seaquest. Although the authors did not provide the source code, the algorithm details, equations, proofs and pseudocode in the paper are sufficient to implement the new algorithm by amending a DQN agent to use a value distribution for reward outcomes. Although the paper is relatively new, several implementations using TensorFlow, a machine learning library, and OpenAI Gym, a learning environment that includes a wide range of games including Atari, have been published on GitHub to replicate the results; however, at the time of writing, the results are incomplete.

Bellemare et al. conducted various experiments on the new algorithm and evaluated it against a typical Deep Q-Network (DQN) agent to compare their performance. They also adjusted the number of atoms in the new algorithm in an attempt to find the optimal number of atoms in the distribution. The algorithm exceeded the performance of the DQN agent and achieved state-of-the-art results in a number of classic Atari 2600 games.


Reinforcement learning has a wide range of applications. It has been in use for decades in playing games such as backgammon and checkers, robotics, elevator dispatching strategies and job scheduling [4]. More recently, DeepMind has introduced deep reinforcement learning by incorporating deep neural networks with Q-learning, a model-free reinforcement learning technique, to create a novel artificial agent, termed Deep Q-Network (DQN) to learn successful policies from high-dimensional inputs, such as pixels [3]. The DQN agent achieved state-of-the-art results compared to previous algorithms, and reached human-level performance across a set of 49 classic Atari 2600 games [3].
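For readers unfamiliar with Q-learning, the update at its core can be sketched in tabular form. This is the textbook rule rather than the deep variant used by DQN, which replaces the table with a neural network; the function name and default hyperparameters below are illustrative assumptions.

```python
import numpy as np


def q_learning_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning step in place and return the table.

    Q is an array of shape (n_states, n_actions). The update is model-free:
    it uses only the observed transition, not a model of the environment.
    """
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
    return Q
```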

Bellemare et al.’s contributions further develop deep reinforcement learning by replacing the Deep Q-Network agent’s average reward outcome with a distribution of reward values. Any existing reinforcement learning algorithm can be revised with a distributional perspective, in order to achieve improved accuracy and performance [2]. They also argue that ‘predicting the distribution over outcomes also opens up all kinds of algorithmic possibilities’. Appropriate action can be taken if the data observed is bimodal, i.e. takes on two possible values. For example, if train commute times are bimodal, we can check for train updates before leaving home to maximise the outcome. In addition, by modelling the distribution we can identify the safe choices when two of the choices have the same average value, by favouring the choice that varies the least. Predicting multiple outcomes has also been shown to improve the training performance of deep networks [2].
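The idea of favouring the choice that varies the least when averages are tied can be sketched as follows. The helper names and the example numbers are hypothetical; each choice is described by a small discrete distribution over returns.

```python
import numpy as np


def mean_and_std(values, probs):
    """Mean and standard deviation of a discrete return distribution."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    mean = float(probs @ values)
    std = float(np.sqrt(probs @ (values - mean) ** 2))
    return mean, std


def safest_among_best(choices, tol=1e-9):
    """Among the choices with the highest mean, return the least variable one.

    choices: dict mapping a name to a (values, probs) pair.
    """
    stats = {name: mean_and_std(v, p) for name, (v, p) in choices.items()}
    best_mean = max(mean for mean, _ in stats.values())
    tied = [name for name, (mean, _) in stats.items() if best_mean - mean <= tol]
    return min(tied, key=lambda name: stats[name][1])


# Two hypothetical choices with the same average return of 5.
choices = {
    "risky": ([0.0, 10.0], [0.5, 0.5]),
    "steady": ([4.0, 6.0], [0.5, 0.5]),
}
```

An agent that only tracks averages cannot tell these two choices apart, whereas one that models the distribution can prefer `"steady"`.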

Bellemare et al.’s work opened up additional possibilities and may just be the beginning of this new approach. The research work can be improved further by applying a distributional perspective to other algorithms in machine learning, where a multitude of outcomes may be more beneficial than an average outcome. Improvements in performance will highlight the necessity of taking randomness into consideration when designing algorithms in general. In addition to achieving better performance in this learning environment, it would be interesting to see the research work applied to a real world problem where the impact of the improved performance can be realised.


Bellemare et al.’s paper demonstrated the importance of accounting for randomness in algorithm design and how it can be implemented in reinforcement learning algorithms. It is recommended that those researching reinforcement learning refer to this paper for information on designing algorithms with a distributional perspective for better and more reliable reinforcement learning.


[1] Bellemare, M.G., Dabney, W. & Munos, R. 2017A, ‘A Distributional Perspective on Reinforcement Learning’, International Conference on Machine Learning 2017, PMLR, Sydney, Australia, pp. 449-458.

[2] Bellemare, M.G., Dabney, W. & Munos, R. 2017B, ‘Going beyond average for reinforcement learning’, DeepMind News & Blog, weblog, DeepMind, London, UK, viewed 15 August 2017, <>.

[3] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S. & Hassabis, D. 2015, ‘Human-level control through deep reinforcement learning’, Nature, vol. 518, no. 7540, pp. 529-533.

[4] Sutton, R.S. & Barto, A.G. 1998, Reinforcement learning: an introduction, MIT Press, Cambridge, MA.

[5] van Hasselt, H., Guez, A. & Silver, D. 2015, ‘Deep reinforcement learning with double Q-learning’, AAAI Conference on Artificial Intelligence, AAAI Press, Phoenix, Arizona, pp. 2094-2100.
