Why is Go hard?
Go is hard because the search space of possible moves is so large that the tree-search and pruning techniques used to beat humans at Chess won’t work – or at least, they won’t work well enough, with a feasible amount of memory, to play Go better than the best humans.
Instead, to play Go well, you need to have “intuition” rather than brute search power: to look at the board and spot local (or gross) patterns that represent opportunities or dangers. And in fact, AlphaGo is able to play in this way. It beat the next best computer algorithm, “Pachi”, 85% of the time without any tree search – just predicting the best action based on its interpretation of the current state. The authors of the AlphaGo Nature paper say:
“During the match against Fan Hui, AlphaGo evaluated thousands of times fewer positions than Deep Blue did in its chess match against Kasparov; compensating by selecting those positions more intelligently, using the policy network, and evaluating them more precisely, using the value network—an approach that is perhaps closer to how humans play.”
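Selecting a move directly from a learned policy, with no look-ahead at all, can be sketched in a few lines. This is a toy illustration only: AlphaGo’s real policy network is a deep convolutional net over the 19×19 board, whereas `toy_policy` here is a hand-written stand-in on a 3×3 board.

```python
import math

def softmax(scores):
    """Convert raw policy scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def choose_move(board, score_fn):
    """Pick a move directly from the policy's output for the current
    state - no tree search, no look-ahead."""
    legal = [i for i, cell in enumerate(board) if cell == 0]
    scores = [score_fn(board, mv) for mv in legal]
    probs = softmax(scores)
    # Greedy selection: the single highest-probability legal move.
    return legal[max(range(len(legal)), key=lambda i: probs[i])]

# Stand-in "policy": in AlphaGo this is a trained deep network; here it
# is just a heuristic that favours the board centre (a toy assumption).
def toy_policy(board, move):
    return 1.0 if move == 4 else 0.0

board = [0] * 9                        # 3x3 board, 0 = empty
print(choose_move(board, toy_policy))  # -> 4 (the centre)
```

The point is the shape of the computation: one forward pass of the policy per position, rather than evaluating thousands of positions in a search tree.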
How does AlphaGo work?
AlphaGo is trained by both supervised and reinforcement learning. Supervised learning feedback comes from recordings of moves in expert games. However, these recordings are finite in size and, used naively, would lead to overfitting.
Instead, in AlphaGo a deep neural network first learns, via supervised learning and conventional deep learning techniques, to model and predict expert behaviour in the recorded games. Then a reinforcement learning network is used to generate reward data for novel games that AlphaGo plays against itself! This mitigates the limited size of the supervised learning dataset.
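The self-play idea – turning games AlphaGo plays against itself into fresh training signal – amounts to labelling every recorded position with the final outcome, seen from the perspective of the player who moved there. A minimal sketch, with toy position labels standing in for real board states:

```python
def label_selfplay_game(positions, winner):
    """Assign each recorded position the final outcome (+1 win, -1 loss)
    from the perspective of the player to move. The resulting
    (position, reward) pairs become extra training data, supplementing
    the finite expert dataset."""
    labelled = []
    for ply, pos in enumerate(positions):
        player = ply % 2              # players alternate: 0, 1, 0, 1, ...
        reward = 1 if player == winner else -1
        labelled.append((pos, reward))
    return labelled

# A toy 4-move game record won by player 0 (strings stand in for boards).
game = ["pos_a", "pos_b", "pos_c", "pos_d"]
print(label_selfplay_game(game, winner=0))
# [('pos_a', 1), ('pos_b', -1), ('pos_c', 1), ('pos_d', -1)]
```

Because self-play games can be generated endlessly, the training set is no longer bounded by the number of recorded expert games.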
Of course, AlphaGo also wants to play better than the best play observed in the training data. To achieve this, the reinforcement learning network is further trained by playing pairs of networks against each other – mixing the pairs up to prevent the policies from overfitting to each other. This is a really clever feature because it allows AlphaGo to go beyond its training data.
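The “mixing the pairs up” step can be sketched as an opponent-pool schedule: the current policy plays against a randomly chosen earlier snapshot of itself rather than always against its latest version. The pool sizes and snapshot interval below are illustrative assumptions, not AlphaGo’s actual hyperparameters.

```python
import random

def selfplay_schedule(num_games, snapshot_every, seed=0):
    """Pair the current policy (identified by game number) against a
    randomly chosen earlier snapshot, so the two players cannot
    overfit to one another's quirks."""
    rng = random.Random(seed)
    pool = [0]                        # snapshot ids of past policies
    pairings = []
    for game in range(1, num_games + 1):
        opponent = rng.choice(pool)   # a random past version as opponent
        pairings.append((game, opponent))
        if game % snapshot_every == 0:
            pool.append(game)         # freeze current policy into the pool
    return pairings

for game, opponent in selfplay_schedule(6, snapshot_every=2):
    print(f"game {game}: current policy vs snapshot {opponent}")
```

Each game always pits the current policy against a strictly earlier snapshot, which is the property that keeps the training opponents diverse.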
Note also that the neural networks cannot possibly fully represent a sufficiently deep tree of board outcomes within their limited set of weights. Instead, the network has to learn to represent good and bad situations with limited resources. It has to form its own representation of the most salient features, during training.
The neural networks function without pre-defined rules specific to Go; instead they have learned from training data collected from many thousands of human and simulated games.
AlphaGo is an important advance because it is able to make good judgments about play situations based on a lossy interpretation in a finitely-sized deep neural network.
What’s more, AlphaGo wasn’t simply taught to copy human experts – it went further, and improved, by playing against itself.
So, what doesn’t it do?
The techniques used in deep neural networks have recently been scaled to work effectively on a wide range of problems. In some subject areas, narrow AIs are reaching superhuman performance. However, it is not clear that these techniques will scale indefinitely. Problems such as vanishing gradients have been pushed back, but not necessarily eliminated.
Much greater scale is needed to get intelligent agents into the real world without them being immediately smashed by cars or stuck in holes. But already, it is time to consider what features or characteristics constitute an artificial general intelligence (AGI), beyond raw intelligence (which AIs now have).
AlphaGo isn’t a general intelligence; it’s designed specifically to play Go. Sure, it’s trained rather than programmed manually, but it was designed for this purpose. The same techniques are likely to generalize to many other problems, but they’ll need to be applied thoughtfully and retrained.
AlphaGo isn’t an agent. It doesn’t have any sense of self or intent, and its behaviour is fairly static – its policies would probably work the same way in all similar situations, learning only very slowly. You could say that it doesn’t have moods, or other transient biases. Maybe this is a good thing! But it also limits its ability to respond to dynamic situations.
AlphaGo doesn’t have any desire to explore, to seek novelty or to try different things. AlphaGo couldn’t ever choose to teach itself to play Go because it found it interesting. On the other hand, AlphaGo did teach itself to play Go…
All in all, it’s a very exciting time to study artificial intelligence!
by David Rawlinson & Gideon Kowadlo