MuZero: A new revolution for Chess?
DeepMind has defeated me again: I’ve spent another sleepless night trying to understand the technical and philosophical implications of MuZero, the successor of AlphaZero (itself the successor of AlphaGo). What is impressive and novel is that MuZero has no explicit knowledge of the game rules (or environment dynamics) and is still capable of matching the superhuman performance of AlphaZero (for which the rules had been programmed in the first place!) when evaluated on Go, chess, and shogi. MuZero also outperformed state-of-the-art reinforcement learning agents on Atari games, again without explicit rules. In this blog post, I want to share some reflections and notes, and then specifically discuss the possible impacts of MuZero on Chess.
MuZero sounds like a new breakthrough for artificial intelligence (a preprint is available). The work is much more general than AlphaZero and is not limited to two-player games with undiscounted terminal rewards, as in Chess. The applicability is broader: in chess you win/lose/draw at the end of the game, but in many real-world situations you receive intermediate rewards (as in Atari games). The key resides in a new approach and architecture in which knowledge of the rules of the game is not “hard-coded”: MuZero learns the rules itself through self-play. The rules of the game are not directly programmed into the search tree; MuZero manages its own, learned representation.
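To make this concrete, here is a minimal, purely illustrative sketch of the three learned functions described in the paper: a representation function h (observation → hidden state), a dynamics function g (hidden state + action → next hidden state and reward), and a prediction function f (hidden state → policy and value). The random linear layers and the placeholder reward head are my own simplifications; this only illustrates the interfaces, not how DeepMind actually implements or trains these networks.

```python
import numpy as np

class MuZeroModelSketch:
    """Toy stand-in for MuZero's three learned functions (h, g, f).

    Untrained random linear maps, for interface illustration only.
    """

    def __init__(self, obs_dim, action_space, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.action_space = action_space
        self.W_h = rng.standard_normal((hidden_dim, obs_dim)) * 0.1
        self.W_g = rng.standard_normal((hidden_dim, hidden_dim + action_space)) * 0.1
        self.W_f = rng.standard_normal((action_space + 1, hidden_dim)) * 0.1

    def representation(self, observation):
        # h: raw observation -> hidden (latent) state
        return np.tanh(self.W_h @ observation)

    def dynamics(self, state, action):
        # g: (hidden state, one-hot action) -> next hidden state, reward
        a = np.zeros(self.action_space)
        a[action] = 1.0
        next_state = np.tanh(self.W_g @ np.concatenate([state, a]))
        reward = float(next_state.sum())  # placeholder reward head
        return next_state, reward

    def prediction(self, state):
        # f: hidden state -> (policy logits, value)
        out = self.W_f @ state
        return out[:-1], float(out[-1])
```

Planning then happens entirely inside this learned latent space: the search tree unrolls g from the root state produced by h, and f scores each node, without ever consulting the real rules.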
At first glance, the absence of explicit rules is a bit counter-intuitive for Chess: after all, why deprive the program of knowledge of the rules? Rules (such as all possible moves in a given position) are quite easy to specify and implement… and without such rules a program can play (lots of) illegal moves. So why? The first reason is to obtain a more general solution (as mentioned above), beyond board games and Chess: knowing the environment dynamics is simply impossible in real-world domains like robotics, industrial control, intelligent assistants, or video games. The work thus opens new applications beyond Chess. Stochastic transitions are not yet handled (roughly speaking, games or situations in which uncertainty and randomness are part of the problem), but the results suggest an impressive potential.
The second reason is that MuZero masters its own learned representation. MuZero can internally organize the rules in whatever way leads to the most accurate results (e.g., in the search tree). The results on Go suggest “a deeper understanding of the position” than with AlphaZero. In a sense, the knowledge and biases of humans/experts about rules and dynamics are eliminated. A retrospective look at AlphaZero shows that many (human) decisions were made to manage and organize the rules. Hence the idea is to let MuZero do the job itself and globally optimize the whole reinforcement learning process, including the representation function.
Another quality of the approach is that MuZero brings more flexibility and is perhaps easier to implement: developers of reinforcement learning-based systems do not have to build a complete simulator or think about specific details of the rules. A promising direction is that it seems possible to quickly change the targeted rules/dynamics when a new problem arises (though it would require additional training). The experience with AlphaZero showed that developers’ knowledge was actually essential to its success: many tweaks, (hyper-)parameters, and domain-specific decisions were somehow incorporated. This is not a trivial task at all: for example, the Leela developers struggled to replicate the original experiment and reported differences in parameters and neural network architecture (https://blog.lczero.org/2018/12/alphazero-paper-and-lc0-v0191.html). With MuZero, human knowledge is lifted to another level of abstraction: experts no longer tune the search tree but focus on other problems (e.g., the number of simulations per search, the encoding of the dynamics function). It would deserve a dedicated assessment, but it might be much easier to engineer such learning-based systems.
The scenario in which you “just” have to program a teacher/oracle to obtain a superhuman player comes closer: it’s impressive. Yet, human knowledge is still needed and fundamental. As documented in the MuZero paper, parts of the solutions are game-specific and require adding prior knowledge. For example, “In Go and shogi we encode the last 8 board states as in AlphaZero; in chess we increased the history to the last 100 board states to allow correct prediction of draws.” There is a nice pseudo-algorithm provided (very good practice!) that gives the general idea in an elegant way; nevertheless, I suspect there are many technicalities to handle before obtaining the final solution. At some point in the paper, it is mentioned that “an image of the Go board” can be given as input. I agree it can lead to nice demonstrations, but we should not forget that the image still has to be properly encoded. Hence we are not yet in the self-supervised situation in which an agent can observe the world, process the instructions of a teacher, and magically encode everything to become world chess champion. MuZero does not claim to target such scenarios, yet we are getting closer and closer to comprehensive end-to-end learning.
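As an illustration of the kind of prior encoding the quote above refers to, here is a hedged sketch of stacking a history of board states into a single observation tensor. The constants (12 binary piece planes, oldest-first ordering, zero-padding of missing history) are common AlphaZero-style conventions I am assuming, not the paper’s exact specification.

```python
import numpy as np

BOARD = 8
PIECE_PLANES = 12  # 6 piece types x 2 colours (assumed convention)
HISTORY = 100      # chess history length mentioned in the paper

def encode_history(board_planes_list):
    """Stack the last HISTORY board states into one observation tensor.

    board_planes_list: list of (PIECE_PLANES, 8, 8) arrays, oldest first.
    Positions earlier than the available history are left as zeros.
    """
    obs = np.zeros((HISTORY * PIECE_PLANES, BOARD, BOARD), dtype=np.float32)
    recent = board_planes_list[-HISTORY:]
    for i, planes in enumerate(recent):
        obs[i * PIECE_PLANES:(i + 1) * PIECE_PLANES] = planes
    return obs
```

The point is simply that someone had to decide on this layout: the history length, the plane semantics, and the padding scheme are all human design choices that sit upstream of the “rule-free” learning.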
Back to reality and to a more concrete case: what is the impact of MuZero on Chess? On the one hand, I’m tempted to answer “little impact”. The paper mostly reports Go results, and the “only” insight we get is that MuZero reaches the same level as AlphaZero (Figure 2 of the paper). It is of course a very good result, but MuZero does not outperform AlphaZero. Furthermore, there are not many details about its quality of play or how many games led to a draw/win. I have been a bit frustrated here, though I must admit technical details about chess are reported. My impression is that computer engines already play close to optimally and most games end in a draw: the Elo difference will likely be negligible. AlphaZero was a true revolution for chess players, with the rediscovery of theoretical chess openings and an aggressive, brilliant, yet effective style of play. In a sense, it is hard to make a new revolution after a revolution.
On the other hand, MuZero offers interesting perspectives for Chess:
- It is a confirmatory study showing that one can apply reinforcement learning and Monte Carlo tree search to reach state-of-the-art level. It might even be easier to re-implement a super chess engine. I don’t know the current status of Leela, but I’m expecting some advances there based on insights from the paper. It can speed up the release of learning-based chess engines.
- Chess960: A nice feature of MuZero is that rules can be easily changed. Chess960 is an interesting variant of chess: the initial position is randomly chosen, castling rules differ, and the rest is similar (more details here: https://en.wikipedia.org/wiki/Fischer_random_chess). It seems “easy” to train MuZero and obtain, perhaps, a super engine for Chess960: no need to modify the search tree or employ specific heuristics. There is a variant of Stockfish that supports Chess960 (with manually defined heuristics): it can be a good baseline to compare against. Such a super engine could be a revolution for Chess960: MuZero could quickly find new chess openings for any initial position. I’m expecting a breakthrough in Chess960 theory through new strategies to open the game: grandmasters can hardly study all initial positions, and Chess960 is less studied than Chess. I have also long had the intuition that some initial positions are unfair and give White a (stronger) advantage: maybe MuZero can provide some evidence here. Interestingly, Chess960 can be more challenging than Chess for MuZero, since there are 960 possible (random) starting positions: a bit stochastic, and computationally expensive if you consider one starting position at a time. Ideally, MuZero would be able to “transfer” its knowledge from a sample of starting positions. We could also confront AlphaZero with MuZero on Chess960. Another interesting scenario is to take MuZero trained for chess and see how fast it can be transferred/adapted to Chess960.
- Chess variants: in general, there are many variants of the chess rules, such as Crazyhouse, Antichess, etc. Some chess players have found unique strategies and can outperform strong chess players unfamiliar with such variants. MuZero’s approach is well-suited to plugging in new rules, and perhaps it will show us novel strategies and styles we never thought of. An incredible breakthrough would be to master any chess variant thanks to a super transfer mechanism.
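As a side note on Chess960 itself, the 960 starting positions come from a simple constrained shuffle of the back rank: the two bishops must sit on opposite-coloured squares and the king must stand between the rooks. A small sketch of the standard generation procedure (my own implementation, not tied to any engine):

```python
import random

def chess960_back_rank(rng=random):
    """Generate one legal Chess960 back rank as an 8-character string.

    Constraints: bishops on opposite-coloured squares, king between
    the two rooks. `rng` can be a seeded random.Random for reproducibility.
    """
    squares = [None] * 8
    # Bishops on one light and one dark square
    squares[rng.choice(range(0, 8, 2))] = 'B'
    squares[rng.choice(range(1, 8, 2))] = 'B'
    # Queen on any free square
    free = [i for i, p in enumerate(squares) if p is None]
    squares[rng.choice(free)] = 'Q'
    # Two knights on any free squares
    free = [i for i, p in enumerate(squares) if p is None]
    for _ in range(2):
        n = rng.choice(free)
        squares[n] = 'N'
        free.remove(n)
    # Remaining three squares, left to right: rook, king, rook
    free = sorted(i for i, p in enumerate(squares) if p is None)
    for sq, piece in zip(free, 'RKR'):
        squares[sq] = piece
    return ''.join(squares)
```

Filling the last three squares in order automatically satisfies the king-between-rooks rule, which is why this procedure yields exactly the 960 legal positions.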
MuZero is yet another tour de force from DeepMind. No explicit rules, and self-play with smart sampling: elegant, simple, and even a bit frightening! The potential of MuZero now lies mostly outside Chess, since the approach is much more general and flexible. It’s hard to say, but MuZero’s results have arguably reduced Chess to a “common” benchmark. The good news is that MuZero will accelerate the maturation and understanding of learning-based chess engines. I have also sketched some challenges MuZero could tackle (e.g., playing exceptionally well at any chess variant, such as Chess960). It’s a truly exciting time, kudos to DeepMind!
PS: I am not an expert in model-based reinforcement learning; I have given a candid opinion based on my knowledge and understanding of chess and computer science. Feel free to clarify and discuss!