Distilling Stockfish into a Transformer
My thoughts on the "Grandmaster-Level Chess Without Search" paper from Google DeepMind
I recently came across this paper (Grandmaster-Level Chess Without Search) by Google DeepMind, and my interest was sufficiently piqued to warrant a close read. A first skim gave me the impression that the transformer learned to play chess by looking at a very large number of board positions plus the corresponding best move; I assumed the researchers fed FEN strings of the board before and after the move as input and output, respectively, to a feed-forward transformer network. However, the paper does not claim this at all and instead does something else entirely: it approximates the position-evaluation capabilities of Stockfish 16 (a chess engine that evaluates positions using tree search) with a feed-forward transformer network.
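For readers unfamiliar with the format: a FEN string is a one-line text encoding of a single board position. A quick sketch using the python-chess library (my own illustration, not code from the paper):

```python
import chess

board = chess.Board()
board.push_san("e4")

# A FEN string packs piece placement, side to move, castling rights,
# en passant square, and move counters into a single line of text.
print(board.fen())
```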
It is my opinion that this is not the same as learning how to play chess, but that is a topic for another day. First, I’ll briefly describe what the paper actually does before adding my thoughts. The dataset is prepared by randomly choosing board positions from games played on LiChess and using Stockfish to evaluate them. Pairs of [position, move] are also prepared and evaluated (about a billion or so of these). Three predictors are trained on this dataset: the action-value predictor predicts the evaluation given a position and a move, the state-value predictor predicts the evaluation given a position, and the behavioral-cloning predictor predicts the best move given a position.
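To make these three setups concrete, here is my mental model of their interfaces as Python stubs (illustrative only, not the paper's code); `predict_action_value` stands in for the trained transformer, and the greedy loop at the end is how I understand the action-value agent picks moves at play time:

```python
import chess

def predict_action_value(fen: str, move_uci: str) -> float:
    """Stand-in for the action-value transformer:
    predicted win% after playing `move_uci` in the position `fen`."""
    raise NotImplementedError

def predict_state_value(fen: str) -> float:
    """Stand-in for the state-value predictor: predicted win% of `fen`."""
    raise NotImplementedError

def predict_best_move(fen: str) -> str:
    """Stand-in for the behavioral-cloning predictor: Stockfish's pick in `fen`."""
    raise NotImplementedError

def pick_move(fen: str) -> str:
    """Greedy policy for the action-value agent: one forward pass per
    legal move, take the argmax -- no tree search at inference time."""
    board = chess.Board(fen)
    return max(
        (move.uci() for move in board.legal_moves),
        key=lambda uci: predict_action_value(fen, uci),
    )
```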
This whole process leads to a predictor that gets very close to approximating SF behaviour on board states, both seen and unseen. However, because the model does not use search during inference, it fails in very specific ways. One example discussed in the paper is blindness to impending three-fold repetition: this happens because the model has no record of past moves and treats each board position as an independent sample. The “without search” claim just pertains to the fact that the model doesn’t explicitly search through possible continuations during inference. But a criticism that I have seen and agree with is that using SF as an oracle (a fancy term which means they use it to annotate their dataset) means that the tree-search capabilities of Stockfish are used directly to build the training data. And there are cases where the model requires a hard-coded search in order to finish games. Some details about the game are hard-coded into the data: for example, the FEN strings contain information about whether castling is allowed. (Q: What would happen if PGN strings were used instead? A: Ask Nicholas Carlini, who built a LiChess bot that does next-move prediction with PGN strings.)
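A small illustration of the repetition blindness (my own, with python-chess, not from the paper): two knight shuffles bring the starting position back for a third time, so a repetition can be claimed, but the FEN fields a per-position model sees are indistinguishable from a fresh position:

```python
import chess

board = chess.Board()
start_fen = board.fen()

# Shuffle the knights out and back twice: the starting position now
# occurs for the third time (once initially, once after each shuffle).
for _ in range(2):
    for san in ["Nf3", "Nf6", "Ng1", "Ng8"]:
        board.push_san(san)

# The first four FEN fields (pieces, side to move, castling, en passant)
# are identical to the starting position; only the move counters differ.
print(board.fen().split()[:4] == start_fen.split()[:4])   # True
# The repetition is only visible from the move history, which FEN omits.
print(board.can_claim_threefold_repetition())              # True
```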
The one thing the paper does overclaim is Grandmaster-level play. The reported 2895 Elo is achieved via online blitz play, while Grandmaster norms are awarded for over-the-board classical games. The Elo calculation is also not consistent: the paper uses BayesElo anchored to LiChess ratings (which are themselves calculated via the Glicko-2 method). Technical details aside, the paper shows games where the agent misses a mate in three (#3) because it finds an alternative mating line that ends in five moves (#5). Human GMs don’t play this way, and most beginner endgame exercises are aimed at learning to identify quick mates. I think this follows a general trend of engines resorting to really suboptimal play in poor positions; GMs (or any chess player) don’t suddenly drop their level when the position gets bad.
In conclusion, I liked this paper a lot and I think it is written very well. The paper doesn’t claim that the agent learns the rules of chess, but it is definitely written in a way that suggests it does. The agent distills the evaluation capabilities of Stockfish 16, a tree-search-based chess engine, into a transformer; during inference, it outputs an evaluation given a board state. With scale, the model should at best match the capabilities of SF16, and I don’t see why it would be better than SF in any way. Practically, I don’t see it being deployed anywhere for analysis purposes, considering that inference would need GPUs and there won’t be any significant new ideas.
One implementation detail that I did not understand initially was the use of value binning. Even though the predictors output a real number (a winning percentage), the range (0-100%) is divided uniformly into $K$ bins, with $K$ being a design hyperparameter. This is done only for the action-value and state-value predictors. My theory is that value binning is used simply because a log-loss works better than MSE when training transformer networks, even though MSE feels more intuitive for a real-valued target. Another theory is that binning is better because a 97% winning percentage is not that different from a 90% winning percentage when it comes to chess: both positions would be annotated as “completely winning” by an expert chess player.
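Here is a minimal sketch of how I understand the binning mechanics (my own code; K = 128 is just an illustrative choice, not necessarily the paper's value): the win probability becomes a class index, so the network can be trained with a standard cross-entropy loss over $K$ classes instead of regressing the raw percentage:

```python
K = 128  # number of uniform bins; a design hyperparameter (128 is illustrative)

def win_prob_to_bin(p: float, k: int = K) -> int:
    """Map a win probability p in [0, 1] to a class index in {0, ..., k-1}."""
    return min(int(p * k), k - 1)

def bin_to_win_prob(b: int, k: int = K) -> float:
    """Map a class index back to the centre of its bucket."""
    return (b + 0.5) / k

# 97% and 90% land in nearby buckets; to the loss they are just two of the
# many "winning" classes rather than numerically distant regression targets.
print(win_prob_to_bin(0.97), win_prob_to_bin(0.90))  # 124 115
print(round(bin_to_win_prob(124), 3))                # 0.973
```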
Anyway, I have spent a lot more time on this than I originally planned to, but I think there is a lot of benefit in reading papers like these. A lot of the heavy lifting behind results like these comes down to really good, inspired design choices, and after a point you need to be good at noticing those details in papers rather than just the transformer and the losses. End of discussion. Write to me if you want an annotated copy of the paper.