AlphaZero is an intriguing new chess evaluation function. It is not a full chess engine because it does not do time management, a significant component of a full chess engine.
Here are some recommendations on how to compare a new evaluation function against an existing chess engine such as Stockfish. This was inspired by how poorly the AlphaZero paper did it; these would be improvements:
- Record and publish the exact version of Stockfish used, including compilation options.
- Set Stockfish to use 1 thread only and to search a fixed number of nodes per move (the "go nodes" UCI protocol command). This allows others to replicate Stockfish's evaluation exactly. (Using only 1 thread avoids the nondeterminism of multithreading.)
- Adjust the level of Stockfish (number of nodes per move) or the level of your new evaluation function (assuming yours is tunable) so that games played between them result in about a 50% score: both sides win approximately as often as they lose. If the new evaluation function is very strong, Stockfish might have to be given a huge time advantage to achieve a 50% score. A 50% score makes it possible to highlight the strengths and weaknesses of both sides.
- Report the strength difference between your evaluation function and Stockfish as the difference in time consumed by each side in achieving the 50% score.
- If you want to show off how strong your evaluation function is, and your evaluation function is tunable, retune your (presumably stronger than Stockfish) evaluation function to reach a 75% score, keeping Stockfish's settings the same.
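The recommendations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a tested match harness: the function names, the node range, and the tolerance are all made up here, and new_eval_score stands in for whatever code actually plays a match and reports your evaluation function's score. The only parts taken from the UCI protocol are the command strings themselves. The search also assumes that your score does not go up when Stockfish is given more nodes, and real match results are noisy, so the tolerance should be no tighter than the number of games played can support.

```python
import math
from typing import Callable, List, Sequence


def uci_commands(nodes: int, moves: Sequence[str] = ()) -> List[str]:
    """UCI commands asking an engine for one move with a fixed node budget."""
    position = "position startpos"
    if moves:
        position += " moves " + " ".join(moves)
    return [
        "setoption name Threads value 1",  # single thread, for reproducibility
        position,
        f"go nodes {nodes}",  # fixed node count instead of a time limit
    ]


def calibrate_nodes(
    new_eval_score: Callable[[int], float],  # hypothetical: plays a match, returns your score in [0, 1]
    target: float = 0.5,
    lo: int = 1_000,  # assumed search range for Stockfish's nodes per move
    hi: int = 100_000_000,
    tolerance: float = 0.01,
) -> int:
    """Bisect Stockfish's nodes per move until the new evaluator scores about `target`."""
    while hi - lo > 1:
        # Geometric midpoint, clamped so that the interval always shrinks.
        mid = round(math.sqrt(lo * hi))
        mid = min(max(mid, lo + 1), hi - 1)
        score = new_eval_score(mid)
        if abs(score - target) <= tolerance:
            return mid
        if score > target:
            lo = mid  # new evaluator still scores too well: give Stockfish more nodes
        else:
            hi = mid  # Stockfish is too strong at this budget: give it fewer nodes
    return lo


def toy_score(nodes: int) -> float:
    """Made-up stand-in for real matches: exactly 50% at 1,000,000 nodes."""
    return 1.0 / (1.0 + math.sqrt(nodes / 1_000_000))


if __name__ == "__main__":
    print(uci_commands(100_000, ["e2e4", "e7e5"]))
    print(calibrate_nodes(toy_score))
```

The midpoint is geometric because plausible node budgets span several orders of magnitude. For the 75% experiment the same kind of search could be pointed at your own function's tuning knob instead, with target=0.75 and the comparison direction adjusted to match how that knob affects strength.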