Reproducibility, Metamorphic Testing, and the Subtleties of Assessing Inconsistencies in AI Systems
This is not just another blog post about chess—or at least, not only about chess. While the setting involves Stockfish, the world’s strongest open-source chess engine, the real discussion here is broader:
How do we carefully assess inconsistencies in complex AI systems? When an AI model—or a highly optimized program—seems to violate fundamental expectations, how do we tell the difference between a genuine bug and an artifact of the evaluation setup?
These questions brought me to the paper Metamorphic Testing of Chess Engines (IST, 2023). The authors applied metamorphic testing (MT)—a well-known software testing technique—to Stockfish and found numerous evaluation inconsistencies, suggesting that Stockfish might be flawed in ways that had gone unnoticed.
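To make the idea of metamorphic testing concrete, here is a minimal, self-contained sketch. It uses a toy material-count evaluator standing in for a real engine, and checks one plausible metamorphic relation: mirroring the colours of all pieces should negate the evaluation. The function names and the relation itself are illustrative assumptions for this post, not the exact relations used in the paper.

```python
# Toy sketch of a metamorphic relation for a chess evaluator.
# A real MT study would query an engine like Stockfish; here a
# material-count function stands in so the example is runnable.

PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9, "K": 0}

def evaluate(position):
    """Toy evaluation: material balance from White's perspective.
    `position` is a list of (piece, is_white) tuples."""
    return sum(
        PIECE_VALUES[piece] if is_white else -PIECE_VALUES[piece]
        for piece, is_white in position
    )

def mirror(position):
    """Metamorphic transformation: swap the colour of every piece."""
    return [(piece, not is_white) for piece, is_white in position]

def check_mirror_relation(position):
    """Metamorphic relation: mirroring colours should negate the score.
    A violation would flag an evaluation inconsistency."""
    return evaluate(mirror(position)) == -evaluate(position)

pos = [("K", True), ("Q", True), ("K", False), ("R", False)]
print(check_mirror_relation(pos))  # → True for this toy evaluator
```

The appeal of MT is that it sidesteps the oracle problem: we rarely know the "true" evaluation of a position, but we can still demand that related positions be scored consistently.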
But one issue made me pause: the study was conducted at depth=10.