Friday, October 20, 2006

Computer Chess960 tournament

I ran a computer Chess960 tournament between 5 engines that came with Arena 1.1 (AICE was not included because it played so poorly.) This is an example of how I believe computer chess tournaments should be done.

I ran 30 games between each pair of programs, starting from different Chess960 positions. I rank the programs by how well they did against their worst enemy, that is, by their winning percentage (draws are 0.5 win) against the opponent against whom they performed the poorest. Thus, Spike is the worst enemy of Betsy, Chispa, and Frenzee, with winning percentages of 27%, 18%, and 27% respectively. Spike's worst enemies are Betsy and Frenzee with a winning percentage of 73%. However, it might be argued that Spike's worst enemy is itself, and it's tautological winning percentage (not measured) is 50%. In either case, the ranking of the programs is Spike, followed by a tie between Betsy and Frenzee, and in last place Chispa.

I am aware that several of these programs have new versions available, the purpose of this tournament is demonstrate the worst-enemy ranking system.

This worst-enemy ranking is inspired by game theory, where an agent chooses a best-response strategy against its opponents. (And when all agents are simultaneously in best response to all other agents, they are in Nash equilibrium.) The worst-enemy of a program is an approximation to the best-response against that program. By ranking by performance against worst-enemy, as opposed to total (equivalently average), performance against all the other programs, one eliminates the effect that one program gets ranked higher than another because it on average stomped on some of the weaker programs more than the other program.

Suppose, in the extreme scenario, the program "A" with the highest average performance in fact lost all its games against some other program "B". Is that program really so good? It kind of "lucked out" that the field was not populated with clones of "B". Although it might have a high average score, the way to beat it 100% the time is "out of the bag", encoded in program "B". On the other hand, another program for which the best-response way of beating it works only 70% of the time ought to be considered better.

This ranking style can practically only be applied to computer tournaments, as humans would tire (and consequently play differently) for the hundreds of games needed. It works especially well for Chess960, for which the randomized initial position guarantees a different game for each of the many encounters between two programs. In contrast for traditional chess, two programs playing 30 games against each other could (and in fact should, if they are attempting to play game-theoretically optimally) repeat the same game over and over again.

The tournament between 5 engines took a few days to complete, although it was not continuous computation. The time controls were extremely fast: 20 seconds plus 1 second increment, which was about as fast I could get it to go without the programs either flagging for time or entering a panic mode of extremely poor moves and shallow searches. I chose 30 games in order to get a statistically significant result. Obviously, one of the reasons this tournament was possible was because no human operator was needed for the programs.

-----------------Betsy 6.51-----------------
Betsy 6.51 - Chispa 4.03   : 18.0/30 15-9-6 (111=10111=01001001010=10=1==11)  60%   +70
Betsy 6.51 - Frenzee 2.00  : 12.0/30 9-15-6 (01=01=10101=000==010010000=101)  40%   -70
Betsy 6.51 - Hermann 1.5   : 17.5/30 15-10-5 (1010==0011111==10100=101001111)  58%   +56
Betsy 6.51 - Spike 1.0a    : 8.0/30 6-20-4 (11010=0000=00=10=1010000000000)  27%  -173
-----------------Chispa 4.03-----------------
Chispa 4.03 - Betsy 6.51   : 12.0/30 9-15-6 (000=01000=10110110101=01=0==00)  40%   -70
Chispa 4.03 - Frenzee 2.00 : 6.0/30 5-23-2 (00000010000100100=0000001=0001)  20%  -241
Chispa 4.03 - Hermann 1.5  : 12.0/30 11-17-2 (001110010=01110000=01000101010)  40%   -70
Chispa 4.03 - Spike 1.0a   : 5.5/30 5-24-1 (001100100000000001=10000000000)  18%  -263
-----------------Frenzee 2.00-----------------
Frenzee 2.00 - Betsy 6.51  : 18.0/30 15-9-6 (10=10=01010=111==101101111=010)  60%   +70
Frenzee 2.00 - Chispa 4.03 : 24.0/30 23-5-2 (11111101111011011=1111110=1110)  80%  +241
Frenzee 2.00 - Hermann 1.5 : 17.0/30 16-12-2 (=111110101100101011=1000101100)  57%   +49
Frenzee 2.00 - Spike 1.0a  : 8.0/30 6-20-4 (001=00011==001000000=001000001)  27%  -173
-----------------Hermann 1.5-----------------
Hermann 1.5 - Betsy 6.51   : 12.5/30 10-15-5 (0101==1100000==01011=010110000)  42%   -56
Hermann 1.5 - Chispa 4.03  : 18.0/30 17-11-2 (110001101=10001111=10111010101)  60%   +70
Hermann 1.5 - Frenzee 2.00 : 13.0/30 12-16-2 (=000001010011010100=0111010011)  43%   -49
Hermann 1.5 - Spike 1.0a   : 3.5/30 3-26-1 (00000=000010000010000100000000)  12%  -346
-----------------Spike 1.0a-----------------
Spike 1.0a - Betsy 6.51    : 22.0/30 20-6-4 (00101=1111=11=01=0101111111111)  73%  +173
Spike 1.0a - Chispa 4.03   : 24.5/30 24-5-1 (110011011111111110=01111111111)  82%  +263
Spike 1.0a - Frenzee 2.00  : 22.0/30 20-6-4 (110=11100==110111111=110111110)  73%  +173
Spike 1.0a - Hermann 1.5   : 26.5/30 26-3-1 (11111=111101111101111011111111)  88%  +346


300 games played / Tournament finished
Name of the tournament: Arena tournament
Site/ Country: COMPUTER, United States
Level: Blitz 0:20/1
Hardware: Intel(R) Celeron(R) M processor     1.60GHz  with 1,014 MB Memory
Operating system: Microsoft Windows XP Home Edition Service Pack 2 (Build 2600)

No comments :