Why are opening suites like the Nunn, Noomen, and now also the Silver suite not bigger?
A statistical test on proportions gives a maximum error of about 14% on the score ratio over 50 games at 95% confidence. If you are comparing evaluation functions, I believe this is a large number.
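For reference, the 14% figure can be reproduced with the standard normal approximation for a binomial proportion; the variance p(1-p)/n is largest at p = 0.5, which gives the worst-case (maximum) error. This is a sketch of that calculation, not anything engine-specific:

```python
import math

def max_margin_of_error(n_games: int, z: float = 1.96) -> float:
    """Worst-case half-width of the 95% confidence interval on the
    score ratio over n_games independent games. The binomial variance
    p*(1-p)/n is maximized at p = 0.5."""
    p = 0.5
    return z * math.sqrt(p * (1 - p) / n_games)

print(round(max_margin_of_error(50), 3))  # 0.139, i.e. about 14%
```

Note this treats each game as an independent win/loss trial; draws reduce the per-game variance somewhat, so the true error is a bit smaller.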
I have done some tests myself and found great differences in results across different test suites.
I would also like to know if anyone has tried to measure the variance of results between the same two engines as a function of search depth or time control (either on the same set of positions or on different ones).
My guess is that the variance will be higher at low search depths, due to the greater number of tactical errors.
But the crudeness of the evaluation functions used should perhaps also be considered here.
I find this interesting because a precise estimate of the variance would give a better (and smaller) estimate of the maximum error, but longer searches take more time to compute. What is the quickest way to get precise results when matching engines' evaluation functions: short time controls and many games, or longer time controls and fewer games?
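To make the tradeoff concrete, inverting the formula for the confidence interval gives the number of games needed to reach a target margin of error. This is a sketch under the same worst-case normal approximation as above (it ignores the draw rate and any depth-dependent variance, which is exactly what my question is about):

```python
import math

def games_needed(target_margin: float, z: float = 1.96,
                 p: float = 0.5) -> int:
    """Number of games so that the 95% confidence interval half-width
    on the score ratio is at most target_margin (worst case at p = 0.5)."""
    return math.ceil((z / target_margin) ** 2 * p * (1 - p))

# Halving the margin of error roughly quadruples the number of games:
print(games_needed(0.10))  # 97
print(games_needed(0.05))  # 385
```

The quadratic growth is the crux: if short time controls only double the per-game variance, many fast games still win; if tactical noise at low depth inflates the variance much more than that, fewer long games may be the quicker route.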