Dieter Bürßner wrote: 20% is really a lot. I have seen similar things, with a smaller magnitude of the effect. I could not explain it, even after carefully studying the generated assembly. In one example, a testing loop of max() tricks, one specific implementation of max() was faster than the empty loop. (The assembly of the empty loop was exactly the same, just with the 2 or 3 assembler statements of the real max() implementation missing.)
In my engine, originally, I did not have an epd-test command. I had a similar command to test older Crafty/Arasan style test suites. When I wanted to implement the epdtest, I was lazy and just copied the old code and changed it appropriately. Obviously, in both test codes, practically no time is used by the test code itself, and search() is called exactly once, in exactly the same manner with the same parameters. I could reproduce differences in solution times of over 10%. Nodes to solution were identical. The overhead of setting up the search was totally negligible. It must be some subtle caching effect. With another compiler, results were identical.
Regards,
Dieter
I'm not sure how many cycles per node Tord is getting.
Probably not something like this:
nps=2570911
I'm getting that with a tiny chess program which I wrote for fun.
Its nps drops to 1.9 million when I disable doing all checks in qsearch.
(Actually, 'doing all checks' is my bad English again: it's either not doing checks, or doing checks at the first ply in qsearch when SEE >= 0 for a check.)
That's just one C construct of difference in the code. I thought it was already doing them, but I had not noticed a ! somewhere in the code. I found that last night when trying a mate in 2 position (WAC 2), which it took more than 2 ply to find...
In any case, this program is under 1000 cycles per node on 32-bit processors (you won't be surprised that it's entirely 32-bit code, though I plan to introduce BITBOARD pawnboard[2] in it) when doing a few checks in qsearch.
So if I change something tiny in it which costs 100 cycles per node, then the program will slow down by about 10%.
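To make that arithmetic concrete, here is a minimal sketch in C. The function names are made up for illustration and are not from any engine. It takes the 1000 cycles per node figure as the baseline and 100 cycles as the hypothetical extra cost, and shows the two ways of reading "slows down about 10%": time per node goes up by 10%, while nps goes down by roughly 9%.

```c
/* Hypothetical helpers, for illustration only. */

/* Percentage increase in time per node when extra_cycles are added
   to a node that previously cost base_cycles. */
static double time_increase_pct(double base_cycles, double extra_cycles)
{
    return 100.0 * extra_cycles / base_cycles;
}

/* Percentage drop in nodes per second for the same change:
   nps scales with 1 / cycles_per_node at a fixed clock speed. */
static double nps_drop_pct(double base_cycles, double extra_cycles)
{
    return 100.0 * extra_cycles / (base_cycles + extra_cycles);
}
```

With base_cycles = 1000 and extra_cycles = 100, time_increase_pct gives 10 and nps_drop_pct gives about 9.09. The cheaper the node, the more a fixed extra cost shows up: the same 100 cycles on a 500 cycle node would be a 20% increase in time per node.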
However, knowing Tord's coding style, it'll be some sort of side effect: some variables overwrite others (ever bounds-checked?), and by making the code longer you get something like the above effect, where some feature gets enabled or disabled, causing problems.
Blaming the compiler is simply bad taste for a program that's most likely not even close to 1000 cycles per node.
Additionally, a register stall is not 200 cycles. Only something like computing a sine could possibly eat 200 cycles.
It'll be a side effect or simply a human error.