Winboard Forum

Posted: **29 Jul 2005, 15:34**

I just discovered something really strange in my program. My evaluation function contains the following line of code:

Code: Select all: mobility[side] += QMobBonus[5];

QMobBonus[5] has the constant value 0. Therefore, I would expect my program to behave exactly identically with and without this line. As expected, it does behave almost identically. In all positions I have tried, the best moves, node counts, scores and PV have been exactly the same with and without the above line of code.

There is only one difference: The program runs about 20% faster with the silly line of code above. How is this possible? I would really like to remove this line, but 20% is a lot.

Tord

Posted: **29 Jul 2005, 15:41**

One thing I forgot to mention, but which makes the whole thing even more strange: The line of code in question is normally executed at most twice per node searched. I really don't understand how it can have any influence on the program's performance.

Tord

Posted: **29 Jul 2005, 16:45**

Tord,

it may influence the optimization. Maybe the optimizer concludes from variable "side" used once more to reserve a register for it or a related index offset. You should watch an assembler around that statement.

There are a lot of possibilities. Maybe just the existence of this statement is sufficient to prevent a factorization as subroutine. Do you optimize for size?

Reinhard.

Posted: **29 Jul 2005, 16:45**

Hi Tord,

could you show some more context around this line?

Sven

Posted: **29 Jul 2005, 17:12**

A dump of the disassembly of this function might also reveal some fascinating stuff.

Posted: **29 Jul 2005, 18:47**

Hi Tord.

I noticed the same behaviour in my program. I suppose it is connected with CPU's internal optimization. Or maybe the reason is that speed depends on the address of funcion in CPU's instruction cache?

I also noticed another strange behaviour. I made 2 IDENTICAL functions and measured the execution time of them. Sometimes there was over 25% difference! I checked generated assembly code and both were also IDENTICAL!

Another strange behaviour: when I change the order of 2 functions in code, I get different execution times.

Posted: **29 Jul 2005, 18:56**

It's possible that it's causing later code or data to align better to your cache. Look at what's after this code in your memory map. Are there any hotspots? Try using your compiler's align keywords and compile flags to see if you can produce the same speedup with out that line of code.

- Dan Pineo

Posted: **29 Jul 2005, 19:21**

Tord Romstad wrote:I just discovered something really strange in my program. My evaluation function contains the following line of code:

Code: Select all
mobility[side] += QMobBonus[5];

QMobBonus[5] has the constant value 0. Therefore, I would expect my program to behave exactly identically with and without this line. As expected, it does behave almost identically. In all positions I have tried, the best moves, node counts, scores and PV have been exactly the same with and without the above line of code.

There is only one difference: The program runs about 20% faster with the silly line of code above. How is this possible? I would really like to remove this line, but 20% is a lot.

Tord

Hi Tord,

can you mention what compiler you use? Is it 64-bit or 32-bit exceutable?
Profile and optimization tools like vtune etc. would be interesting to inspect what exactly happens.

OTOH today cpus with huge cache- and branch- and whatever heuristics behave chaotical. Minor changes may suddenly result in speedup or slowdown. Kind of Prefetching is an issue - holding an important memory block, one cacheline, in L1 cache, while without reading that address twice per node that particular cacheline may be thrown out of L1-cache.

Also due to reading/adding zero, even a cache miss may be "predicted" correctly, not throwing away some speculative result and other dependend instructions.

The order of defining variables, globals as well as locals or members of structs/classes may have enormous effect too.

Gerd

Posted: **29 Jul 2005, 20:05**

Another code issue at least on amd-athlon32/64 cpus, where some compiler are probably not aware of: if you have 3 conditional branches inside one 16-byte fetch block, the third branch gets always misspredicted:

Code: Select all: cmp eax, ebx ; 16 byte alignment followed by a lot of "short" conditional jumps ; all inside one 16-byte fetch xxx0: je equal1 ; most likely not taken jg greater ; most likely not taken cmp eax, ecx je equal2 ; allways wrong predicted

So some nops or different instruction scheduling or alignment may help a lot here and there. I suggest to more ar less ignore such effects during development - and "trust" compilers ability to take optimal advantage of special target cpus and their heuristics.

From time to time or before you make a final release you may play around a bit with optimizations, weird dummy statements and order or placement of variables.

What about to put a 1 (or other none zero value) into your memory and test the memory read versus +1 ;-)

Gerd

Posted: **29 Jul 2005, 23:17**

Tord Romstad wrote:I just discovered something really strange in my program. My evaluation function contains the following line of code:

Code: Select all
mobility[side] += QMobBonus[5];

QMobBonus[5] has the constant value 0. Therefore, I would expect my program to behave exactly identically with and without this line. As expected, it does behave almost identically. In all positions I have tried, the best moves, node counts, scores and PV have been exactly the same with and without the above line of code.

There is only one difference: The program runs about 20% faster with the silly line of code above. How is this possible? I would really like to remove this line, but 20% is a lot.

Tord

I wonder if the problem is not dependent on the compiler or the optimization.
I wonder if the best compiler or the best optimization is used in both cases.

Uri

Posted: **30 Jul 2005, 00:37**

Uri has a very good point.
A different compiler or different compiler setting may completely change the behavior.

Posted: **30 Jul 2005, 11:34**

Tord Romstad wrote:I just discovered something really strange in my program. My evaluation function contains the following line of code:

Code: Select all
mobility[side] += QMobBonus[5];

QMobBonus[5] has the constant value 0. Therefore, I would expect my program to behave exactly identically with and without this line. As expected, it does behave almost identically. In all positions I have tried, the best moves, node counts, scores and PV have been exactly the same with and without the above line of code.

There is only one difference: The program runs about 20% faster with the silly line of code above. How is this possible? I would really like to remove this line, but 20% is a lot.

Tord

I guess you compared 2 different compiles with each other. Probably 1 with optimizations and 1 without

Posted: **31 Jul 2005, 13:12**

20% is really a lot. I have seen similar things with a smaller magnitude in the effect. I could not explain it, even after carefully studying the generated assembly. In one example, a testing loop of max() tricks, for one specific implementation of max() was faster, then the empty loop. (assembly was exactly the same of the empty loop, just 2 or 3 assembler statements missing compared to the real max() implementation).

In my engine, originally, I did not have an epd-test comand. I had a similar command to test older Crafty/Arasan style testsuites. When I wanted to implement the epdtest, I was lazy and just copied the old code, and changed it approriately. Obviously, in both test codes, practically no time is used, and search() is exactly called once, in exactly the same manner with the same parameters. I could reproduce differences in solution times of over 10%. Nodes to solution was identical. The overhead of setting up the search was totally neglectible. Must be some subtle caching effect. With another compiler, results were identical.

Regards,
Dieter

Posted: **31 Jul 2005, 15:19**

Dieter B?r?ner wrote:20% is really a lot. I have seen similar things with a smaller magnitude in the effect. I could not explain it, even after carefully studying the generated assembly. In one example, a testing loop of max() tricks, for one specific implementation of max() was faster, then the empty loop. (assembly was exactly the same of the empty loop, just 2 or 3 assembler statements missing compared to the real max() implementation).

In my engine, originally, I did not have an epd-test comand. I had a similar command to test older Crafty/Arasan style testsuites. When I wanted to implement the epdtest, I was lazy and just copied the old code, and changed it approriately. Obviously, in both test codes, practically no time is used, and search() is exactly called once, in exactly the same manner with the same parameters. I could reproduce differences in solution times of over 10%. Nodes to solution was identical. The overhead of setting up the search was totally neglectible. Must be some subtle caching effect. With another compiler, results were identical.

Regards,
Dieter

I'm not sure how many cycles per node Tord is getting.

Probably not something like this :
nps=2570911

I'm getting that at a tiny chessprogram which i wrote for fun.
It's nps drops to 1.9 mln per second when i disable doing all checks in qsearch.
(actually 'doing all checks' is my bad english again. it's either not doing checks or doing checks at first ply in qsearch when SEE >= 0 for a check)

That's just 1 C construct difference in the code. I thought it was already doing them, but i had not noticed a ! in the code somewhere. I found that last night when trying a mate in 2 position (WAC 2) which it took more than 2 ply for to find...

In any case, this program is under 1000 cycles per node at 32 bits processors (you won't be surprised that it's entirely 32 bits code,
though i plan to introduce BITBOARD pawnboard[2] in it), when doing a few checks in qsearch.

So if i change something tiny for it, which costs 100 cycles per node, then the program will slow down about 10%.

However, knowing Tord's coding style, it'll be some sort of a side effect, that some variables overwrite others (ever boundschecked?) and by getting the code size longer you get something like the above effect whereas some feature gets enabled or disabled causing problems.

Blaming the compiler is bad taste simply for a program that's most likely not even close to 1000 cycles per node.

Additional a register stall is not 200 cycles. Only doing some sort of a sinus could possibly eat 200 cycles.

It'll be a side effect or simply a human error.

Posted: **31 Jul 2005, 20:01**

diepeveen wrote:I'm getting that at a tiny chessprogram which i wrote for fun.
It's nps drops to 1.9 mln per second when i disable doing all checks in qsearch.
(actually 'doing all checks' is my bad english again. it's either not doing checks or doing checks at first ply in qsearch when SEE >= 0 for a check)

That's just 1 C construct difference in the code.

Vincent, this is a very different situation from the one Tord pointed out (if I understand correctly). Your change gives a different search tree, different calls of functions in the same situation, etc.

Tord basically deleted 2 (or so) assembly statements from the produced code, that do not change anything at all. Just if they were 2 noops.

Of course I oversimplified above. And other things might change, things that were mentioned already in this thread. Still a surprisingly large difference, and in the different direction, one would expect.

An interesting test might be, to edit the generated assembly, and throw out the statements. (If variable side is allocated into a register and reused later, the assignment to the register must be left in of course, but the memory access can still be thrown out. Similar thougts might apply for other variables accessed). Then recompile the assembly. I'd guess, it will already be slower.

Regards,
Dieter

Posted: **01 Aug 2005, 02:33**

As i pointed out, he probably has made a mistake somewhere causing either a different search tree, or did mix 2 different type of compiles,
or had forgotten some ini file, or some other bug in the program caused it.

You assume he made 0 bugs in testing.

If i may honestely ask did you *ever* check out the Glaurung source code?

It's a *disaster*

One bug gets fixed by another one.

If there is something overwriting at adress X, he adds somethign to the code, then those tables might land a few adresses further.

So a boundsbug then will land somewhere else, causing a different tree or whatever.

All of that can explain 20% difference in speed.

Not 2 NOOPs.

The odds his report was flawed or caused by some other bug or testing mistake is simply 99.999%; the odds it is a compiler bug causing 20% speed diff is really less than 1 in a 100000. Let's face it that he's running on a modern processor that knows how to reorder code and has another X tricks, so all kind of weird things from the past, causing sometimes 5% difference in execution speed (note NEVER 20%) cannot happen.

Let's not go get theoretic discussions on scenario's that happened to a very good programmer like you are. Let's just first get him check his code again for some bugs and if he can't find it, perhaps you can.

I advice first do a boundscheck of his code, that should solve a 100 bugs or so.

Dieter B?r?ner wrote:
diepeveen wrote:I'm getting that at a tiny chessprogram which i wrote for fun.
It's nps drops to 1.9 mln per second when i disable doing all checks in qsearch.
(actually 'doing all checks' is my bad english again. it's either not doing checks or doing checks at first ply in qsearch when SEE >= 0 for a check)

That's just 1 C construct difference in the code.

Vincent, this is a very different situation from the one Tord pointed out (if I understand correctly). Your change gives a different search tree, different calls of functions in the same situation, etc.

Tord basically deleted 2 (or so) assembly statements from the produced code, that do not change anything at all. Just if they were 2 noops.

Of course I oversimplified above. And other things might change, things that were mentioned already in this thread. Still a surprisingly large difference, and in the different direction, one would expect.

An interesting test might be, to edit the generated assembly, and throw out the statements. (If variable side is allocated into a register and reused later, the assignment to the register must be left in of course, but the memory access can still be thrown out. Similar thougts might apply for other variables accessed). Then recompile the assembly. I'd guess, it will already be slower.

Regards,
Dieter

Posted: **01 Aug 2005, 10:03**

diepeveen wrote:
...So a boundsbug then will land somewhere else, causing a different tree or whatever...

I think that different tree cannot be the problem because he reported that he get the same number of nodes and the only difference is speed.

I did not check his code and bugs are of course a possible option.

Uri

Posted: **01 Aug 2005, 11:54**

Hi all,

Thanks for the feedback!

I discovered this strange phenomenon when I experimented with evaluating queen mobility. In the private 0.2.5x version which I have sent to some beta testers, I have the following code:

Code: Select all: for(sq=PL_START((side<<3)|QUEEN); sq<=H8; sq=PL_NEXT(sq)) { queens[side]++; /* Queens on 7th and 8th rank: */ if(PawnRank[side][sq] >= RANK_7) rooks_on_7th[side]++; /* Compute mobility */ x = 0; ptr = MobilityIncrement[side]; for(tosq = sq + 15; Board[tosq] == EMPTY; tosq += 15) x++; x += ptr[Board[tosq]]; for(tosq = sq - 15; Board[tosq] == EMPTY; tosq -= 15) x++; x += ptr[Board[tosq]]; for(tosq = sq + 17; Board[tosq] == EMPTY; tosq += 17) x++; x += ptr[Board[tosq]]; for(tosq = sq - 17; Board[tosq] == EMPTY; tosq -= 17) x++; x += ptr[Board[tosq]]; for(tosq = sq + 1; Board[tosq] == EMPTY; tosq++) x++; x += ptr[Board[tosq]]; for(tosq = sq - 1; Board[tosq] == EMPTY; tosq--) x++; x += ptr[Board[tosq]]; for(tosq = sq + 16; Board[tosq] == EMPTY; tosq += 16) x++; x += ptr[Board[tosq]]; for(tosq = sq - 16; Board[tosq] == EMPTY; tosq -= 16) x++; x += ptr[Board[tosq]]; x = max(x, 0); mobility[side] += QMobBonus[x]; e_mobility[side] += QMobBonusEndgame[x]; }

I wanted to compare this version with a similar version without the queen mobility evaluation, and therefore commented out the code above and recompiled. To my astonishment, this made the program much slower. After some experimentation, I found out that the mobility[side] += QMobBonus[x] statement that made all the difference. Removing the rest of the code and keeping this single statement produced an executable without the 20% slowdown. I then replaced QMobBonus[x] with QMobBonus[5] because this turned out to be just as fast, and QMobBonus[5] happens to be 0.

Of course, I checked very carefully that the compiler settings were always the same, and recompiled and tested several times in order to make sure I did not make any stupid and trivial mistake somewhere.

The compiler used was GCC 3.3 on a Pentium IV Linux machine. I could not reproduce the strange behavior with GCC 4.0 on my iMac G5. On the Mac, the mobility[side] += QMobBonus[5] statement did not have any measurable effect on performance.

Like others, I have sometimes observed that tiny and intuitively unimportant changes in my code will sometimes unexpectedly make the program a few percent faster or slower. I have never seen anything close to 20%, though.

This problem is no longer very important to me, because I have decided I want to keep queen mobility. However, I still think this behaviour is extremely weird, and it would be interesting to understand what is really happening. Perhaps I should have a look at the assembly language output of the compiler, as Gian Carlo suggested.

Who knows, perhaps I could double the speed of my engine by adding some irrelevant line of code at the right place in my program? :wink:

Tord

Posted: **01 Aug 2005, 14:36**

Hi Tord,

please check the value of x against its allowed maximum between these two lines:

Code: Select all: x = max(x, 0); /* check x here */ mobility[side] += QMobBonus[x]; e_mobility[side] += QMobBonusEndgame[x];

for instance by adding an assert(). Maybe accessing QMobBonus[] with an illegal x has an unwanted side effect which makes some other code of your program faster (but wrong)?

Sven

Posted: **01 Aug 2005, 14:54**

Tord Romstad wrote:The compiler used was GCC 3.3 on a Pentium IV Linux machine. I could not reproduce the strange behavior with GCC 4.0 on my iMac G5. On the Mac, the mobility[side] += QMobBonus[5] statement did not have any measurable effect on performance.

I've also had strange things happening to me with that compiler. One I remember very well, was that adding an empty if statement (with a simple no-side-effects expression) would make the program faster, and also change the node count. Although I did never discover what exactly was going on, I strongly suspect it was caused by some bugs elsewhere. After some time and several bug fixes, I tried again and could not reproduce that behavior anymore...

Winboard Forum

How can this make my program faster?

How can this make my program faster?

Re: How can this make my program faster?

Re: How can this make my program faster?

Re: How can this make my program faster?

Re: How can this make my program faster?

Re: How can this make my program faster?

Re: How can this make my program faster?

Re: How can this make my program faster?

Re: How can this make my program faster?

Re: How can this make my program faster?

Re: How can this make my program faster?

Re: How can this make my program faster?

Re: How can this make my program faster?

Re: How can this make my program faster?

Re: How can this make my program faster?

Re: How can this make my program faster?

Re: How can this make my program faster?

Re: How can this make my program faster?

Re: How can this make my program faster?

Re: How can this make my program faster?