Side effects of code changes on runtime

Programming Topics (Computer Chess) and technical aspects such as test techniques, book building, program tuning, etc.

Moderator: Andres Valverde

Side effects of code changes on runtime

Postby Onno Garms » 12 May 2007, 10:58

Hello all,

I'm experiencing a problem with runtime and executable-size instability. Between two versions of my engine there are only minor changes (a feature added in version B that is not in version A) and no changes at all in the code that is executed by the test in question. In spite of this, the test (running alpha-beta on the same 10 positions) is about 5% faster for version B. Also, the executable of version B is about 5% smaller (even though it contains one more feature).

My question: how can this happen? I'm using gcc4 on SUSE Linux 10.0.

Additional information:
    - Node count is exactly the same for both versions.
    - Valgrind's memcheck does not report any errors. (However, I do not yet have much experience with that tool.)
    - No matter whether I use different positions or search depths, version B is always faster.
Onno Garms
 
Posts: 128
Joined: 17 Feb 2007, 11:17
Location: Bonn, Germany

Re: Side effects of code changes on runtime

Postby Onno Garms » 12 May 2007, 13:11

This might be related to the linker. The change made x.o depend on y.o; y.o itself was not changed. The test does not use x.o, but it does use y.o (in a non-time-critical way).

I found out that changing the order of the input files in the linker call does make a difference in runtime on the order of 5%. However, when building my two versions, the order of the files in the linker call was the same.

I still don't know whether this is normal or indicates badly written code, nor how to fix it or find the optimum linking order.
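
In case it helps anyone reproduce this: the addresses the linker assigns to each function are easy to inspect directly. Below is a minimal, hypothetical single-file stand-in for the x.o/y.o situation above (the function names are made up); relinking the same objects in a different order shifts these addresses, and with them the code's cache layout.
Code: Select all
/* layout_probe.c -- print where the linker placed each function */
#include <stdio.h>

void func_x(void) { }   /* imagine this living in x.o */
void func_y(void) { }   /* imagine this living in y.o */

int main(void)
{
    /* diff these addresses between two builds or two link orders */
    printf("func_x at %p\n", (void *) func_x);
    printf("func_y at %p\n", (void *) func_y);
    return 0;
}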
Onno Garms
 
Posts: 128
Joined: 17 Feb 2007, 11:17
Location: Bonn, Germany

Re: Side effects of code changes on runtime

Postby bob » 15 May 2007, 01:57

This is mainly about how your program lays out in memory, which affects how it lays out in cache. Any tiny change to the program can change its size, which changes how it lays out in cache, which changes performance. It is just something you get used to.
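
To make the mapping concrete, here is a small sketch. The cache geometry is an assumption (32KB, 8-way, 64-byte lines, hence 64 sets); with those numbers, addresses 4KB apart land in the same set, and a ninth hot line in one set evicts another.
Code: Select all
/* cache_set.c -- which set an address maps to (assumed geometry) */
#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 64         /* bytes per cache line (assumption) */
#define NUM_SETS  64         /* 32KB / (8 ways * 64B) = 64 sets */

static unsigned cache_set(uintptr_t addr)
{
    return (unsigned) ((addr / LINE_SIZE) % NUM_SETS);
}

int main(void)
{
    printf("set of 0x0000: %u\n", cache_set(0x0000));
    printf("set of 0x1000: %u\n", cache_set(0x1000)); /* same set, 4KB apart */
    printf("set of 0x1040: %u\n", cache_set(0x1040)); /* next set */
    return 0;
}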
bob
 
Posts: 156
Joined: 10 May 2006, 17:59

Re: Side effects of code changes on runtime

Postby mjlef » 15 May 2007, 08:10

Do you have alignment enabled in the compiler, as appropriate for your processor? Variables should be aligned to 64 or 32 bits, as appropriate. If variables cross machine-word boundaries, it can slow things down.
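
For anyone who wants to check what their compiler actually does, here is a little sketch (GCC attribute syntax; the struct itself is made up) that prints the resulting offsets:
Code: Select all
/* align_check.c -- inspect padding and alignment (hypothetical struct) */
#include <stdio.h>
#include <stddef.h>

struct entry {
    char      flag;                              /* 1 byte, then padding */
    long long value __attribute__((aligned(8))); /* forced to an 8-byte boundary (GCC) */
};

int main(void)
{
    printf("offset of value: %lu\n", (unsigned long) offsetof(struct entry, value));
    printf("sizeof(struct):  %lu\n", (unsigned long) sizeof(struct entry));
    return 0;
}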
mjlef
 
Posts: 64
Joined: 29 Mar 2006, 22:04

Re: Side effects of code changes on runtime

Postby Onno Garms » 15 May 2007, 09:36

Isn't variable alignment done by the compiler rather than the linker?

Up to now I believed that -O3 would choose an appropriate alignment for me. I have set only very few additional options. Is that too naive?

From the gcc manual:

-O turns on the following optimization flags: -fdefer-pop -fdelayed-branch -fguess-branch-probability -fcprop-registers -floop-optimize -fif-conversion -fif-conversion2 -ftree-ccp -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-ter -ftree-lrs -ftree-sra -ftree-copyrename -ftree-fre -ftree-ch -fmerge-constants

-O2 turns on all optimization flags specified by -O. It also turns on the following optimization flags: -fthread-jumps -fcrossjumping -foptimize-sibling-calls -fcse-follow-jumps -fcse-skip-blocks -fgcse -fgcse-lm -fexpensive-optimizations -fstrength-reduce -frerun-cse-after-loop -frerun-loop-opt -fcaller-saves -fforce-mem -fpeephole2 -fschedule-insns -fschedule-insns2 -fsched-interblock -fsched-spec -fregmove -fstrict-aliasing -fdelete-null-pointer-checks -freorder-blocks -freorder-functions -funit-at-a-time -falign-functions -falign-jumps -falign-loops -falign-labels -ftree-pre

-O3 Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops and -fgcse-after-reload options.
Onno Garms
 
Posts: 128
Joined: 17 Feb 2007, 11:17
Location: Bonn, Germany

Re: Side effects of code changes on runtime

Postby H.G.Muller » 15 May 2007, 15:51

I have experienced similar effects with gcc under Cygwin. In that case my entire program was contained in a single file, and when I deleted a few unused lines of code (under an if statement testing an error condition which, after debugging, never occurred anymore), the speed suddenly dropped by about 5%.

This was a very small program (just a 0x88 move generator plus a recursive perft routine) that fit entirely in L1, and all its data should fit very easily in L1 as well. On the Celeron-M where I observed the effect, L1 is 8-way set-associative, so if your data set doesn't actually exceed the L1 size (32KB) it is really very hard to lay it out such that you get collisions, and my data set was really only a very small fraction of the cache. So I don't think it could be due to caching.

My theory is that it has to do with code alignment. GCC lays out code such that loops begin at a 16-byte boundary, to make sure the instruction prefetcher fetches a full block of instructions after the loop branch. To this end it inserts nop instructions into the code, just before the loop, via an .align directive.

I guess these .align directives are only resolved at link time, as I am not sure that code modules always have a length that is a multiple of 16 bytes. In any case, you cannot see from the assembler source how many nops are inserted. I can very well imagine that small loops which are not executed many times (such as a ray scan over the board, just testing whether the target square is empty and looping as long as it is) are actually hurt by putting many nop instructions in front of them. Deleting the unused code probably changed the loop alignment from favorable to unfavorable.
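
The padding is easy to provoke. Here is a toy ray-scan loop in the spirit of the above (my sketch, not anyone's engine code); compiling it with gcc -O2 -S and looking at the assembler output typically shows a .p2align directive, i.e. potential nops, right before the loop label:
Code: Select all
/* ray.c -- toy ray scan; try `gcc -O2 -S ray.c` and look for .p2align */
#define EMPTY 0

int scan_ray(const char *board, int sq, int step)
{
    /* slide along the ray while the target square stays empty */
    while (board[sq] == EMPTY)
        sq += step;
    return sq;    /* first non-empty square hit */
}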
H.G.Muller
 
Posts: 3453
Joined: 16 Nov 2005, 12:02
Location: Diemen, NL

Re: Side effects of code changes on runtime

Postby bob » 16 May 2007, 17:12

H.G.Muller wrote:I have experienced similar effects with gcc under Cygwin. In that case my entire program was contained in a single file, and when I deleted a few unused lines of code (under an if statement testing an error condition which, after debugging, never occurred anymore), the speed suddenly dropped by about 5%.

This was a very small program (just a 0x88 move generator plus a recursive perft routine) that fit entirely in L1, and all its data should fit very easily in L1 as well. On the Celeron-M where I observed the effect, L1 is 8-way set-associative, so if your data set doesn't actually exceed the L1 size (32KB) it is really very hard to lay it out such that you get collisions, and my data set was really only a very small fraction of the cache. So I don't think it could be due to caching.

My theory is that it has to do with code alignment. GCC lays out code such that loops begin at a 16-byte boundary, to make sure the instruction prefetcher fetches a full block of instructions after the loop branch. To this end it inserts nop instructions into the code, just before the loop, via an .align directive.

I guess these .align directives are only resolved at link time, as I am not sure that code modules always have a length that is a multiple of 16 bytes. In any case, you cannot see from the assembler source how many nops are inserted. I can very well imagine that small loops which are not executed many times (such as a ray scan over the board, just testing whether the target square is empty and looping as long as it is) are actually hurt by putting many nop instructions in front of them. Deleting the unused code probably changed the loop alignment from favorable to unfavorable.


You might be overlooking that just because your entire "kernel" _can_ fit into L1 or L2 absolutely does _not_ mean that it _will_ fit. It is possible to lay your 32KB kernel (the main chess code) out in memory so that it all maps directly to one set of lines in the cache and thrashes wildly.

Each physical address in RAM maps to one and only one set in the cache. If you have 32KB of good stuff interlaced with "dead code", that can easily cause you not to fit completely in cache.

That's precisely the issue I was talking about.
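
One common way to keep cold error-handling code from being interlaced with the hot path is to tell the compiler which branch is rare. With GCC that is __builtin_expect (the unlikely() wrapper and the routine names below are just illustration), which together with -freorder-blocks lets it place the cold block out of line:
Code: Select all
/* sketch: hint GCC that the error path is cold (names are hypothetical) */
#define unlikely(x) __builtin_expect(!!(x), 0)

extern void handle_error(void);   /* rare, cold routine */

int process(const int *items, int n)
{
    int i, sum = 0;
    for (i = 0; i < n; i++) {
        if (unlikely(items[i] < 0)) {  /* error condition: almost never true */
            handle_error();
            return -1;
        }
        sum += items[i];
    }
    return sum;
}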
bob
 
Posts: 156
Joined: 10 May 2006, 17:59

Re: Side effects of code changes on runtime

Postby H.G.Muller » 21 May 2007, 12:13

Yes, I know this. But the Intel L1 caches are 8-way set-associative, so the probability that such a thing happens is something very unlikely raised to the power eight...

In addition, we are talking about a really small program here, just a 0x88 move generator and a perft routine to exercise it. Including the dead code, the whole thing was no bigger than 10KB, and linkers usually store code consecutively in memory without leaving unnecessary holes. The whole thing was in a single source file anyway, and no library routines were called in the timed code. So I don't see how what you mention could ever happen with a 32KB cache. Hardware manufacturers design the cache mapping such that consecutive code causes minimal collisions.

As far as the data cache is concerned: the memory allocation should not be influenced by deleting code. In addition, I always pay attention to the problem you mention, declaring all the data my engine uses in the time-critical part as a single array 'char data[N]', so that I can be sure it is contiguous in memory and no non-critical data is interjected by the linker. I then create the arrays I want to use in my source code through #defines such as:
Code: Select all
/* all offsets are into the single contiguous 'data' block declared above */
#define Board      (data + 0)                    /* 0x88 board: 0x80 = 128 bytes */
#define PiecePos   (data + 0x80)                 /* piece positions: 32 bytes */
#define PieceValue ((int *) (data + 0x80 + 32))  /* piece values follow */
...

to make absolutely certain that none of the things you mention can ever happen, no matter how malicious the linker might be.
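
If one wants to be extra sure, the size of that block can even be checked at compile time; a sketch (DATA_SIZE is a hypothetical name for N, and the negative-array trick is the usual pre-C11 static assertion):
Code: Select all
/* sketch: build-time check that the contiguous data block fits in L1 */
#define DATA_SIZE 0x400            /* hypothetical stand-in for N */
char data[DATA_SIZE];

/* compilation fails (negative array size) if the block exceeds 32KB */
typedef char data_fits_in_L1[(DATA_SIZE <= 32 * 1024) ? 1 : -1];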

Yet I experienced the effect.
H.G.Muller
 
Posts: 3453
Joined: 16 Nov 2005, 12:02
Location: Diemen, NL

