Winboard Forum

by **Gerd Isenberg** » 05 Aug 2006, 20:14

Oups, yes sorry for that - it was a copy of the java version with >>> unsigned shift right operator i posted recently ;-)

That is indeed a nice speedup - but programs are probably too different - compiler and optimization settings matter as well. For me it is indifferent and more i a noise range of 0.5 percent, maybe because i use most often this one with reset:

Code: Select all: __forceinline u32 bitScanAndReset(u64 &bb) { u32 *ptr = (u32 *)&bb; u32 a = *ptr; if ( a ) { a &= -(int)a; *ptr ^= a; return MagicTable[(a * 0xe89b2be) >> 27]; } else { ptr++; a = *ptr; a &= -(int)a; *ptr ^= a; return MagicTable[(a * 0xe89b2be) >> 27] + 32; } }

versus

Code: Select all: __forceinline u32 bitScanAndReset(u64 &bb) { { u64 b = bb ^ (bb - 1); bb = bb & (bb - 1); unsigned int fold = ((int) b) ^ ((int)(b>>32)); return lsz64_tbl[(fold * 0x78291ACF) >> (32-6)]; }

Gerd

by **diepeveen** » 13 Aug 2006, 01:11

Tord Romstad wrote:Hello guys,

Although Glaurung currently doesn't use bitboards, I follow these threads with interest. I have made numerous experiments with bitboards over the last year, but so far I have never been happy with the results. My current mailbox board representation seems to be as fast or faster than rotated bitboards for everything I do (this is on a 32-bit CPU). How do the various non-rotated bitboard techniques compare to rotated bitboards on 32-bit computers?

At the moment, my main problems with bitboards are the following:
Most of the research seems to concentrate on the problem of quickly generating a bitboard of all squares attacked by a sliding piece. This looks nice in theory, but in practise I am almost never interested in finding all the squares a piece attack. Most often, what I want to know is simply whether the piece on square x attacks square y. I still haven't found a way to do this with bitboards which is as fast as my current mailbox code (which consists of a table lookup in a 512-byte table followed by a very short and simple loop). What's the best way to do this with bitboards (rotated or not)?
I often have difficulties writing generic code for all piece types with bitboards. I end up writing separate code for each piece type (and sometimes even colour), which makes it difficult to achieve the terse and minimalistic programming style I prefer.
I detest magic and clever tricks, which annoyingly often seem to be necessary to make bitboards fast.
Tord

Yes in general the problem with bitboards is that people are busy so low level that it gets bug sensitive and you no longer work high level. Because in these public fora no one has ever done effort to optimize the 32 bits non bitboard code quite a lot, many who start now start using bitboards.

With respect to quickly determining whether a piece attacks a certain square.

This is what i do in nonbitboards:

Basically 2 compares. First try a quick array,
then do a scan.

Code: Select all: if( looprichting[piece][OpKing][u] && ScanAttack(side,piece,u,OpKing) )

It's pretty fast, just 2 compares. mispredict on K8 is of course when it is within such short code notice only a cycle or 5-7,
just take care that scanattack is a function call and not inlined.

That's in general the trick to get away with little penalty for mispredicts.

You can make that 'scanattack' real fast in nonbitboards.

Note that the looprichting table can be made a lot smaller if your board datastructure is a 16x16 board like Fruit. In diep it's of course a [64] board.

In case of fruit what you'd need is a

Code: Select all: looprichting[piece][constant+OpKing-u]

that reduces array side.

The scanattack can be done extremely fast in nonbitboards by just walking the squares from the value you get out of 'looprichting'.

For sliderpieces it's just a matter of adding that to your square until you get there.

It's quite hard with bitboards to be busy at such cycle level, this where in theory bitboards should be able to do it fast, the reality just is that the array call is going to kill a lot of unnecessary requests and just a few adds until you are at the square of destination is a few cycles anyway. No way that with just a bit of cheapo adds and moves bitboards can beat that on average in a searching program.

If you want to do everything in extremely little RAM then 16x16 board like Fruit has is very useful. My choice for a [64] board is because of historic reasons.

Branches are not bad at K8 and the new woodcrest generation 5160 and such. They already start prepredicting your branch before it's executed. Reduces branch mispredict penalty.

It's important to get rid of slow arithmetic instructions like multiply and i'm pretty convinced in future we want to avoid arrays a tad too. When clock speeds get up it'll be harder and harder for intel and amd to get good L1 read speeds and loading more than 2 simultaneously from L1 will be wishful thinking.

Just make some code with a lot of SHORT branches (that can't be replaced by CMOV's easily by compiler) and you'll see that it nearly always outguns a bunch of arrays in your searching program.

Only long branch mispredicts are a big pain at modern processors, but they sure DID solve the short branches magnificently. I bet both intel nor amd are going to comment on that.

So if within a SHORT for loop you have got this:

Code: Select all: for loop { x = 1; if( y ) x = a[y]; y = x^magicconstant; }

then even if you know for near sure that you get regurarly a mispredict it'll cost you at k8 only a cycle or 5 that mispredict.

However all those bitboard guys are inlining everything. The code gets big then and EVERY mispredict is extremely expensive then, so they need to REALLY get rid of EVERY mispredict or they get crucified.

Perhaps the new woodcrest is nicre for them as it can read ahead nearly 64 bytes or so. So if loop is within 64 bytes then, mispredict is less of a problem.

Yet it's crucial to get a lot of short loops in your program to replace slow bitboard code. Those loops are FAST.

Vincent

by **Gerd Isenberg** » 16 Aug 2006, 19:38

Hi Vincent,

beside all the anachronistical and dogmatical ballast about bitboards versus board arrays, didn't you confuse btb-pollution with predictability? I agree, if you have a lot of branches and already polluted branch target buffer, it makes sense to be careful with inlining tiny routines with branches a lot. Otherwise separate "incarnated" branches from inlined routines have usually simpler prediction pattern, rather than inside a callee, called from a lot of different places and contexts, with probably "randomly" changed actual parameters. I think profile guided optimization should cover that.

Do you know btb-size of woodcrest? K8 has 2K entries, according to the mentioned manual.

Btw. iterative versus recursive search matters as well for K8. 1:0 for iterative search with respect to predict returns correctly, see

Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
Chapter 6 Branch Optimizations page 132,ff
6.4 Pairing CALL and RETURN
6.5 Recursive Functions

Cheers,
Gerd

Winboard Forum

Pretty fast compact bitboard move generator

Re: Pretty fast compact bitboard move generator

Re: Pretty fast compact bitboard move generator

Re: Pretty fast compact bitboard move generator

Who is online