Winboard Forum

by **jromang** » 12 May 2006, 14:20

Hi every body

I have 2 questions about bitboards :

1) What is the fastest C code (without ASM) for fist-bit last-bit and popcount functions on AMD64 in 64 bit mode ?

2) Is someone using SSE2 extensions (i'm trying to play with intrisics) to speed up bitboard stuff ? Does someone have usefull exemples of what can be done ?

Thanks :-)

by **Gerd Isenberg** » 12 May 2006, 15:35

What is fastest depends on the program. I guess BitScanForward and BitScanReverse intrinsics despite 9 cycle vector path - De Bruijn multiplication might be worth a try for bitscan forward:

http://supertech.csail.mit.edu/papers/debruijn.pdf

Code: Select all: const BitBoard magic = 0x03f79d71b4cb0a89; const unsigned int magictable[64] = { 0, 1, 48, 2, 57, 49, 28, 3, 61, 58, 50, 42, 38, 29, 17, 4, 62, 55, 59, 36, 53, 51, 43, 22, 45, 39, 33, 30, 24, 18, 12, 5, 63, 47, 56, 27, 60, 41, 37, 16, 54, 35, 52, 21, 44, 32, 23, 11, 46, 26, 40, 15, 34, 20, 31, 10, 25, 14, 19, 9, 13, 8, 7, 6, }; unsigned int bitScanForward (BitBoard b) { return magictable[((b&-b)*magic) >> 58]; }

I played with SSE2-intrinsics. I think the bit[64]*byte[64] dot product is a SSE2-domain, and fillstuff with quad-bitboards...

Code: Select all: #include <emmintrin.h> typedef unsigned __int64 BitBoard; typedef unsigned char BYTE; #define XMM_ALIGN __declspec(align(16)) int dotProduct(BitBoard bb, BYTE weights[] /* XMM_ALIGN */) { static const BitBoard XMM_ALIGN sbitmask[2] = { 0x8040201008040201, 0x8040201008040201 }; __m128i x0, x1, x2, x3, bm; bm = _mm_load_si128 ( (__m128i*) sbitmask); x0 = _mm_loadl_epi64 ( (__m128i*) &bb); // 0000000000000000:8040201008040201 // extend bits to bytes x0 = _mm_unpacklo_epi8 (x0, x0); // 8080404020201010:0808040402020101 x2 = _mm_unpackhi_epi16 (x0, x0); // 8080808040404040:2020202010101010 x0 = _mm_unpacklo_epi16 (x0, x0); // 0808080804040404:0202020201010101 x1 = _mm_unpackhi_epi32 (x0, x0); // 0808080808080808:0404040404040404 x0 = _mm_unpacklo_epi32 (x0, x0); // 0202020202020202:0101010101010101 x3 = _mm_unpackhi_epi32 (x2, x2); // 8080808080808080:4040404040404040 x2 = _mm_unpacklo_epi32 (x2, x2); // 2020202020202020:1010101010101010 x0 = _mm_cmpeq_epi8 (_mm_and_si128 (x0, bm), bm); x1 = _mm_cmpeq_epi8 (_mm_and_si128 (x1, bm), bm); x2 = _mm_cmpeq_epi8 (_mm_and_si128 (x2, bm), bm); x3 = _mm_cmpeq_epi8 (_mm_and_si128 (x3, bm), bm); // multiply by "and" with -1 or 0 __m128i* pw = (__m128i*) weights; x0 = _mm_and_si128 (x0, pw[0]); x1 = _mm_and_si128 (x1, pw[1]); x2 = _mm_and_si128 (x2, pw[2]); x3 = _mm_and_si128 (x3, pw[3]); // add all bytes bm = _mm_setzero_si128 (); x0 = _mm_sad_epu8 (x0, bm); x0 = _mm_add_epi16 (x0, _mm_sad_epu8 (x1, bm)); x0 = _mm_add_epi16 (x0, _mm_sad_epu8 (x2, bm)); x0 = _mm_add_epi16 (x0, _mm_sad_epu8 (x3, bm)); return _mm_extract_epi16 (x0, 0) + _mm_extract_epi16 (x0, 4); }

I made a C++ wrapper for BitBoard[2] arrays. A base class for the memory layout and two (or three) special register derivates, like <GP>, <XMM> and <MMX>, with most binary and unary operators overloaded.

by **Jean-Francois Romang** » 12 May 2006, 16:52

Thanks for your reply ! I was not aware of the BitScanForward and BitScanReverse intrinsics ... but I think it's Microsoft sp?cific ? I'am working on Linux with Intel compiler...I've not found those in Intel's manual, but I will try to find them.

I will try the De Bruijn multiplication ; currently I'm using

Code: Select all: int bitboard_firstOne(bitboard b) { if (b>>48) return (63-fbits[b>>48]); if ((b>>32)&65535) return (79-(fbits[(b>>32)&65535])); if ((b>>16)&65535) return (95-(fbits[(b>>16)&65535])); return (111-(fbits[b&65535])); }

and in seems not very efficient in 64 bit

Ok for bit[64]*byte[64] dot product but...where is it usefull in your chess program ? Maybe in evaluation function ? (Sorry if I miss the big thing, i have much to learn

)

And what about the fill stuff ? I'm having big troubles to find a way to compute king safety efficiently ; maybe bitboard filling can help there ?

by **Gerd Isenberg** » 12 May 2006, 18:25

... and in seems not very efficient in 64 bit Very Happy

Yes, 64-KByte lookups may pollute your L1-Cache somehow...
Three conditional branches may produce a miss-predicted jump up and then... You may also try Matt Taylor's folded BitScan with 32-bit multiplication or some longlong-double conversion, you'll find in the archives here...

Ok for bit[64]*byte[64] dot product but...where is it usefull in your chess program ? Maybe in evaluation function ?

Yes, probably to do a weighted popcount of attack-sets for several piece types. Usually center is more important than the edges. You may also have array of weight arrays indexed by some king square to consider attacks/defends near the king. For instance with some bitwise instructions you may combine two bishop- (usually on different square colors) with two knight attacks, and to "code" those attacks with two correspondating bits of two bitboards. Value "one" for odd number of attacks and "two" for at least double attacks, to call the dot-product only two - instead of four times for literally all "light" pieces per side ...

There is even a faster dot-product with 90-degree rotation which might be used with precalculated weight arrays. Also if your average weight is below let say 64, you may save some additional cycles to do the horizontal add via _mm_sad_epu8 only once after three vertical bytewise adds (probably with saturation). So you can go with about 34 cycles or so - and some day if we get 128-bit alus 17 cycles or even less...

And what about the fill stuff ? I'm having big troubles to find a way to compute king safety efficiently ; maybe bitboard filling can help there ?

Kogge-Stone Fillstuff is quite expensive register computation, but has the advantage to get sliding attacks setwise, saving a loop and bitscans, eg. with for all rooks and queens per side in the south direction (gen(erator) == rooks|queens; pro(pagator) == empty squares):

Code: Select all: __forceinline u64 southAttacks(u64 gen, u64 pro) { gen |= pro & (gen >> 8); pro = pro & (pro >> 8); gen |= pro & (gen >> 16); pro = pro & (pro >> 16); gen |= pro & (gen >> 32); return gen >> 8; }

With quad-bitboards (aka a pair of 128-bit xmm registers) and vertical nibbles with up to 15 "random" piece codes, there are some creative possibilities, eg. to consider the king and sorrounding squares on different rays per direction to get a set of "virtual" sliding attacks from those squares - so you can easily look with some intersections whether own or opposite rooks/bishops/queen defend/attack those squares, and how often...

Code: Select all: template <class T> __forceinline void southAttacks(QB& t, const QB& s, const u64 &empty) { T* pt = (T*)&t; T g0((T&)s.bb[0]); T g1((T&)s.bb[2]); T p(empty); g0 |= p & (g0 >> 8); g1 |= p & (g1 >> 8); p = p & ( p >> 8); g0 |= p & (g0 >> 16); g1 |= p & (g1 >> 16); p = p & ( p >> 16); g0 |= p & (g0 >> 32); g1 |= p & (g1 >> 32); pt[0] = (g0 >> 8); pt[1] = (g1 >> 8); }

T is either XMM (the intrinsic wrapper) or GP (standard C++ with u64[4], MMX sucks somehow with msc) - the generated assembly looks promising so far. The idea is to push ipc to the max by interlaced scheduling (done hopefully by the compiler with inlined functions) of different direction fills. One may also try to mix different register files for best performance.

by **diepeveen** » 13 May 2006, 02:57

Gerd,
about intrinsics in microsoft visual c++ 2005 compiler.

You might take a look at the assembly it generates.

Might be ugly slow for AMD... ...especially when 64 bits gets compiled to 32 bits. See for example Rybka there.

Rumours say it's doing 2 vector path instructions within 3 instructions. That should give quite some penalties.

Doesn't it?

Must be intrinsics problem in Visual and not his bug.

Vincent

by **Gerd Isenberg** » 13 May 2006, 08:07

Hi Vincent,
yes of course, i will have a look to the generated assembly - and yes, if you or Vas bitscan a 64-bit word in 32-bit mode with bsf-intrinscis or inline asm, you'll indeed need two expensive 8-cycle vector-path bsf reg32,reg32 instructions. Bsr is even more expensive 11/12 cycles vector path for 32/64-bit reg/reg. The intrinsics prototypes allready look a bit "expensive" returning a bool whether a bit was found and passing the found bitindex via pointer or reference - but i guess compiler are quite good in optimizing this - in opposition to former inline assembly

We have some portable bsf-alternatives such as Matt Taylor's routine for 32-bit mode and/or classical De Bruijn multiplication in 64-bit mode. Both require a lookup, most likely L1 - multiplication has become a relative cheap 3/4 cycle direct path instruction.

So actually for me Matt's folded routine is fastest in 32-bit mode, a 3-cycles direct path multiplication and a lookup. For 64 bit-mode one has to try - if inlined very often, the reduced codesize of bsf reg64,reg64 may pay off - moving a 64-bit constant as imul-factor into a register also requires some more code bytes in comparison with a 32-bit immediate imul.

I understand that intel/amd like to save siclicon and make complex instructions like "loop" a lame vector path duck, while compiler guys are forced to replace loop by dec ecx; jnz or even sub ecx,1; jnz. But that one might beat a pure assembly instruction like bsf with a complete c-routine is somehow strange ;-)

Gerd

by **Jean-Francois Romang** » 13 May 2006, 10:19

Thanks for your helpfull reply, I will try all this techniques, hoping some speedup and improvement in my eval function

Winboard Forum

Bitboards on x86_64 and SSE2 extensions

Bitboards on x86_64 and SSE2 extensions

Re: Bitboards on x86_64 and SSE2 extensions

Re: Bitboards on x86_64 and SSE2 extensions

Re: Bitboards on x86_64 and SSE2 extensions

Re: Bitboards on x86_64 and SSE2 extensions

Re: Bitboards on x86_64 and SSE2 extensions

Re: Bitboards on x86_64 and SSE2 extensions

Who is online