Winboard Forum

by **H.G.Muller** » 05 Sep 2006, 23:00

Upto now I was almost totally ignorant about bitboards, so forgive me if I am inquiring about the obvious.

I understood that people use rotated bitboards to be able to isolate a rank, file or diagonal as a group of 6 contiguous bits, so that it can be used as an index in an array (the other index being the position of the attacker on that ray). This then makes it easy to look up the accessible squares.

Why doesn't one use simple multiplication to pack the bits of a single file or diagonal in the most-significant byte of the bitboard (which was originally packed by rank)? I am thinking about something like:

Code: Select all: b = AllPieces>>FileNr; /* files are numbered 0-7 */ b &= 0x0101010101010101LL; /* isolate the file */ b *= 0x8040201008040201LL; /* pack bits */ b >>= 56; /* bring to LSB */ b = FileAttacks[RankNr][b]; /* lookup reachable squares */ b <<= FileNr; /* Move it to the relevant file */

For diagonals it should work similar (with multiplier 0x0101010101010101LL). If we want to ignore the end bits we would of course shift by 57 and substitute the outermost one bits in the multiplier for zeros.

This seems to work without big tables that would pollute my cache (I can use 8 x 64 plus a shift, rather than 64 x 64). Even on a 32-bit machine it should be possible to do this reasonably efficiently, since I could do with only two integer multiplies (const_L*high_word + const_H * low_word).

by **Pradu** » 06 Sep 2006, 01:11

Does this method really work faster for you than normal rotated bitboards in your chess program? The multiply (or numerous shift-adds) and pre-computations looks very expensive if you're just doing it for one file or rank at a time (2x for rook moves, 2x for bishop moves). By the way, you can hash simultaneously the complete rook/bishop moves (both file and rank and both diagonal and anti-diagonal) with 1 multiply and without extraneous shifts to extract the file/rank/diagonal and then put it back in its correct place. I guess Gerd could probably tell us if the usage of the cache really affects speed much here compared to normal rotated bitboards.

In my opinion, I think that rotated bitboard methods are a tad bit outdated in both speed and versatility compared to magic bitboard methods in machines with relatively fast multiply and decent memory. :mrgreen:

In my opinion of course...

by **Edsel Apostol** » 06 Sep 2006, 03:09

Code: Select all: b = AllPieces>>FileNr; /* files are numbered 0-7 */ b &= 0x0101010101010101LL; /* isolate the file */ b *= 0x8040201008040201LL; /* pack bits */ b >>= 56; /* bring to LSB */ b = FileAttacks[RankNr][b]; /* lookup reachable squares */ b <<= FileNr; /* Move it to the relevant file */

Your code looks something like what Gerd has done with his 1.5Kbyte anti-rotated bitboard approach. Browse the topics about Gerd's posts here: http://wbforum.vpittlik.org/viewtopic.php?t=4523&start=40

Edsel

by **Gerd Isenberg** » 07 Sep 2006, 00:07

H.G.Muller wrote:Upto now I was almost totally ignorant about bitboards, so forgive me if I am inquiring about the obvious.

I understood that people use rotated bitboards to be able to isolate a rank, file or diagonal as a group of 6 contiguous bits, so that it can be used as an index in an array (the other index being the position of the attacker on that ray). This then makes it easy to look up the accessible squares.

Why doesn't one use simple multiplication to pack the bits of a single file or diagonal in the most-significant byte of the bitboard (which was originally packed by rank)? I am thinking about something like:
Code: Select all
b = AllPieces>>FileNr; /* files are numbered 0-7 */ b &= 0x0101010101010101LL; /* isolate the file */ b *= 0x8040201008040201LL; /* pack bits */ b >>= 56; /* bring to LSB */ b = FileAttacks[RankNr][b]; /* lookup reachable squares */ b <<= FileNr; /* Move it to the relevant file */
For diagonals it should work similar (with multiplier 0x0101010101010101LL). If we want to ignore the end bits we would of course shift by 57 and substitute the outermost one bits in the multiplier for zeros.

This seems to work without big tables that would pollute my cache (I can use 8 x 64 plus a shift, rather than 64 x 64). Even on a 32-bit machine it should be possible to do this reasonably efficiently, since I could do with only two integer multiplies (const_L*high_word + const_H * low_word).

Your approach is really a ressource friendly compromize. With 6-bit occupied index, FileAttacks needs only 8*64 bitboards = 4KByte.

Code: Select all: b = AllPieces>>FileNr; /* files are numbered 0-7 */ b &= 0x0101010101010101LL; /* isolate the file */ b *= 0x0080402010080402LL; /* pack bits */ b >>= 58; /* bring six bit to LSB */ b = FileAttacks[RankNr][b]; /* lookup reachable squares */ b <<= FileNr; /* Move it to the relevant file */

For diagonals you may apply a generalized up/down shift by -7..0..+7 to move the diagonal to the main-diagonal. It requires a 4K lookup as well.
with

Code: Select all: x (>>) -a => x << a x (<<) -a => x >> a

Code: Select all: b = AllPieces(>>)(DiaNr*8); /* diagonals are numbered -7..0..+7 */ b &= 0x8040201008040201LL; /* isolate the diagonal */ b *= 0x0202020202020202LL; /* pack bits */ b >>= 58; /* bring six bit to LSB */ b = MDiaAttacks[FileNr][b]; /* lookup reachable squares */ b (<<)= DiaNr*8; /* Move upDown back to the relevant diagonal */

16KByte for all directions - or 12.5 with bytewise rank-attacks.
Congrats!

by **H.G.Muller** » 07 Sep 2006, 09:24

The up/down shift is of course nasty since most CPUs don't support that, and you can't implement it by a worst-case shift in one direction followed by a shift in the other without bits falling off, for these wide data types.

Since it shifts over a multiple of 8, though, I would be tempted to do it by means of an unaligned memory access. There also is a penalty for that, but for reads it is not very big. I could not do it directly from the array MDiaAttacks, though, since I would read part of the neighbors for the part I want to see cleared.

So I would have to specifially store it to memory, which might cause so much overhead that it would defeat the purpose:

Code: Select all: static long long int Longs[3]; /* initialized at zero */ const char *Bytes = (char*) &(Longs[1]); /* overlays Longs[1] with array of char */ Longs[1] = MDiaAttacks[FileNr][b]; b = *(long long int*) (Bytes+DiaNr); /* bag of dirty tricks #666 */

You could combine this for the two diagonals, though. The AllPiecess bitboard could be zero-embedded at all times, and the single unaligned read could align if for Rank and two kinds of diagonal attacks. After ORing the attack-boards together, you could implement the reverse 'shift'.

by **Gerd Isenberg** » 07 Sep 2006, 21:22

H.G.Muller wrote:The up/down shift is of course nasty since most CPUs don't support that, and you can't implement it by a worst-case shift in one direction followed by a shift in the other without bits falling off, for these wide data types.

Since it shifts over a multiple of 8, though, I would be tempted to do it by means of an unaligned memory access. There also is a penalty for that, but for reads it is not very big. I could not do it directly from the array MDiaAttacks, though, since I would read part of the neighbors for the part I want to see cleared.

So I would have to specifially store it to memory, which might cause so much overhead that it would defeat the purpose:
Code: Select all
static long long int Longs[3]; /* initialized at zero */ const char *Bytes = (char*) &(Longs[1]); /* overlays Longs[1] with array of char */ Longs[1] = MDiaAttacks[FileNr][b]; b = *(long long int*) (Bytes+DiaNr); /* bag of dirty tricks #666 */

You could combine this for the two diagonals, though. The AllPiecess bitboard could be zero-embedded at all times, and the single unaligned read could align if for Rank and two kinds of diagonal attacks. After ORing the attack-boards together, you could implement the reverse 'shift'.

Nice idea with an unaliged 64-bit access. One has to take care on page or even cache line boundaries with the sorrounding seven additional ranks above and below the occupied board ;-)

The misaligned penalty seems negligible, from the amd64 optimization manul: "A misaligned store or load operation suffers a minimum one-cycle penalty in the processor’s loadstore pipeline."

I had the the 64-bit shld and shrd instructions in mind, aka the x64-intrinsics __shiftleft128 and __shiftright128. To do the -56...+56 generalized shifts by adding 64, and to shift by +8...+120 to/from the higher register. But the shift value is always modulo 64.

Something like this with 7 instructions seems too expensive compared to the dense 1.5 KByte approach, since it has to be applied twice per routine with opposite directions. Otoh we can keep the conditional shift amounts for both generalized shifts and only need 9 instructions for that in total...

Code: Select all: u64 genShift (u64 x, int by) { int mr = by >> 31; // -1 if negative x <<= by & ~mr; // shift left if positive x >>= -by & mr; // shift right if negative return x; }

by **Andreas Guettinger** » 07 Sep 2006, 23:47

Gerd Isenberg wrote:For diagonals you may apply a generalized up/down shift by -7..0..+7 to move the diagonal to the main-diagonal. It requires a 4K lookup as well.
with
Code: Select all
x (>>) -a => x << a x (<<) -a => x >> a

Code: Select all
b = AllPieces(>>)(DiaNr*8); /* diagonals are numbered -7..0..+7 */ b &= 0x8040201008040201LL; /* isolate the diagonal */ b *= 0x0202020202020202LL; /* pack bits */ b >>= 58; /* bring six bit to LSB */ b = MDiaAttacks[FileNr][b]; /* lookup reachable squares */ b (<<)= DiaNr*8; /* Move upDown back to the relevant diagonal */

16KByte for all directions - or 12.5 with bytewise rank-attacks.
Congrats!

I have only a vague idea (so I hope) of what you two speak about, but would it not be possible to shift always to the same site because not main diagonals are always shorter and then somehow compensate for the shift in rank?

i.e.

Code: Select all: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0

shifted << (rank+1)

Code: Select all: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0

instead of shift to right

Code: Select all: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1

by **Gerd Isenberg** » 08 Sep 2006, 16:04

The problem with horizontal left-shifts (board wise right) are the possible wraps from the h- to a-file. Vertical up-down shifts simply throw unwanted bits of short diagonals out of the board.

On may compensate the wrap with it's implicite up-shift for determining the occupied 64-state and addressing the attackset. But it requires at least one additional mask after shifting the attack set of the main-diagonal back to the original diagonal. If you need the diagonal mask[sq] anyway, you can go with the double upfill-multiplication of the 1.5KByte approach...

Code: Select all: u64 diagonalAttacks(u64 allPieces, u32 sq) { u32 fileNr = sq & 7; maskD = ( diagonalMask[sq] & allPieces ); occ64 = ( 0x0202020202020202 * maskD ) >> 58; upfill = ( 0x0101010101010101 * firstRankAttacks[occ64][fileNr ] ); return ( diagonalMask[sq] & upfill ); }

Assume we want attacks of c2-bishop on the d1-a4 sub-xdiagonal, FileNr = 2:

Code: Select all: 0x0000000001020408 * 0x0202020202020202 = 0x1E1E1E1E1E1C1810 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1<1 1 1 0 0 0> upper six bit 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 = occ64 state 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 = 000111B = 7 0 0 0 0 0 0 0 0 * 0 1 0 0 0 0 0 0 = 0 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0<1>0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 a1 ==> FirstRankAttack[7][2] := 0x0a attack-byte on the first rank 0x0a * 0x0101010101010101 = 0x0A0A0A0A0A0A0A0A 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 * 1 0 0 0 0 0 0 0 = 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 <0>1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0<1>0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1<0>1 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0<1>0 0 0 0 0x0A0A0A0A0A0A0A0A & 0x0000000001020408 = 0x0000000000020008 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 & 0 0 0 0 0 0 0 0 = 0 0 0 0 0 0 0 0 <0>1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0<1>0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1<0>1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0<1>0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0

Gerd

by **Andreas Guettinger** » 08 Sep 2006, 17:35

Nice. Does that mean with only 3 "magic" numbers and a 8*64 64-bit array plus some masks I can replace the rotated bitboards, or did I get something wrong?
I would like to try this. Is it reasonably fast?

regards
Andy

by **Gerd Isenberg** » 08 Sep 2006, 18:35

Andreas Guettinger wrote:Nice. Does that mean with only 3 "magic" numbers and a 8*64 64-bit array plus some masks I can replace the rotated bitboards, or did I get something wrong?
I would like to try this. Is it reasonably fast?

regards
Andy

Yes, i think so. See the 64- or 32-bit versions in the Compact Bitboard Attacks thread near the bottom of page 3:
http://wbforum.vpittlik.org/viewtopic.p ... 902&t=4523
http://wbforum.vpittlik.org/viewtopic.p ... 463&t=4523

Memory requirements are 64*8 bytes for the bytewise first rank attack table plus the two 64 bitboard mask-arrays for the diagonals/antidiagonals, thus 1.5KByte in total. One may even thrink the diagonals/antidiagonals lookups down to 15 bitboards each, indexed by the 0..14 diagonal/antidiagonal index, but that requires additional small computation and the masks indexed by square may also needed elsewhere...

I think it is competitive on K8 with 3,4 cycles imul - if you have a lot of cache traffic elsewhere. As always with memory versus computation it depends, and one has to try. How does the code of most likely two scheduled getters interact each other and with it's context, how often do you need to fetch cachelines from L2 with bigger tables? At least the approach is a cheap backup to test against other approaches. See the results of the dumb snoob-test here to get a vague idea:

http://wbforum.vpittlik.org/viewtopic.p ... 949&t=5452
http://wbforum.vpittlik.org/viewtopic.p ... 991&t=5452

I think probably more important than fast attack-getters is to serialize move- or capture-target (sub)sets faster with precalulated move- and capturelists, specially for qsearch... Since i intend to waste memory for that stuff, i'll stay with the 1.5KByte approach, beside quad-bitboard kogge-stone fills ;-)

Cheers,
Gerd

Winboard Forum

A beginner's question for bitboard experts

A beginner's question for bitboard experts

Re: A beginner's question for bitboard experts

Re: A beginner's question for bitboard experts

Re: A beginner's question for bitboard experts

Re: A beginner's question for bitboard experts

Re: A beginner's question for bitboard experts

Re: A beginner's question for bitboard experts

Re: A beginner's question for bitboard experts

Re: A beginner's question for bitboard experts

Re: A beginner's question for bitboard experts

Who is online