Winboard Forum

by **Gerd Isenberg** » 21 Mar 2006, 17:25

Gerd Isenberg wrote:And finally the Markoff compressor in conjunction with zeroray:
Code: Select all
i32 delta = file - rank; u32 minsq = rank + ((delta>>31)&delta); // min(rank,file) u32 shift = delta - ((delta>>31)&delta) * 9; u32 occ64 = ((occupiedBB & diaomask[sq]) * 0x0202020202020202) >> 58; u32 occ70 = dense512To70[occ64][minsq]; u64 attack = (seventyDia0Attacks[occ70] << shift) & diafmask[sq];
70 Bitboards per direction, 2240 bytes in total.

oups, small mistake here as well in the other diagonal zeroray attack-getters - we need two different diagonal masks - one with outer squares masks off for the 6-bit occupied state - and the second with all bits on except the source square itself, to mask the shifted diagonal attacks.

<Edit>
correction of the correction for the wrong reason!

We do not need two different ray masks for diagonals if we keep the outer bits inside!
Masking off the outer squares by "and" is simply not necessary, since with multiplying 0x02...02 and >> 58 we implicitly mask off the outer bits anyway. There are still a few possible improvements. For the diagonal masks, it is not necessary to index with source square but with delta+7 into a 15 element bitboard array for each diagonal.

The hopefully final mistake ;-)

is that horicontal shifted diagonals, where the rank is less than the file, need also their occupied 64 index shifted right (board left) to virtually shift to the a1-h8 main diagonal. This might be considered already with the magic multiply - or an extra shift right by shift&7. Ok next error report may follow soon after implementing, testing - but i like to "debug" code in my brain first - sorry if it becomes boring ;-)

</Edit>

Code: Select all: i32 delta = file - rank; u32 rayidx = delta + 7; // 0..14 u32 minsq = rank + ((delta>>31)&delta); // min(rank,file) u32 shift = delta - ((delta>>31)&delta) * 9; u32 occ64 = ((occupiedBB & diamask[rayidx]) * magic[rayidx]) >> 58; u32 occ70 = dense512To70[occ64][minsq]; u64 attack = (seventyDia0Attacks[occ70] << shift) & diamask[rayidx]; the rayidx == delta+7 index 0 1 2 3 4 5 6 7 1 2 3 4 5 6 7 8 2 3 4 5 6 7 8 9 3 4 5 6 7 8 9 10 4 5 6 7 8 9 10 11 5 6 7 8 9 10 11 12 6 7 8 9 10 11 12 13 7 8 9 10 11 12 13 14 the intended diagonal magics: magic[0..7] = 0x0202020202020202; magic[8] = 0x0202020202020202>>1; magic[9] = 0x0202020202020202>>2; ... magic[12] = 0x0202020202020202>>5; magic[13] = 0x0202020202020202>>6; magig[14] = 0;

by **Gerd Isenberg** » 22 Mar 2006, 15:05

What about further compressing?

70 stored attack bitboards on all ranks - for all rays, where collapsed_files_index might applied, thus for diagonals, antidiagonals and ranks.
Lets take the "up-shifted" a2-g8 diagonal with bishop on d5, blocker on b3 to clarify the required steps. The mapped square index (minsq) of d5 (file 3, rank 4) on the a2-g8 diagonal is 3, the minimum of rank, file:

1. mask the whole diagonal, occupiedBB & diamask[rayidx]

Code: Select all: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

2. fill up with one left shift (board right), mul 0x0202020202020202

Code: Select all: 0 0 1 0 1 0 0 0 <== six upper bits 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

3.take the upper six bit as occ64, consider the "mirrored" board for arithmetical order:
==> occ64 => 000101B == 5

4. get the arbitrary occ70 by 64*8 byte lookup with [occ64][minsq] ==> occ70 = dense512To70[5][3]

5. get one of the 70 rank attack bitboards on _all_ ranks by [occ70] bitboard lookup.
seventyAttacksOnAllRanks[occ70]:

Code: Select all: 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1

6. finally "and" diamask again to get the a2-g8-attacks of our bishop on d5

Code: Select all: 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

The horizontal shifted diagonals, where file is greater than rank, require an additional shift step. Lets try the b1-h7 diagonal with bishop on e4 and blocker on c2 (to get the "same" occupied state). The mapped square index (minsq) of e4 (file 4, rank 3) on the a2-g8 diagonal is 3, the minimum of rank, file:

1. mask the whole diagonal, occupiedBB & diamask[rayidx]

Code: Select all: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0

2. fill up considering the left shifted (board right!) diagonal, thus we have to multiply with 0x0101010101010101 which is magic[rayidx] here, to get the occupied state on the virtually mapped a1-h8 diagonal. For even more shifted diagonals, we need to mul with 0x0080808080808080 and further right shifts, so that it we loose some "lower" bits, which has however no influence on the upper six bits.

Code: Select all: 0 0 1 0 1 0 0 0 <== six upper bits 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0

3.take the upper six bit as occ64, consider the "mirrored" board for arithmetical order:
==> occ64 => 000101B == 5, as expected, the relative same state as in the a2-g8 sample.

4. get the arbitrary occ70 by 64*8 byte lookup with [occ64][minsq] ==> occ70 = dense512To70[5][3]

5. get one of the 70 rank attack bitboards on _all_ ranks by [occ70] bitboard lookup.
seventyAttacksOnAllRanks[occ70], of course also equal to the a2-g8 sample:

Code: Select all: 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1

6. since the b1-h7 diagonal was virtually shifted right (board left) to the a1-h8 diagonal, we have to shift left (board right) before applying the final mask:

Code: Select all: 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 0 0 1 1 0 1 1 1

7. finally "and" diamask again to get the b1-h7-attacks of the bishop on e4

Code: Select all: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0

Following slightly simplified code for diagonals, saves some additional space because seventyAttacksOnAllRanks[70] may be shared by at least diagonals and antidiagonals:

Code: Select all: u32 file = sq64 & 7; u32 rank = sq64 >> 3; i32 delta = file - rank; u32 shift = delta & ~(delta>>31); // only > 0 for the b1-h7 ..h1-h1 diagonals u32 rayidx = delta + 7; // 0..14 u32 minsq = file - shift; // min(rank,file) u32 occ64 = ((occupiedBB & diamask[rayidx]) * magic[rayidx]) >> 58; u32 occ70 = dense512To70[occ64][minsq]; u64 attack = (seventyAttacksOnAllRanks[occ70] << shift) & diamask[rayidx];

Let's count pure instructions to have a vague impression how it compares with two Kogge-Stome fills:
4 64-bit loads, assuming all fast L1-cache hits
8 32-bit instructions &>>->>+~&-
5 64-bit instructions &*>><<& where mul is ~4 cycles
17 instructions in total with some "parallel" gain.

Direct attack calculation with Kogge-Stone needs 32 64-bit instructions.

Code: Select all: u64 soWestOne(u64 b) {return (b>>9) & notH;} u64 noEastOne(u64 b) {return (b<<9) & notA;} u64 soWestOccl(u64 gen, u64 pro) { pro = pro & notH; gen |= pro & (gen >> 9); pro = pro & (pro >> 9); gen |= pro & (gen >> 18); pro = pro & (pro >> 18); gen |= pro & (gen >> 36); return gen; } u64 noEastOccl(u64 gen, u64 pro) { pro = pro & notA; gen |= pro & (gen << 9); pro = pro & (pro << 9); gen |= pro & (gen << 18); pro = pro & (pro << 18); gen |= pro & (gen << 36); return gen; }

So the compressed attack lookup still looks favourable, even for the "complicated" diagonals.

But of course Kogge-Stone may fill several sliders or pseudo sliders in one run, which of course requires a complete different bitboard infrastructure.
Also Kogge-Stone with enough 64-bit (or SIMD) registers available, might gain a much higher ipc, if scheduling multiple directions in parallel with pure register calculation and might hide some pending memory accesses.
Kogge-Stone is nice to get pinned pieces, covered checker and the union of all sliding attacks, not to mention progressive attacks (all attacks in two moves) and looking for trajectories. Also with simd (mmx,sse2), we can apply the cheap occ ^ (occ - 2*rooks) trick for one of eight directions.

As already said for the lookups, whether the compressions are worth, depends a lot...
With fast multiplication we can get rid of the rotated bitboards and their incremental update. We only need one occupied bitboard. With the compression and some additional computation a ressource-friendly approach with less than 4KByte is possible.

Cheers,
Gerd

by **Gerd Isenberg** » 22 Mar 2006, 18:56

arggg - why so complicated!?
No horizontal shift required for diagonals and antidiagonals at all!
Simply up- and down shift, keeping file as the square of all rays on a vertically extended chessboard with 7+8+7 ranks!

^up shift

Code: Select all: 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

the main diagonal

Code: Select all: 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0

v down shift

Code: Select all: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0

That's it:

Code: Select all: u32 file = sq64 & 7; u32 rank = sq64 >> 3; u32 rayidx = file - rank + 7; // 0..14 u32 occ64 = ((occupiedBB & diamask[rayidx]) * 0x0202020202020202) >> 58; u32 occ70 = dense512To70[occ64][file]; u64 attack = seventyAttacksOnAllRanks[occ70] & diamask[rayidx];

No the balance looks a bit better for lookups
2 64-bit loads, assuming all fast L1-cache hits
1 8-bit load with zero extension
4 32-bit instructions &>>-+
4 64-bit instructions &*>>&

Memory requirement for all rays:

Code: Select all: char dense512To70[64][8]; // shared by all directions u64 seventyAttacksOnAllRanks[70]; // shared by dia and ranks u64 seventyAttacksOnAllFiles[70]; u64 diamask[15]; u64 antidiamask[15]; u64 rankmask[8]; u64 filemask[8];

So that are 8*(70+70+46) + 512 = 1488 + 512 = 2000 Bytes
Is that true? Rotated lookups in a 2KByte 2716 Eprom?
Why didn't i get this 30 years before?

Cheers,
Gerd

by **Tord Romstad** » 22 Mar 2006, 20:54

Gerd Isenberg wrote:Btw. why does this thread degenerate to a kind of monologue?

Hi Gerd,

I can only answer for myself, but the sad truth is that I can't follow you anymore. I haven't had much time to read the WB forum the last couple of weeks. During the time I need to understand one of your posts, you always post two more. :wink:

I am still a bitboard newbie, and don't have much to contribute to your fascinating discussion at the moment. Perhaps I will have some interesting contributions when I start doing chess programming again some time in the second half of this year. I fear that it is more likely that I will just bother you with stupid and elementary questions, though.

Meanwhile, thanks for sharing all these ideas! Sooner or later, I will definitely take a closer look at them.

Tord

by **Tom Likens** » 22 Mar 2006, 21:27

Tord Romstad wrote:
Gerd Isenberg wrote:Btw. why does this thread degenerate to a kind of monologue?

Hi Gerd,

I can only answer for myself, but the sad truth is that I can't follow you anymore. I haven't had much time to read the WB forum the last couple of weeks. During the time I need to understand one of your posts, you always post two more.

I am still a bitboard newbie, and don't have much to contribute to your fascinating discussion at the moment. Perhaps I will have some interesting contributions when I start doing chess programming again some time in the second half of this year. I fear that it is more likely that I will just bother you with stupid and elementary questions, though.

Meanwhile, thanks for sharing all these ideas! Sooner or later, I will definitely take a closer look at them.

Tord

Hey Gerd/Tord (and Tony),

I hope you're happy Gerd, I was completely ready to deep-six bitboards and move onto a hybrid 0x88 architecture I've had my eye on, (as you know, bitboards are completely exhausted, as all the best ideas have already been discovered!! [LOL]). Now though, I have to try these new ideas--the potential return is simply too high to ignore!

Once again Gerd, I hope you're happy! :evil:

Seriously, though as Tord commented this is fascinating stuff. Because of this there will definitely be at least one more release of Djinn before I concentrate 100% on the new engine. The idea that the bitboards could be compressed to this extent is mind-boggling. Hyatt definitely misses out by not following this forum.

To answer your query, these ideas are deep. I've been working with bitboards for *years* and have a fairly decent understanding of them, but like Tord I am having trouble following all your ideas (although I do believe I understand DeBruijn sequences now). Once the rest of us digest the details of what you propose, we'll have plenty of questions and suggestions. So please, bear with us.

take care,
--tom

by **Gerd Isenberg** » 22 Mar 2006, 21:45

Hi Tord,
Hi Tom,

still not tested ;-)

will take some time - as always it's so simple when you got the right track. A very nice occurrence where forum discussion and sharing knowledge lead to those conclusions. And i have been so often on the wrong track with this stuff - so thanks to Tony, Steffan, Dann for recent CCC discussions and you of course starting this thread and to publish Sergej's ideas.

Cheers,
Gerd

by **Tony van Roon-Werten** » 23 Mar 2006, 09:33

Tord Romstad wrote:
Gerd Isenberg wrote:Btw. why does this thread degenerate to a kind of monologue?

Hi Gerd,

I can only answer for myself, but the sad truth is that I can't follow you anymore. I haven't had much time to read the WB forum the last couple of weeks. During the time I need to understand one of your posts, you always post two more.

Tord

I'll see if I can give a summary.

First of all, a lot of the compression is based on the fact that if you have a square and a direction and you have applied a mask, only the first 2 "hits" are important.
ie Rook on C1, piece on B1, piece on D1. Whatever pieces there are on the rest of RANK_1 is of no influence for the calculation of the attacks BB. So if you implement the C1,B1,D1 pattern and map (somehow) C1,B1,D1,E1 to the previous implementation, you save space.

Second: About the masks (*). Now everybody uses MASK[square] (times 4 directions) wich is actually silly.
BB>>Rank(sq)*8 and MASK[File(square)] is enough.
(For ranks: BB>>File(sq) and MASK[Rank(sq)])

Same goes for diagonals, only one shouldn't take the "edge" ranks or files, but rather the centre diagonal. Moving this diagonal up or down, makes the unwanted maskbits "fall off"

These 2 ideas can be further optimized, but these are the basic ideas.

(*) one doesn't actually need any masks. fe for hor attacks, Rank1BB^SetBit(File(sq)) will do.

Cheers,

Tony

by **Gerd Isenberg** » 23 Mar 2006, 12:15

Do you think that was all?
No, there's no time like the present.

What is actually possible now, since all precalculated 70 attackset bitboards keep the same information in all ranks (what a waste!),
is to store the 70 distinct rank-attacks in one byte each. That saves one additional indirection and the 70-bitboards, since dense512To70 does no
longer contain an index, but the bytewise rank-attack itself - versus the cost of one additional 64-bit multiplication with 0x0101010101010101
to fillup that first rank to the whole bitboard:

Code: Select all: u32 file = sq64 & 7; u32 rank = sq64 >> 3; u32 rayidx = file - rank + 7; // 0..14 u32 occ64 = (0x0202020202020202 * (occupiedBB & diamask[rayidx])) >> 58; u64 attack = (0x0101010101010101 * seventyByteAttacks[occ64][file]) & diamask[rayidx];

Until now we stay with 512 byte + 2*15 raymask-bitboards for diagonals and antidiagonals. How can we get rid of the raymasks as well? What about taking the a1-h8 main diagonal as a const 0x8040201008040201 and do a generalized up/down shift with 8*(file-rank)? A conditional branch or 6 or so additional instructions?

Code: Select all: u32 file = sq64 & 7; u32 rank = sq64 >> 3; i32 dnshft = (file - rank) * 8; // or i64 i64 up = (i64)(file - rank) >> 63; // -1 if rank > file for upshift u64 mask = ((0x8040201008040201 << -dnshft) & up) // upshift + ((0x8040201008040201 >> dnshft) & ~up); // downshift u32 occ64 = ( 0x0202020202020202 * (occupiedBB & mask)) >> 58; u64 attack = ( 0x0101010101010101 * seventyByteAttacks[occ64][file]) & mask;

16 instructions, one lookup, two muls. Still the half of 2 Kogge-Stones. May be too expensive - but there are some parallel computations possible. With only one lookup we gain a lot with an interlaced scheduling of different directions, eg. diagonals and antidiagonals for bishop attacks. What can we do further for the files?

Now i will enjoy my short vacation outdoor, springtime - calming down a bit from this fascinating stuff ;-)

Cheers,
Gerd

by **Gerd Isenberg** » 23 Mar 2006, 23:00

What can we do further for the files?

<Edit>
oups that did not work, the seven bit difference is too dense for the 1.rank-a-file rotation! Since the intermediate delta 7 shifted 8-bit pattern overlap and may result in overflows. Corrected version comes soon. Stay tuned ;-)

</Edit>

Yes, file-attacks are bit tricky, but we can really share seventyByteAttacks by all directions. Like in Steffan Westcott's collapsed_ranks_index, and allready mentioned in this thread before, there are some cute multipliation tricks to rotate files and ranks. I will try to explain the steps in detail:

1. The occupied bitboard is shifted right (board left) by the amount of file 0..7, so that the desired file becomes the a-file, which get further masked by and 0x0101010101010101. Let say we have a vertical slider on the fourth rank (3) on the already shifted and masked a-file.

2. We are interested only on the six inner bits of that a-file. We want to map the bits so that they become in the same order the upper six bits. We want to shift a2->c8, a3->d8, a4->e8, a5->f8, a6->g8 and a7->h8 So we have to do six left shifts ored together.

Code: Select all: msb63=h8 msb63=h8 +-----+---------+ +---------+ | 1 0 0 0 0 0 0 0 x x 1 0 1 0 0 1 = 100101B = 25H = 37 +-1 0 0 0 0 0 0 0 ... garbage.... | 0 0 0 0 0 0 0 0 ... | 0 0 0 0 0 0 0 0 | 1 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 +-1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 lsb0=a1

Shift amounts:

Code: Select all: {58,59,60,61,62,63} -{ 8,16,24,32,40,48} ={50,43,36,29,22,15} => 0x0004081020408000

Since no overflow can occur because the bit distance of the file is 8 but from the factor is 7, we can replace "or" by "add" and use a 64-bit multiplication with 0x0004081020408000 for that six shifts. In board representation of 64-bit numbers, the prodcut looks like, x marked bits don't care:

Code: Select all: 0x0004081020408000 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 x x 1 0 1 0 0 1 = 100101B = 25H = 37 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 x x x x x x x x 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 x x x x x x x x 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 x x x x x x x x 1 0 0 0 0 0 0 0 * 0 0 0 0 0 1 0 0 = x x x x x x x x 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 x x x x x x x x 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 x x x x x x x x 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 x x x x x x x x

3. we now are ready to get the "rotated" rank attack via lookup seventyByteAttacks[37][3] already zero extended into a bitboard:

Code: Select all: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 1 0 ^

4. Now we need a similar multiplication trick, to rotate 1.rank to the a-file. The constant 0x0002040810204081 does that job, popcount 8, bit 0 is set, so a0 becomes a kind of pivot-square.

Code: Select all: 0x0002040810204081 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 x x x x x x x 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 x x x x x x x 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 x x x x x x x 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 x x x x x x x 0 0 0 0 0 0 0 0 * 0 0 0 0 1 0 0 0 = 0 x x x x x x x 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 x x x x x x x 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 x x x x x x x 0 1 1 0 1 1 1 0 1 0 0 0 0 0 0 1 0 x x x x x x x

5.Finally we mask out garbage and shift left (board right) the attacks to it's original file. So that's how it works with this intented code:

Code: Select all: u32 file = sq64 & 7; u32 rank = sq64 >> 3; u64 fileBB = 0x0101010101010101 & (occupiedBB >> file); // a-file u32 occ64 = ( 0x0004081020408000 * fileBB ) >> 58; u64 attack = ( 0x0002040810204081 * seventyByteAttacks[occ64][rank] ); attack = ( 0x0101010101010101 & attack ) << file;

So files, like diagonals and antidiagonals, need two 64-bit multiplication.
Rank-attacks can take advantage of the fact that six occupied bits are already consecutive, so no mul needed but shifts.

You have the choice now, to try "rotated" bitboards in a lot of memory versus calculation nuances. From 512K, 128K, .... 2000, ... down to 512 byte.

Cheers,
Gerd

by **Gerd Isenberg** » 24 Mar 2006, 15:21

oups that did not work, the seven bit difference is too dense for the 1.rank-a-file rotation! Since the intermediate delta 7 shifted 8-bit pattern overlap and may result in overflows. Corrected version comes soon. Stay tuned

The solution is to use 9-bit distance factors, both for mapping occupied bits as well as for the rank-file rotation. Since both multiplications reverse or mirror the pattern in arithmetical order, the result is finally correct. The occupied state mapping is done via {a2->h8,a3->g8,..,a7->c8} with c2-h7 subdiagonal:

Code: Select all: a-file 0x0080402010080400 ___________ 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0|1 0 0 1 0 1| => reversed occ64 = 101001B 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0 0 * 0 0 0 0 1 0 0 0 = 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

101001B = 0x29 = 41 is the reversed occupied state, due to the {a2->h8,a3->g8,..,a7->c8} mapping. We have to index seventyByteAttacks by this reversed occ64 as well with the reversed rank. 101001B = 0x29 = 41 instead of 100101B = 25H = 37, and rank 4 instead of 3. seventyByteAttacks[41][4] == 0x6e, zero extended, multiplied by the maindiagonal gets the desired file attacks on the h-file:

Code: Select all: 1.rank-attack 6e 0x8040201008040201 _ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 1 0 0|0| => reversed occ64 = 101001B 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0|1| 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1|1| 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 1|1| 0 0 0 0 0 0 0 0 * 0 0 0 1 0 0 0 0 = 1 0 0 0 1 1 1|0| 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 0|1| 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 0 1|1| 0 1 1 1 0 1 1 0 1 0 0 0 0 0 0 0 0 1 1 1 0 1 1|0| ^ 4

Masking with 0x8080808080808080 and shifting right (board left) leaves the final attack set.

Code: Select all: u32 file = sq64 & 7; u32 rank = sq64 >> 3; u64 fileBB = 0x0101010101010101 & (occupiedBB >> file); // a-file u32 rocc64 = ( 0x0080402010080400 * fileBB ) >> 58; u64 attack = ( 0x8040201008040201 * seventyByteAttacks[rocc64][rank^7] ); attack = ( 0x8080808080808080 & attack ) >> (file^7);

Puhh... learning math basics again ;-)

"funny" coincidence:
http://graphics.stanford.edu/~seander/bithacks.html

Compute parity of a byte using 64-bit multiply and modulus division
Code: Select all
unsigned char b; // byte value to compute the parity of bool parity = (((b * 0x0101010101010101ULL) & 0x8040201008040201ULL) % 0x1FF) & 1;
The method above takes around 4 operations, but only works on bytes.

but

Reverse the bits in a byte with 4 operations (64-bit multiply, no division):

Code: Select all
unsigned char b; // reverse this byte b = ((b * 0x80200802ULL) & 0x0884422110ULL) * 0x0101010101ULL >> 32;

by **Gerd Isenberg** » 13 Apr 2006, 17:19

An attempt of a 32-bit version of the file-attacks routine with the shared 512-byte lookup - which is probably also superior in 64-bit mode due to 32-bit imuls with immediate operands. This is intended as a "lower bound" implementation in comparison of methods with less instructions but much greater lookup tables.

Cheers,
Gerd

Code: Select all: #define CACHE_LINE 64 #define CACHE_ALIGN __declspec(align(CACHE_LINE)) typedef unsigned char u8; typedef unsigned int u32; typedef unsigned __int64 u64; extern u8 CACHE_ALIGN firstRankAttacks[64][8]; // file attacks with two reversed rotations u64 fileAttacks(u64 occ, u32 sq) { union {u64 b; struct {u32 l; u32 h;};} b; // little endian! u32 f = sq & 7; b.b = occ; b.l = (b.l >> f) & 0x01010101; b.h = (b.h >> f) & 0x01010101; b.l = (b.l << 4) + b.h; b.h = (b.l * 0x10080402) >> 26; b.l = 0x08040201 * firstRankAttacks[b.h][(sq^56)>>3]; b.h =((b.l << 4) & 0x80808080) >> (f^7); b.l = (b.l & 0x80808080) >> (f^7); return b.b; }

by **Gerd Isenberg** » 15 Apr 2006, 15:59

The vc2005 express edition compiler tends to use three imuls instaed of two for the file-attacks routine. For instance if i introduce an own u32 variable for the occupied64 state, the compiler transforms the expression

Code: Select all: b.l = (b.l << 4) + b.h; u32 occ64 = (b.l * 0x10080402) >> 26;

to

Code: Select all: u32 occ64 = (b.l*0x804020 + b.h*0x10080402) >> 26;

with this assembly in comparison to the original one:

Code: Select all: mov ecx,esi mov ecx,esi and ecx,7 and ecx,7 shr edx,cl shr edx,cl shr eax,cl shr eax,cl xor ecx,7 xor ecx,7 and edx,1010101h and edx,1010101h and eax,1010101h and eax,1010101h imul edx,edx,10080402h shl edx,4 imul eax,eax,804020h add edx,eax add eax,edx imul edx,edx,10080402h shr eax,1Ah shr edx,1Ah mov edx,esi mov eax,esi xor edx,38h xor eax,38h shr edx,3 shr eax,3 movzx eax,firstRankAttacks[edx+eax*8] movzx eax,firstRankAttacks[eax+edx*8] imul eax,eax,8040201h imul eax,eax,8040201h mov edx,eax mov edx,eax and edx,0F8080808h and edx,0F8080808h shl edx,4 shl edx,4 shr edx,cl shr edx,cl and eax,80808080h and eax,80808080h shr eax,cl shr eax,cl

I found that a bit funny, specially as P4 has rather slow imul, and VC is not particular famous to optimize for AMD.

Diagonals and antidiagonals are quite cheap in 32-bit mode, with the lookup compromise of 64 precalculated masks for diagonals and antidiagonals per square:

Code: Select all: u64 diagonalAttacks(u64 occupied, u32 sq) { union {u64 b; struct {u32 l; u32 h;};} b; // little endian! b.b = occupied & diagonalMask[sq]; b.h = ((b.l + b.h) * 0x02020202) >> 26; b.h = b.l = 0x01010101 * firstRankAttacks[b.h][sq&7]; return b.b & diagonalMask[sq]; } mov ecx,diagonalMask[esi*8] mov edx,diagonalMask+4[esi*8] and eax,edx and edi,ecx add eax,edi imul eax,eax,2020202h shr eax,1Ah mov edi,esi and edi,7 movzx eax,firstRankAttacks[edi+eax*8] imul eax,eax,1010101h and edx,eax and ecx,eax

What i like with the byte-lookup - no further index address computation is needed, [b.h][sq&7] may directly calculated by the AGU via [edi+eax*8]. In opposite to int lookups to save the three cycles up-fill multiplication - here an additional lea instruction is needed, which takes two cycles. The additional win to avoid the second imul looks negligible.

Cheers,
Gerd

by **Chan Rasjid** » 26 May 2006, 11:31

Hello,

I originally implemented my new bitboard program using a plain intuitive bitboard generation for sliding attack maps as I did not know rotated bitboard yet. Then I figured it out, ie, the straight-forward method that computes a 0-63 index of "inner occupied states". I was thinking it may be faster, but it ends up
1.8 % slower. I do a fixed depth search of an early middle game when all sliding pieces are on board and the times are :-

P4 1.4 Ghz 128MB

Plain bitboard DepthSearch
6 e2e4 ( 98) depth( 8) pvL(13) maxPly(29) nps( 73927) Qnode( 0%) sec(97.11 )

Rotated bitboard DepthSearch
6 e2e4 ( 98) depth( 8) pvL(13) maxPly(29) nps( 72647) Qnode( 0%) sec(98.82 )

Almost the same - 1.8% difference

I find it difficult to decipher codes of other programs , the only thing I check with Crafty is it seems to have u64 [64][64] for diagonal lookup just as mine, but I don't know how Crafty does it.

I do incremental updates of all sliding attack maps including the king-as-queen maps. When a move is made, the to-sq may truncate a ray. If the from-sq is at the end of a sliding ray, the ray need an extension. It is expensive to compute.

The lookup tables is like global.

Code: Select all: struct board_t { ............... ............... unsigned char rankLookup[8][64];//[file][state] 512 byte u64 ftrLookup[64][64];//file-to-rank 32 k //to mask-off most-end-bit ( variable length ) int rtdDiagMask[2][2][8];//[sqColor][dir][rank] 128 byte u64 rtdDiagLookup[64][64];//[nonRotatedSq][state] 32k }; //the rotated boards that are incrementally updated in make() //part of my a non- recursive search stack struct stack_incremental_t { .............. ....................... u64 bits[2][8];//128 normal BB u64 bbRtdDiag[2][2];//[sq-color][dir] u64 bbFTR;//no need side, rotated file u64 bbPawnFTR[2];// only for king's pawn cover, 2 side };

As I take care of updating BB, the overhead of updates of the above BB's above is almost unnoticable. In the plain bitboard fixed-depth search, these BB's are still updated, only unused, so updates of these BB's is almost not an issue.

My plain bitboard just make use of the basic property that for single-bit :-
u64 u, v;
if ( u - v ) > 0 then (u - v) has 1's in the v position and all the intervening bits. With this and bitwise operations we can get sliding attacks.

The codes below show the code for plain bishop map when a bishop moves. The codes for rook maps is basically the same except for the static tables. Then follows the code using rotated bitboard. pMS->bb[0] and pMS->bb[1] are the 2 (u64 )directions we want.

Code: Select all: void getBAttackMap(brMove_t * pMS, const u64 all){ static const u64 diagonal[2][16] = { 0x0000000000000001, 0x0000000000000102, 0x0000000000010204, 0x0000000001020408, 0x0000000102040810, 0x0000010204081020, 0x0001020408102040, 0x0102040810204080, 0x0204081020408000, 0x0408102040800000, 0x0810204080000000, 0x1020408000000000, 0x2040800000000000, 0x4080000000000000, 0x8000000000000000, 0000000000000000, 0x0000000000000080, 0x0000000000008040, 0x0000000000804020, 0x0000000080402010, 0x0000008040201008, 0x0000804020100804, 0x0080402010080402, 0x8040201008040201, 0x4020100804020100, 0x2010080402010000, 0x1008040201000000, 0x0804020100000000, 0x0402010000000000, 0x0201000000000000, 0x0100000000000000, 0000000000000000 }; //work with local dir[2] u64 dir[2] = { diagonal[0][RANK64(pMS->from) + FILE64(pMS->from)], diagonal[1][7 + RANK64(pMS->from) - FILE64(pMS->from)] }; u64 bit = dir[0] & dir[1]; assert(bit & all); for (int ii = 0; ii <= 1; ii++){ u64 lb = 1; u64 ub = 0x8000000000000000; assert(bit & dir[ii]); u64 i; if ( !(i = dir[ii] & all & BITS_L(bit)) ); else{ while ( i & i - 1){ i &= i - 1; } lb = i & 0-i; } // ( i - j ) > 0 has 1's in bit j and the intervening bits i = dir[ii] & all & BITS_G(bit); if ( i &= 0-i ){ ub = i; } pMS->bb[ii] = ( pMS->dir[ii] = dir[ii]) & ((ub - lb) | ub); } pMS->bal = pMS->bb[0] ^ pMS->bb[1]; assert( (pMS->bb[0] | pMS->bb[1]) == getBMapSq64(pMS->from, all)); return; } //rotated bitboard _inline u64 lookupRtdDiag(const board_t * board, const stack_t * pS, const int sq, const int dir){ int sqColor = SQ_COLOR(board->brd[sq].sq); int r = RANK64(board->brd[sq].sqRtdDiag[dir]); int index = (int)(pS->stack->bbRtdDiag[sqColor][dir] >> (r << 3) + 1) & board->rtdDiagMask[sqColor][dir][r]; return board->rtdDiagLookup[sq][index]; } _inline u64 lookupFile(const board_t * board, const stack_t * pS, const int sq){ return board->ftrLookup[sq][(int)(pS->stack->bbFTR >> (FILE64(sq) << 3) + 1) & 63]; } _inline u64 lookupRank(const board_t * board, const int sq){ int r = RANK64(sq) << 3; return (u64)board->rankLookup[FILE64(sq)][(int)(board->all >> r + 1) & 63] << r; } void getBMapRtd(const board_t * board, stack_t * pS, brMove_t * pMS){ static const u64 diagonal[2][16] = { 0x0000000000000001, 0x0000000000000102, 0x0000000000010204, 0x0000000001020408, 0x0000000102040810, 0x0000010204081020, 0x0001020408102040, 0x0102040810204080, 0x0204081020408000, 0x0408102040800000, 0x0810204080000000, 0x1020408000000000, 0x2040800000000000, 0x4080000000000000, 0x8000000000000000, 0000000000000000, 0x0000000000000080, 0x0000000000008040, 0x0000000000804020, 0x0000000080402010, 0x0000008040201008, 0x0000804020100804, 0x0080402010080402, 0x8040201008040201, 0x4020100804020100, 0x2010080402010000, 0x1008040201000000, 0x0804020100000000, 0x0402010000000000, 0x0201000000000000, 0x0100000000000000, 0000000000000000 }; pMS->dir[0] = diagonal[0][RANK64(pMS->from) + FILE64(pMS->from)]; pMS->dir[1] = diagonal[1][7 + RANK64(pMS->from) - FILE64(pMS->from)]; const board64_t * pBrd = &board->brd[pMS->from]; int sqColor = SQ_COLOR(pBrd->sq); int r = RANK64(pBrd->sqRtdDiag[0]); int index = (int)(pS->stack->bbRtdDiag[sqColor][0] >> (r << 3) + 1) & board->rtdDiagMask[sqColor][0][r]; pMS->bb[0] = board->rtdDiagLookup[pMS->from][index] & pMS->dir[0]; r = RANK64(pBrd->sqRtdDiag[1]); index = (int)(pS->stack->bbRtdDiag[sqColor][1] >> (r << 3) + 1) & board->rtdDiagMask[sqColor][1][r]; pMS->bb[1] = board->rtdDiagLookup[pMS->from][index] & pMS->dir[1]; pMS->bal = pMS->bb[0] ^ pMS->bb[1]; assert( (pMS->bb[0] | pMS->bb[1]) == getBMapSq64(pMS->from, board->all)); return; } void getRMapRtd(const board_t * board, stack_t * pS, brMove_t * pMS){ static const u64 fileRank[2][8] = { FILE_1, FILE_2, FILE_3, FILE_4, FILE_5, FILE_6, FILE_7, FILE_8, RANK_1, RANK_2, RANK_3, RANK_4, RANK_5, RANK_6, RANK_7, RANK_8 }; pMS->dir[0] = fileRank[0][FILE64(pMS->from)]; pMS->dir[1] = fileRank[1][RANK64(pMS->from)]; pMS->bb[0] = lookupFile(board, pS, pMS->from); pMS->bb[1] = lookupRank(board, pMS->from); assert(pMS->bb[0] & pMS->dir[0]); assert(pMS->bb[1] & pMS->dir[1]); assert(!(pMS->bb[0] & ~pMS->dir[0])); assert(!(pMS->bb[1] & ~pMS->dir[1])); pMS->bal = pMS->bb[0] ^ pMS->bb[1]; assert( (pMS->bb[0] | pMS->bb[1]) == getRMapSq64(pMS->from, board->all)); return; }

Maybe those who is having rotated bitboard can just see the 3 inline functions and will know immediately if anything's wrong.

Best Regards
Rasjid

by **Gerd Isenberg** » 06 Jun 2006, 09:45

Hi Rasjid,
well if i understand correct, you tested your "plain intuitive" approach with loop and branches versus "classical" rotated with 4*32KByte lookups. With respect to memory you find the ressource friendly faster, despite loop and branches?

Also, if i understand correctly you don't use the sliding attack-getters on the fly per node if needed, but to incrementally update a kind of attacktables in make/unmake, depending on the move - whether sliding attacks are affected or not.

So depending on the position the routines might be called more rarely than in an on the fly approach - and therefor the lookup-approach might suffer more from L1-cache miss-hits - even if you pollute your cache elsewhere.

It would be interesting to know, how the minimalistic lookup-approach with some mul and 1.5KByte lookup performs - of course you need an amd64 with fast imul for that. P4 sucks for sure.

- Gerd

by **Chan Rasjid** » 06 Jun 2006, 14:59

Hello Gerd,

Welcome back from Torino. IsiChess did very well despite running on one of Zappa's 512 processor !

My lookup is :-
1 x 32k for rotated file
1 x 32k for both rotated diagonals
512byte for rank

basically u64 lookup[64 = non-rotated sq][64 = state]

I have a basic structure to store my slider BB :-

Code: Select all: struct{ u64 bb[2]; u64 dir[2]; u64 bal; int sq;

For a bishop/rook, bb[2] is my 2 directions of my attack maps terminating in a piece or the board edge, inclusive of the slider bit;
dir[2] is the full sliding directions and to have it near in eval() may be useful. bal = bb[0] ^ bb[1] initially for move-extractions.
When bal == 0, it means all bits have been "generated" through
phases.

I use only 1 x 32 k for a combined diagonals lookup as the sq-color bits are disjoint and I ORed both directions per square. So I use masking to get my bb[i] :-

bb[i] = lookupDiag[][] & dir[i];

All is centralized in

make();
incrementalBBUpdates(side);
incrementalBBUpdates(xside);

then genMoves() ie, extraction of moves...etc.

I think some seem to use 4 rays per bishop/rook lookup.
In my case I can have the up/down directions with:

Code: Select all: u = bb[0] & bb[1]; (u - 1) & bb[0];//up ~(u - 1) & bb[0];//down I cannot yet worry about the extra (u -1) plus & ; otherwise I won't have any running program! Without incremental updates, we need to do lookups _on_the_fly on move generations and also maybe in eval(). I don't know which is better. I think there is still some drawbacks when 32k lookups are used; whatever, they are relatively large arrays in my program so maybe not surprising rotated bitboards don't seem better with me. The speed difference is not noticeable. Unless I made some dumb rotated stuff, otherwise a smartest rotated bitboard may not really be critical. Fruit is strong because of an efficient overall implementation of a full chess program, not just a little module within. About your minimalistic codes that I saved, I have yet to try - they are unlike routines from TSCP which I may just cut and paste whenever I like. Probably, 64 bit will benefit it. Best Regards Rasjid

by **Gerd Isenberg** » 07 Jun 2006, 09:21

Welcome back from Torino. IsiChess did very well despite running on one of Zappa's 512 processor !

Thanks, 32-bit IsiChess MMX run on an dual core amd64 - two threads sharing the hashtable. Probably Zappa missed the win with 25...hxg6. But the resulting position with both exchange down but three connected passers seems quite volatile and therefor hard to find if you score passers not that high even for the Zappa-monster.
It might be a nice position to tune passers with respect to the remaining pieces left:
after 25..hxg6! 26. Bxg7 Bxg2 27. Rh8+ Kxg7 28. Rxa8[diag]R7/1p2ppk1/6p1/3p4/1Kn5/1R6/P1P3b1/8 b - - 0 1
[/diag]

Back to the rotated. Imho you combine "huge" lookup-tables (for the diagonals/antidiagonals) with a lot of code. For that tables you should only add/xor to get the diagonal index or shift amount to shift/mask the occupied state as index for the rotated lookup to immediatly get the attacks of thta square. Using combined 32-KByte lookups for both diagonals and antidiagonals sounds expensive to me.

Here again the compressed diagonal attack getter with nine 32-bit instructions (two imuls) and 1/2KByte for the ranklookup and 1/2 KByte for the diagonalMask. Three memory reads - most likely L1-cache candidates. How many 32-bit instructions does your rotated approach use? How many memory reads?

Code: Select all: u64 diagonalAttacks(u64 occupied, u32 sq) { union {u64 b; struct {u32 l; u32 h;};} b; // little endian! b.b = occupied & diagonalMask[sq]; b.h = ((b.l + b.h) * 0x02020202) >> 26; b.h = b.l = 0x01010101 * firstRankAttacks[b.h][sq&7]; return b.b & diagonalMask[sq]; } mov ecx,diagonalMask[esi*8] mov edx,diagonalMask+4[esi*8] and eax,edx and edi,ecx add eax,edi imul eax,eax,2020202h shr eax,1Ah mov edi,esi and edi,7 movzx eax,firstRankAttacks[edi+eax*8] imul eax,eax,1010101h and edx,eax and ecx,eax

Antidiagonals has almost the same code, despite other masks.

Regards,
Gerd

by **Chan Rasjid** » 09 Jun 2006, 18:12

Hello Gerd,

Currently I am in the midst of doing hashing with repetition dependency. I will later try your minimalistic codes and compare the result. I think my current rotated codes can be adated to get the firstRankAttack() which is needed.

Best Regards
Rasjid

by **Gerd Isenberg** » 28 Jun 2006, 21:18

The 1.5KByte lookup antirotated bitboard approach, intended as a "lower bound" reference against your actual rotated approaches, or Lasse Hansen's super antirotated hashing approach - 64-bit mode. Studying memory versus computation - inlined and cacheline aligned arrays prefered.

Happy cycle hunting,
Gerd

Code: Select all: // a1=0, b1, h8=63-mapping !! // const tables may be easily generated const u8 firstRankAttacks[64][8] = { // [occ64][file] { // occ == 0 -> b1-g1 empty 0xfe, // a1 0xfd, // b1 0xfb, // c1 0xf7, // d1 0xef, // e1 0xdf, // f1 0xbf, // g1 0x7f // h1 }, { // occ = 1 -> b1 occupied 0x02, // a1 0xfd, // b1 0xfa, // c1 0xf6, // d1 0xee, // e1 0xde, // f1 0xbe, // g1 0x7e // h1 }, ... }; const u64 diagonalMask[64] = { 0x8040201008040201, 0x0080402010080402, .... }; const u64 antiDiagMask[64] = { 0x0000000000000001, 0x0000000000000102, ... }; /* * for all getters * @param occ - the occupied bitboard * sq - square index 0..63 (a1..h8) * @return the attack set */ u64 rankAttacks(u64 occ, u32 sq) { u32 f = sq & 7; u32 r = sq & ~7; // rank * 8 u32 o = (u32)(occ >> (r + 1)) & 63; u64 g = firstRankAttacks[o][f]; return g << r; } // file attacks with two reversed rotations // 1. for the occupied state to map a2..a7 to h8..c8 // 2. for the attacks from a1..h1 to h8..h1 u64 fileAttacks(u64 occ, u32 sq) { u32 f = sq & 7; occ = 0x0101010101010101 & (occ >> f); // a-file occ = ( 0x0080402010080400 * occ ) >> 58; occ = ( 0x8040201008040201 * firstRankAttacks[occ][(sq^56)>>3] ); return ( 0x8080808080808080 & occ ) >> (f^7); } u64 diagonalAttacks(u64 occ, u32 sq) { u32 f = sq & 7; occ = ( diagonalMask[sq] & occ ); occ = ( 0x0202020202020202 * occ ) >> 58; occ = ( 0x0101010101010101 * firstRankAttacks[occ][f] ); return ( diagonalMask[sq] & occ ); } u64 antiDiagAttacks(u64 occ, u32 sq) { u32 f = sq & 7; occ = ( antiDiagMask[sq] & occ ); occ = ( 0x0202020202020202 * occ ) >> 58; occ = ( 0x0101010101010101 * firstRankAttacks[occ][f] ); return ( antiDiagMask[sq] & occ ); } u64 rookAttacks(u64 occ, u32 sq) { return rankAttacks (occ, sq) | fileAttacks (occ, sq); } u64 bishopAttacks(u64 occ, u32 sq) { return diagonalAttacks (occ, sq) | antiDiagAttacks (occ, sq); }

by **Gerd Isenberg** » 03 Aug 2006, 16:46

Same as above, but better suited for 32-bit :

Code: Select all: // a1=0, b1, h8=63-mapping !! // const tables may be easily generated const u8 firstRankAttacks[64][8] = { // [occ64][file] { // occ == 0 -> b1-g1 empty 0xfe, // a1 0xfd, // b1 0xfb, // c1 0xf7, // d1 0xef, // e1 0xdf, // f1 0xbf, // g1 0x7f // h1 }, { // occ = 1 -> b1 occupied 0x02, // a1 0xfd, // b1 0xfa, // c1 0xf6, // d1 0xee, // e1 0xde, // f1 0xbe, // g1 0x7e // h1 }, ... }; const u64 diagonalMask[64] = { 0x8040201008040201, 0x0080402010080402, .... }; const u64 antiDiagMask[64] = { 0x0000000000000001, 0x0000000000000102, ... }; /* * for all getters * @param occ - the occupied bitboard * sq - square index 0..63 (a1..h8) * @return the attack set */ u64 rankAttacks(u64 occupied, u32 sq) { union {u64 b; struct {u32 l; u32 h;};} b; // little endian! u32 f = sq & 7; u32 r = sq & 24; // 8 * folded rank if ( sq < 32 ) { b.l = (((u32)occupied) >> (r+1)) & 63; b.l = firstRankAttacks[b.l][f] << r; b.h = 0; } else { b.l = 0; b.h = (((u32)(occupied>>32)) >> (r+1)) & 63; b.h = firstRankAttacks[b.h][f] << r; } return b.b; } // file attacks with two reversed rotations // 1. for the occupied state to map a2..a7 to h8..c8 // 2. for the attacks from a1..h1 to h8..h1 u64 fileAttacks(u64 occupied, u32 sq) { union {u64 b; struct {u32 l; u32 h;};} b; // little endian! u32 f = sq & 7; b.b = occupied; b.l = (b.l >> f) & 0x01010101; b.h = (b.h >> f) & 0x01010101; b.l = (b.l << 4) + b.h; b.h = (b.l * 0x10080402) >> 26; b.l = 0x08040201 * firstRankAttacks[b.h][(sq^56)>>3]; b.h =((b.l << 4) & 0x80808080) >> (f^7); b.l = (b.l & 0x80808080) >> (f^7); return b.b; } u64 diagonalAttacks(u64 occupied, u32 sq) { union {u64 b; struct {u32 l; u32 h;};} b; // little endian! b.b = occupied & diagonalMask[sq]; b.h = ((b.l + b.h) * 0x02020202) >> 26; b.h = b.l = 0x01010101 * firstRankAttacks[b.h][sq&7]; return b.b & diagonalMask[sq]; } u64 antiDiagAttacks(u64 occupied, u32 sq) { union {u64 b; struct {u32 l; u32 h;};} b; // little endian! b.b = occupied & antiDiagMask[sq]; b.h = ((b.l + b.h) * 0x02020202) >> 26; b.h = b.l = 0x01010101 * firstRankAttacks[b.h][sq&7]; return b.b & antiDiagMask[sq]; }

by **Zach Wegner** » 22 Aug 2006, 10:51

Hello Gerd and everyone else,

Gerd Isenberg wrote:
Code: Select all
u64 rayBB = occupiedBB & rayMask[sq]; u32 fold = (u32)(rayBB) | (u32)(rayBB>>32); u32 occ64 = (fold * 0x02020202) >> (32-6);

...
Problem are the vertical files, where all masked occupied bits reside on the same file.

I was playing with these ideas, and I found, using a modified version of Walter Faxon's/Tony's magic finder, how to do a 32-bit folded lookup with one multiply:

Code: Select all: u64 rayBB = occupiedBB & fileMask[sq]; u32 fold = (u32)(rayBB) | (u32)(rayBB>>29); u32 occ64 = (fold * 0x01041041) >> (32-6);

All magics with the same lower 25 bits give the same result, and you can probably change with the 29. It's an interesting magic because each of the bits are 6 away from each other == the number of bits in the index.

Regards,
Zach

Winboard Forum

Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Who is online