What about further compressing?
70 stored attack bitboards on all ranks - for all rays, where collapsed_files_index might applied, thus for diagonals, antidiagonals and ranks.
Lets take the "up-shifted" a2-g8 diagonal with bishop on d5, blocker on b3 to clarify the required steps. The mapped square index (minsq) of d5 (file 3, rank 4) on the a2-g8 diagonal is 3, the minimum of rank, file:
1. mask the whole diagonal, occupiedBB & diamask[rayidx]
- Code: Select all
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
2. fill up with one left shift (board right), mul 0x0202020202020202
- Code: Select all
0 0 1 0 1 0 0 0 <== six upper bits
0 0 1 0 1 0 0 0
0 0 1 0 1 0 0 0
0 0 1 0 1 0 0 0
0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
3.take the upper six bit as occ64, consider the "mirrored" board for arithmetical order:
==> occ64 => 000101B == 5
4. get the arbitrary occ70 by 64*8 byte lookup with [occ64][minsq] ==> occ70 = dense512To70[5][3]
5. get one of the 70 rank attack bitboards on _all_ ranks by [occ70] bitboard lookup.
seventyAttacksOnAllRanks[occ70]:
- Code: Select all
0 1 1 0 1 1 1 1
0 1 1 0 1 1 1 1
0 1 1 0 1 1 1 1
0 1 1 0 1 1 1 1
0 1 1 0 1 1 1 1
0 1 1 0 1 1 1 1
0 1 1 0 1 1 1 1
0 1 1 0 1 1 1 1
6. finally "and" diamask again to get the a2-g8-attacks of our bishop on d5
- Code: Select all
0 0 0 0 0 0 1 0
0 0 0 0 0 1 0 0
0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0
0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
The horizontal shifted diagonals, where file is greater than rank, require an additional shift step. Lets try the b1-h7 diagonal with bishop on e4 and blocker on c2 (to get the "same" occupied state). The mapped square index (minsq) of e4 (file 4, rank 3) on the a2-g8 diagonal is 3, the minimum of rank, file:
1. mask the whole diagonal, occupiedBB & diamask[rayidx]
- Code: Select all
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0
2. fill up considering the left shifted (board right!) diagonal, thus we have to multiply with 0x0101010101010101 which is magic[rayidx] here, to get the occupied state on the virtually mapped a1-h8 diagonal. For even more shifted diagonals, we need to mul with 0x0080808080808080 and further right shifts, so that it we loose some "lower" bits, which has however no influence on the upper six bits.
- Code: Select all
0 0 1 0 1 0 0 0 <== six upper bits
0 0 1 0 1 0 0 0
0 0 1 0 1 0 0 0
0 0 1 0 1 0 0 0
0 0 1 0 1 0 0 0
0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0
3.take the upper six bit as occ64, consider the "mirrored" board for arithmetical order:
==> occ64 => 000101B == 5, as expected, the relative same state as in the a2-g8 sample.
4. get the arbitrary occ70 by 64*8 byte lookup with [occ64][minsq] ==> occ70 = dense512To70[5][3]
5. get one of the 70 rank attack bitboards on _all_ ranks by [occ70] bitboard lookup.
seventyAttacksOnAllRanks[occ70], of course also equal to the a2-g8 sample:
- Code: Select all
0 1 1 0 1 1 1 1
0 1 1 0 1 1 1 1
0 1 1 0 1 1 1 1
0 1 1 0 1 1 1 1
0 1 1 0 1 1 1 1
0 1 1 0 1 1 1 1
0 1 1 0 1 1 1 1
0 1 1 0 1 1 1 1
6. since the b1-h7 diagonal was virtually shifted right (board left) to the a1-h8 diagonal, we have to shift left (board right) before applying the final mask:
- Code: Select all
1 0 1 1 0 1 1 1
1 0 1 1 0 1 1 1
1 0 1 1 0 1 1 1
1 0 1 1 0 1 1 1
1 0 1 1 0 1 1 1
1 0 1 1 0 1 1 1
1 0 1 1 0 1 1 1
0 0 1 1 0 1 1 1
7. finally "and" diamask again to get the b1-h7-attacks of the bishop on e4
- Code: Select all
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1
0 0 0 0 0 0 1 0
0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0
0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0
Following slightly simplified code for diagonals, saves some additional space because seventyAttacksOnAllRanks[70] may be shared by at least diagonals and antidiagonals:
- Code: Select all
u32 file = sq64 & 7;
u32 rank = sq64 >> 3;
i32 delta = file - rank;
u32 shift = delta & ~(delta>>31); // only > 0 for the b1-h7 ..h1-h1 diagonals
u32 rayidx = delta + 7; // 0..14
u32 minsq = file - shift; // min(rank,file)
u32 occ64 = ((occupiedBB & diamask[rayidx]) * magic[rayidx]) >> 58;
u32 occ70 = dense512To70[occ64][minsq];
u64 attack = (seventyAttacksOnAllRanks[occ70] << shift) & diamask[rayidx];
Let's count pure instructions to have a vague impression how it compares with two Kogge-Stome fills:
4 64-bit loads, assuming all fast L1-cache hits
8 32-bit instructions &>>->>+~&-
5 64-bit instructions &*>><<& where mul is ~4 cycles
17 instructions in total with some "parallel" gain.
Direct attack calculation with Kogge-Stone needs 32 64-bit instructions.
- Code: Select all
u64 soWestOne(u64 b) {return (b>>9) & notH;}
u64 noEastOne(u64 b) {return (b<<9) & notA;}
u64 soWestOccl(u64 gen, u64 pro) {
pro = pro & notH;
gen |= pro & (gen >> 9);
pro = pro & (pro >> 9);
gen |= pro & (gen >> 18);
pro = pro & (pro >> 18);
gen |= pro & (gen >> 36);
return gen;
}
u64 noEastOccl(u64 gen, u64 pro) {
pro = pro & notA;
gen |= pro & (gen << 9);
pro = pro & (pro << 9);
gen |= pro & (gen << 18);
pro = pro & (pro << 18);
gen |= pro & (gen << 36);
return gen;
}
So the compressed attack lookup still looks favourable, even for the "complicated" diagonals.
But of course Kogge-Stone may fill several sliders or pseudo sliders in one run, which of course requires a complete different bitboard infrastructure.
Also Kogge-Stone with enough 64-bit (or SIMD) registers available, might gain a much higher ipc, if scheduling multiple directions in parallel with pure register calculation and might hide some pending memory accesses.
Kogge-Stone is nice to get pinned pieces, covered checker and the union of all sliding attacks, not to mention progressive attacks (all attacks in two moves) and looking for trajectories. Also with simd (mmx,sse2), we can apply the cheap occ ^ (occ - 2*rooks) trick for one of eight directions.
As already said for the lookups, whether the compressions are worth, depends a lot...
With fast multiplication we can get rid of the rotated bitboards and their incremental update. We only need one occupied bitboard. With the compression and some additional computation a ressource-friendly approach with less than 4KByte is possible.
Cheers,
Gerd