Winboard Forum

by **Zach Wegner** » 15 Mar 2006, 21:42

Tord Romstad wrote:Hello Zach,

Perhaps I'm stupid, but I don't understand what you mean at all. Could you please explain this in greater detail?

Hello Tord,

Yes, that wasn't a very good explanation. What I mean is that AttackIndex returns a char now, which is multiplied by 8 and added to HorizontalAttacksArray along with the rank when looking up the bitboard, i.e. HAA[x][y]=*(HAA + x*8 + y). You could just change the type of AttackIndex to a pointer, and store HorizontalAttacksArray[x] (that is, HAA+x*8) rather than just x, so you save a multiply(actually shift) and an add. A problem, that I just realize now, is that you must have a separate AttackIndex for each direction.

Also, aren't there only 15 diagonals?

Regards,
Zach

by **Tony van Roon-Werten** » 16 Mar 2006, 08:28

Dann Corbit wrote:You can always calculate the numbers by creation of a minimal perfect hash. Do a google search for "minimal perfect hash" and it will turn up plenty of good explanations.

Actually, I think the optimization is a non-perfect hash :?:

Tony

by **Gerd Isenberg** » 16 Mar 2006, 16:16

Yes, going for the 70 distinct attacks per ray is a neat trick i was not aware of. Whether this additional indirection with a byte-array lookup is worth to dense the tables is another question with todays huge L1-caches and L1-heuristics. The greater your final structures to address are (perhaps not only attacksets but also let say some (2*2) serialized movelists), the more it safes.

Also the multiplication approach to get "rotated" occupied-states with "only" 1/2K attack index lookups for the straight directions might be applied:

Code: Select all: occ64 = ((occupiedBB & hormask[sq]) * hormagic[sq]) >> (64-6); idx70 = AttackIndexHor70[file(sq)][occ64]; // 8 * 64 * 1Byte attack = HorizontalAttacksArray[idx70][rank(sq)]; // 70*8(16)*8Byte

<Edit> oups - occ64 must have always the same mapping over all squares of one direction here, what must not be the case in the initial approach with 64 state * 64 square lookup per direction.

May be it is possible to find constant magics per direction, independent of square, i think for ranks and diagonals a multiplication with 0x0202020202020202 may be sufficient:

Code: Select all: occ64 = ((occupiedBB & hormask[sq]) * 0x0202020202020202) >> (64-6); idx70 = AttackIndexHor70[file(sq)][occ64]; // 8 * 64 * 1Byte attack = HorizontalAttacksArray[idx70][rank(sq)]; // 70*8(16)*8Byte

<Edit/>

The 70 indices already imply the source index 0..7 on that rank:
(except the pattern 01111110)

rooks on a1 (h1) have 7 distinct attack sets
00000010 0
00000110 1
00001110 2
00011110 3
00111110 4
01111110 5
11111110 6

rooks on b1 (g1) have 6 distinct attack sets:
00000101 7
00001101 8
00011101 9
00111101 10
01111101 11
11111101 12

rooks on c1 (f1) have 2*5=10 distinct attack sets:
00001010 13
00011010 14
00111010 15
01111010 16
11111010 17

00001011 18
00011011 19
00111011 20
01111011 21
11111011 22

rooks on d1 (e1) have 3*4=12 distinct attack sets:

00010100 23
00110100 24
01110100 25
11110100 26

00010110 27
00110110 28
01110110 29
11110110 30

00010111 31
00110111 32
01110111 33
11110111 34
...

so there are those 7 + 6 + 10 + 12 + 12 + 10 + 6 + 7 = 70 different attack sets per ray. What about enumerating attacksets on glued 1-7,2-6,3-5 and 4-4 diagonals? ;-)

I would also like to mention Steffan Westcott's approach to calculate 256 occupied states from a masked ray:

For a ray along a rank or diagonal, it is sufficient to collapse the files:

Code: Select all
typedef unsigned __int64 BitBoard; int collapsed_files_index(BitBoard b) { b |= b >> 32; b |= b >> 16; b |= b >> 8; return b & 0xFF; }

For a ray along a file, a little more trickery is needed:

Code: Select all
int collapsed_ranks_index(BitBoard b) { b |= b >> 4; // No masking needed b |= b >> 2; // " " b |= b >> 1; // " " return ((b & 0x0101010101010101) * 0x0102040810204080) >> 56; }

collapsed_files_index should be rather cheap in 32-bit mode, compiler may generate some assembly with 6 cheap instructions, saving >>32 and >>8:

Code: Select all: or eax, edx mov edx, eax shr eax, 16 or eax, edx or al, ah and eax, 255

The neat rotation in collapsed_ranks_index looks a bit more expensive, so that i would prefere the magic-mul.

by **Dann Corbit** » 16 Mar 2006, 20:21

Tony van Roon-Werten wrote:
Dann Corbit wrote:You can always calculate the numbers by creation of a minimal perfect hash. Do a google search for "minimal perfect hash" and it will turn up plenty of good explanations.

Actually, I think the optimization is a non-perfect hash

Tony

There are lots of ways to solve the problem. A minimal perfect hash can be calculated that will take the 128 possible bitboards as inputs and spit out a number between 0 and 127 as outputs.

If the goal is to turn the bitboard into an index, then that is one way to do it.

by **Tom Likens** » 16 Mar 2006, 20:50

Tord Romstad wrote:The following macro can now be used to compute horizontal attack bitboards (untested, bugs are possible):
Code: Select all
#define HorizontalAttacks(sq) \ HorizontalAttacksArray[AttackIndex[file(sq)][(R00bb>>(rank(sq)/8))&0xFF]][rank(sq)]

Tord,

Shouldn't the rank of the square be multiplied by 8, not divided by 8?

Code: Select all: #define HorizontalAttacks(sq) \ HorizontalAttacksArray[AttackIndex[file(sq)][(R00bb>>(8*rank(sq)))&0xFF]][rank(sq)]

regards,
--tom

by **Tony van Roon-Werten** » 17 Mar 2006, 08:42

Dann Corbit wrote:
Tony van Roon-Werten wrote:
Dann Corbit wrote:You can always calculate the numbers by creation of a minimal perfect hash. Do a google search for "minimal perfect hash" and it will turn up plenty of good explanations.

Actually, I think the optimization is a non-perfect hash

Tony

There are lots of ways to solve the problem. A minimal perfect hash can be calculated that will take the 128 possible bitboards as inputs and spit out a number between 0 and 127 as outputs.

If the goal is to turn the bitboard into an index, then that is one way to do it.

That seems to be the goal, but it actually isn't.

I have done my first experiments, and they work.

My idea was to get rid of the expensive 64 multiplication so I tried a folded deBruijn multiplication.

Unfortunately, it didn't deliver unique indexes so I had to search further. Then it stroke me that I didn't need unique indexes.

Consider a rook on C1, and pieces on B1 and D1. This delivers an index A. Now add another piece on E1. The index this delivers doesn't have to be unique. Since this bitboard is equivelent to the previous, it is also allowed to deliver index A. Add another piece on F1, same story, remove E1, same, etc ...

Theorethicly, on can do with an index range of 12

Still my folded deBruijn didn't work. I was using(unsigned int) (bb)^(bb>>32) and the problem were the masks where the upper part of bb is almost the same as the lower part. ( so basicly all full rook rank attacks ie A1-A8,B1-B8, and the diagonals.

I tried shifting them 1 file more: (unsigned int) (bb)^(bb>>33) but that didn't work either. But combining those into (unsigned int) (bb)^(bb>>32)^(bb>>33) finally worked.

Together with to optimization of allowing duplicate indexes, I don't even have to store them anymore. Calculation takes <2 secs.

For speed: My guess about the 20% loss for the 64b multiplication were correct. My engine is now 20% faster. so 600Kn/s in the middlegame on a PM 1.8

( Well, actually, that's after 30 secs, I'm doing some other experimental stuff, that needs to get into a hashtable because calculation is slow. The first 30 secs are needed to do this. I start off with <100 Kn/s, but that's because I clear this hastable when setting up a position)

Tony

by **Gerd Isenberg** » 17 Mar 2006, 11:41

Hi Tony,

well, to go to the folded multiplication to get the 64-inner occupied state of the four different directions, i suggest to have a closer look to Steffan Westcott's routines again. We were both a bit off the track - i with my De Bruijn framework for that purpose - you slightly as well with the random numbers to get almost different magic factors for different squares and rays ;-)

For rank and diagonals there is up to one bit per file in the masked ray - never more than one. Thus a 64-bit multiply of the masked rank or diagonal with 0x0101010101010101 is sufficient to "fill" up all bits to the eighth rank. With one implicite shift left one by multiply with 0x0202020202020202 we have the relevant inner six occupied bits of the masked ray at the upper six bits of the 64-bit product - so that the usual shift right 58 leaves the unique occupied state.

32-bit folding for this three rays is quite simple as well, because oring high- and low-board still leaves up to one set bit per each "compressed" file:

Code: Select all: u64 rayBB = occupiedBB & rayMask[sq]; u32 fold = (u32)(rayBB) | (u32)(rayBB>>32); u32 occ64 = (fold * 0x02020202) >> (32-6);

With the fast 3-cycle 32-bit multiplication of amd64 in 32-mode this looks competetive, specially for the diagonals. For ranks, where all occupied states are already neighboared bits, one may of course save the multiplication and simply shift the six inner occupied bits right to the six lower bits by 8*rank(sq)+1 or ((sq64 & ~7)+1):

Code: Select all: u64 rankBB = occupiedBB & rankMask[sq]; u32 occ64 = rankBB >> ((sq & 0x38)+1);

or eventually in 32-bit mode with folding:

Code: Select all: u64 rankBB = occupiedBB & rankMask[sq]; u32 fold = (u32)(rankBB) | (u32)(rankBB>>32); u32 occ64 = fold >> (sq & 0x18)+1);

Problem are the vertical files, where all masked occupied bits reside on the same file. Since we know the file, we can make specialized collapsed_ranks_index routine, already considering up to 6-inner occupied states:

Code: Select all: u64 fileBB = occupiedBB & fileMask[sq]; u32 occ64 = (fileBB * (0x0102040810204080*2)>>(sq&7)) >> (64-6);

or with a small magic-table[8]:

Code: Select all: u32 occ64 = (fileBB * magic[sq&7]) >> (64-6);

With 32-bit one may try to compute low- and high board separately to "or" the final result together:

Code: Select all: u64 fileBB = occupiedBB & fileMask[sq]; u32 lowBB = (u32)fileBB; u32 highBB = (u32)(fileBB>>32); u32 locc = (lowBB * ((0x10204080*2)>>(sq&7))); u32 hocc = (highBB * ((0x01020408*2)>>(sq&7))); u32 occ64 = (hocc + locc) >> (32-6);

This is all not completely tested and may contain bugs. But the advantage is that all occ64 for all squares per directions have the same bit-order semantic - in opposite to applying vectors of individual fitting random numbers or De Bruijns as magic factors by [sq][dir]-lookup, where the final precalulated attackset-tables must consider individual mapping of unique numbers.

One possible advantage of this might be to condense further with the trick invented by Sergej Markoff, Tord mentioned with the 70 unique attack pattern. I even think that it should be possible to glue diagonals in the same way as rotated lookups and to index diagonals by (sq-Rank(sq)) & 7) or ((~sq)-Rank(sq)) & 7).

I personally don't bother any longer on 32-bit mode, and to have probably conditional compiled code for 32- and 64-bit mode.

Cheers,
Gerd

by **Tony van Roon-Werten** » 17 Mar 2006, 13:47

Gerd Isenberg wrote:Hi Tony,

well, to go to the folded multiplication to get the 64-inner occupied state of the four different directions, i suggest to have a closer look to Steffan Westcott's routines again. We were both a bit off the track - i with my De Bruijn framework for that purpose - you slightly as well with the random numbers to get almost different magic factors for different squares and rays

For rank and diagonals there is up to one bit per file in the masked ray - never more than one. Thus a 64-bit multiply of the masked rank or diagonal with 0x0101010101010101 is sufficient to "fill" up all bits to the eighth rank. With one implicite shift left one by multiply with 0x0202020202020202 we have the relevant inner six occupied bits of the masked ray at the upper six bits of the 64-bit product - so that the usual shift right 58 leaves the unique occupied state.

32-bit folding for this three rays is quite simple as well, because oring high- and low-board still leaves up to one set bit per each "compressed" file:

Code: Select all
u64 rayBB = occupiedBB & rayMask[sq]; u32 fold = (u32)(rayBB) | (u32)(rayBB>>32); u32 occ64 = (fold * 0x02020202) >> (32-6);

With the fast 3-cycle 32-bit multiplication of amd64 in 32-mode this looks competetive, specially for the diagonals. For ranks, where all occupied states are already neighboared bits, one may of course save the multiplication and simply shift the six inner occupied bits right to the six lower bits by 8*rank(sq)+1 or ((sq64 & ~7)+1):
Code: Select all
u64 rankBB = occupiedBB & rankMask[sq]; u32 occ64 = rankBB >> ((sq & 0x38)+1);
or eventually in 32-bit mode with folding:

Code: Select all
u64 rankBB = occupiedBB & rankMask[sq]; u32 fold = (u32)(rankBB) | (u32)(rankBB>>32); u32 occ64 = fold >> (sq & 0x18)+1);

Problem are the vertical files, where all masked occupied bits reside on the same file. Since we know the file, we can make specialized collapsed_ranks_index routine, already considering up to 6-inner occupied states:
Code: Select all
u64 fileBB = occupiedBB & fileMask[sq]; u32 occ64 = (fileBB * (0x0102040810204080*2)>>(sq&7)) >> (64-6);

or with a small magic-table[8]:
Code: Select all
u32 occ64 = (fileBB * magic[sq&7]) >> (64-6);
With 32-bit one may try to compute low- and high board separately to "or" the final result together:
Code: Select all
u64 fileBB = occupiedBB & fileMask[sq]; u32 lowBB = (u32)fileBB; u32 highBB = (u32)(fileBB>>32); u32 locc = (lowBB * ((0x10204080*2)>>(sq&7))); u32 hocc = (highBB * ((0x01020408*2)>>(sq&7))); u32 occ64 = (hocc + locc) >> (32-6);

This is all not completely tested and may contain bugs. But the advantage is that all occ64 for all squares per directions have the same bit-order semantic - in opposite to applying vectors of individual fitting random numbers or De Bruijns as magic factors by [sq][dir]-lookup, where the final precalulated attackset-tables must consider individual mapping of unique numbers.

One possible advantage of this might be to condense further with the trick invented by Sergej Markoff, Tord mentioned with the 70 unique attack pattern. I even think that it should be possible to glue diagonals in the same way as rotated lookups and to index diagonals by (sq-Rank(sq)) & 7) or ((~sq)-Rank(sq)) & 7).

I personally don't bother any longer on 32-bit mode, and to have probably conditional compiled code for 32- and 64-bit mode.

Cheers,
Gerd

Hi Gerd,

I think I get the idea (not the implementation completely yet)

I also see the possibility. Once the index is uniform, one can use a small table to convert the index[File(sq)][64] to index12 ( I've explained that somewhere else ) and get a 5fold reduction of the size of the bitboards table making them 25 KB.

I'll have a test this weekend.

Cheers,

Tony

by **Tony van Roon-Werten** » 18 Mar 2006, 10:00

Hi Gerd,

what a waste

Well, at least I've learned some new things.

The bitfolding of bitboard to file works very easy. The rankattacks needed some correction, but it all works like a charm, and is about 10% faster than my folded multiplication.

I'll have a go with the further compression, but I doubt the reduction will offset the extra table lookup.

Maybe if I can replace it with a magic number ... ?

Tony

by **Gerd Isenberg** » 18 Mar 2006, 20:18

Hi Tony,
yep, sometimes it is didactically worthwhile to be off the track. Those "magics" like 0x01010101 or 0x01020408*X are nice to map masked ray bits to neighboared bits on a rank and to get almost the same occupied states as from rotated bitboards. Multiplying with a single bit or power of two is like shifting left with log2(bit). With more bits set, like M * 0x01010101 is identical to

Code: Select all: (M << 0) + (M << 8) + (M << 16) + (M << 24)

If our M is disjoint from all intermediate shifts, thus there is no bit in M that intersects in M*100H, M*10000H or M*1000000H, no bitwise overflow occurs and we can replace "add" by "(x)or" for the same result.

Since we have sometimes double or even triple pawns, we can not replace the left shifting Kogge-Stone fills

Code: Select all: BB _NorthFill(BB gen) { gen |= (gen << 8); gen |= (gen << 16); gen |= (gen << 32); return gen; }

by gen * 0x0101010101010101 to get white pawn front spans or black pawns back spans :-(

Did you manage to go with only one folded 32-bit multiplication for the file-attacks? My proposal cries for further optimization if combining low- and high-board with some shift left/right 4 before doing the "rotate" multiplication with 0x01020408.

Btw. what was wrong with the rankattacks -
despite the fact that you have probably 0x88 coordinates?

Going to the 70 distinct attack pattern sounds attractive.
A lot of different occupied pattern map to the same attack pattern, specially for close blockers where all the squares behind the blocker are not relevant. But yes, the additional indirection might be too expensive to profit from the additional cache hits.

I somehow doubt you'll find magic numbers for that purpose to skip that additional indirection ;-)

<Brainstorming>
It sounds interesting as well to address not only precalculated attacksets, but precalculated movelists for sliders - which is of course independent on how to determine the occupied raystates.

We can for instance store the serialized list of square indices in a let say fixed sized eight element vector to store up to 7 squares and number of squares attacked. Or we only use the two possible most far away squares of each opposite direction, to check whether they are occupied by a friendly piece or are a legal move/capture target square. On the edge with only one direction (or even zero for the very short diagonals), we simply store the from square which is obviously occupied by a friendly piece.

I have something like following in mind, to implement "branchless" movegen:
Let say we have a rook on d1 and we "address" following rank pattern (arithmetical h-a order):

Code: Select all: gfedcb with occ64 := 010101B attaStruct[hor][d1][occ64]: attackBB := 0..0 00110110B farthest[] := {b1,f1} movelist[0] := {"d1b1","d1c1","d1e1","d1f1", null, null, null, 4} movelist[1] := {"d1c1","d1e1","d1f1", null, null, null, null, 3} movelist[2] := {"d1b1","d1c1","d1e1", null, null, null, null, 3} movelist[3] := {"d1c1","d1e1", null, null, null, null, null, 2}

And to index the appropriate movelist with some boolean math

Code: Select all: idx = isFriendlyPiece(board[farthest[0]]) + isFriendlyPiece(board[farthest[1]]) * 2;

<Brainstorming/>

Tot ziens,

Gerd

by **Gerd Isenberg** » 19 Mar 2006, 11:41

Sometimes it takes some time. I didn't immediatly get your idea to compress the 64-state occupied pattern down to 12. I confused it with the up to 70 attacks patterns for one ray with a "from" square as zero inside those pattern. But, yes i have it now 4*3 ;-)

This is how Sergej's proposal works:

Code: Select all: occ64 = ((occupiedBB & hormask[sq]) * 0x0202020202020202) >> 58; idx70 = AttackIndexHor70[file(sq)][occ64]; // 8 * 64 * 1Byte attack = HorizontalAttacksArray[idx70][rank(sq)]; // 70*8*8Byte

This is how your idea works if i get it right:

Code: Select all: occ64 = ((occupiedBB & hormask[sq]) * 0x0202020202020202) >> 58; occ12 = dDenseOcc64To12[file(sq)][occ64]; // 8 * 64 * 1Byte attack = HorizontalAttacksArray[occ12][sq]; // 12*64*8Byte

Sergej needs 70*8*8 = 4480 bytes per AttacksArray - 17.5K for all 4 lookups(with glued diagonals). You need 12*64*8 = 6144 byte per AttacksArray or 24K for the lookups.

Cheers,
Gerd

by **Tony van Roon-Werten** » 20 Mar 2006, 11:19

Gerd Isenberg wrote:Sometimes it takes some time. I didn't immediatly get your idea to compress the 64-state occupied pattern down to 12. I confused it with the up to 70 attacks patterns for one ray with a "from" square as zero inside those pattern. But, yes i have it now 4*3

000x00
000x01
000x10

100x..

010x..

001x.. ?
This is how Sergej's proposal works:
Code: Select all
occ64 = ((occupiedBB & hormask[sq]) * 0x0202020202020202) >> 58; idx70 = AttackIndexHor70[file(sq)][occ64]; // 8 * 64 * 1Byte attack = HorizontalAttacksArray[idx70][rank(sq)]; // 70*8*8Byte
This is how your idea works if i get it right:
Code: Select all
occ64 = ((occupiedBB & hormask[sq]) * 0x0202020202020202) >> 58; occ12 = dDenseOcc64To12[file(sq)][occ64]; // 8 * 64 * 1Byte attack = HorizontalAttacksArray[occ12][sq]; // 12*64*8Byte
Sergej needs 70*8*8 = 4480 bytes per AttacksArray - 17.5K for all 4 lookups(with glued diagonals). You need 12*64*8 = 6144 byte per AttacksArray or 24K for the lookups.

Cheers,
Gerd

Yes, correct.

I think it must be possible to replace the tablelookup with 8 magic numbers ( maybe even 1)

I think the replacement must work (ie get rid off the lookup) before the compression will actually give a performance benefit.

Tony

by **Gerd Isenberg** » 20 Mar 2006, 12:48

Tony,
to further elaborate your idea, i enumerate all the six occupied piece pattern of a rank (or any other mapped ray) . The attacking sliding piece s (v in the header) is not member of the six occupied set in the 7*1 cases. There is almost one set 1-bit on both directions, the blocking piece and x bits behind don't care, sharing the same unique index. Underscores indicate the offmasked outer squares.

A range of 12 as max of all cases (idx = 0..11, arbitrary order therefor idx*) is sufficient to map all occupied states to all distinct cases. The number of (x) inside one pattern from 0..5 is the exponent to get the number of distinct states with the same hashkey.

Code: Select all: 4*3 (3*4 is reversed) v idx* nstates _000s00_ 0 1 _000s01_ 1 1 _000s1x_ 2 2 _100s00_ 3 1 _100s01_ 4 1 _100s1x_ 5 2 _x10s00_ 6 2 _x10s01_ 7 2 _x10s1x_ 8 4 _xx1s00_ 9 4 _xx1s01_ 10 4 _xx1s1x_ 11 8 sum 32 due one (s == 1) bit is invariant 5*2 (2*5 is reversed): v idx* nstates _0000s0_ 0 1 _0000s1_ 1 1 _1000s0_ 2 1 _1000s1_ 3 1 _x100s0_ 4 2 _x100s1_ 5 2 _xx10s0_ 6 4 _xx10s1_ 7 4 _xxx1s0_ 8 8 _xxx1s1_ 9 8 sum 32 due one (s == 1) bit is invariant 6*1 (1*6 is reversed) v idx* nstates _00000s_ 0 1 _10000s_ 1 1 _x1000s_ 2 2 _xx100s_ 3 4 _xxx10s_ 4 8 _xxxx1s_ 5 16 sum 32 due one (s == 1) bit is invariant 7*1 (1*7 is reversed) v idx* nstates _000000_ 0 1 _100000_ 1 1 _x10000_ 2 2 _xx1000_ 3 4 _xxx100_ 4 8 _xxxx10_ 5 16 _xxxxx1_ 6 32 sum 64 due all bits are variant

While Sergej enumarates all distinct occupied states for the 70 possible attacksets over all possible eight source squares on that ray, thus 7+6+10+12+12+10+5+7==70, and to index with [occ70][rayIndex8(sq)], you sacrifice some space for all source squares with the maximum index range of 12, to index with [occ12][sq64], which saves the rayIndex(sq) calculation.
I still doubt that it is possible to skip the additional indirection by a "supermagic" multiplication, even if we introduce further redundancy by going to a complete four bit range and a 32K attacktable in total.

Suppose we would like to do following...

Code: Select all: occ64 = ((occupiedBB & rankmask[sq]) * 0x0202020202020202) >> 58; occ12 = denseOcc64To12[file(sq)][occ64]; // 8 * 64 * 1Byte attack = HorizontalAttacksArray12_64[occ12][sq]; // 12*64*8Byte

... and now try to find vectors of "supermagics" by random or De Bruijn brute force:

Code: Select all: occ16 = ((occupiedBB & rankmask[sq]) * supermagic[sq]) >> 60; attack = HorizontalAttacksArray64_16[sq][occ16]; // 64*16*8Byte

Further suppose we have some of the above occupied pattern on the 8.rank like, with some "high" x's left of the significant "blocking 1", ...

Code: Select all: msb lsb _x10s1x_ _xx1s1x_ _xxx1s0_ _xxxx1s_ _xxxxx1_

... i see no way to liquidate the influence of the 2,4,8,16 or 32 redundant subpattern (the x-bits) to get four unique upper bits in a 64 bit product. The problem is, for the 8.rank we usually have to multiply with 01H or 02H in the least significant byte of the magic factor.

Gerd

by **Gerd Isenberg** » 20 Mar 2006, 15:31

As an upper limit for additional computation (other than Kogge-Stone fill) we can divide the lookup table size by 8 (16K total), to store only attacks on one "normalized" or "zero"-ray, while we supply some 64-bit shift left to get the attack to the right ray:

instead of

Code: Select all: occ64 = ((occupiedBB & rankmask[sq]) * 0x0202020202020202) >> 58; attack = HorizontalAttacksArray64_64[occ64][sq]; // 64*64*8Byte

one may use

Code: Select all: occ64 = ((occupiedBB & rankmask[sq]) * 0x0202020202020202) >> 58; attack = FirstRankAttacks64_8[occ64][sq&7] << (sq&~7); // 64*8*8Byte

or for fileattacks:

Code: Select all: occ64 = ((occupiedBB & filemask[sq]) * magic[sq&7]) >> 58; attack = AFileAttacks64_8[occ64][sq>>3] << (sq&7); // 64*8*8Byte

A bit more effort with (glued) diagonals, if i get it right on the fly.
I leave the antidiagonal as an exercise to the reader ;-)

Code: Select all: delta = file(sq)-rank(sq); // delta & 7 is glued diagonal index fgrmask = delta >> 31; // remember the sar-trick //shift = (~fgrmask & delta) + (fgrmask & -8*delta); shift = delta - 9*(fgrmask & delta); occ64 = ((occupiedBB & diaomask[sq]) * 0x0202020202020202) >> 58; // oups i hope i got it right now!!! this is tricky - minus shift attack = Dia0Attacks64_8[occ64][(sq-shift)&7)] << shift; attack &= diafmask[sq]; // considers glued diagonals

<Edit>while this seems correct now i suggest a bit cleaner and cheaper branchless version. if rank > file, we have to shift the normalized a1-h8 diagonal attack (rank - file) times up. Otherwise if file >= rank, we have to shift the a1-h8 diagonal attack board right (shl file-rank). The diagonal-relative source square is the minimum of rank or file of that square.

Code: Select all: delta = file - rank; minsq = rank + ((delta>>31)&delta); // min(rank,file) shift = delta - ((delta>>31)&delta)*9; occ64 = ((occupiedBB & diamask[delta+7]) * magic[delta+7]) >> 58; attack = (Dia0Attacks64_8[occ64][minsq] << shift) & diamask[delta+7];

</Edit>
Combining this with the additional 7.x occ70 or 5.x occ12 savings there is huge potential to shrink further far below 4 or even 3K, but then "perfect hashing" of multiple sliders with Kogge-Stone fill comes in mind...

by **Gerd Isenberg** » 21 Mar 2006, 14:06

This is how a combination of Tony's idea via lookup combined with the zeroray compressor might be implemented, intended for 64-bit. For the expensive looking diagonals and antidiagonals, one may expect some ipc-gain by calculating minsq, shift and occ64 independently in parallel, the mul 9 might be done with via AGU aka lea rax,[8*rax+rax].

Code: Select all: i32 delta = file - rank; u32 minsq = rank + ((delta>>31)&delta); // min(rank,file) u32 shift = delta - ((delta>>31)&delta) * 9; u32 occ64 = ((occupiedBB & diamask[delta+7]) * magic[delta+7]) >> 58; u32 occ12 = dense64To12[occ64][minsq]; u64 attack = (Dia0Attacks12x8[occ12][minsq] << shift) & diamask[delta+7];

It takes 12*8 = 96 bitboards per direction. 3KByte for all directions. The dense64To12 array might be shared for all directions with a size of 64*8 = 1/2KByte with chars, 2K with ints. For the rotated alternative we have to consider the 2*16+2*8 mask bitboards = 384Byte as well.

There are at least five alternatives of "rotated" lookups (not to mention the initial 512K with 8-bit occupied states), where memory size competes against more or less computation and additional indirections. For the pure lookup sizes 128K versus 17.5K (Sergej), 24K (Tony), 16K with zeroray and finally 3K by combining Tony's approach with zeroray.

Which is the fastest may vary from program to program for different processors and caches. Btw. why does this thread degenerate to a kind of monologue? More contrariety please ;-)

Happy coding,
Gerd

by **Tony van Roon-Werten** » 21 Mar 2006, 14:23

Gerd Isenberg wrote:This is how a combination of Tony's idea via lookup combined with the zeroray compressor might be implemented, intended for 64-bit. For the expensive looking diagonals and antidiagonals, one may expect some ipc-gain by calculating minsq, shift and occ64 independently in parallel, the mul 9 might be done with via AGU aka lea rax,[8*rax+rax].
Code: Select all
i32 delta = file - rank; u32 minsq = rank + ((delta>>31)&delta); // min(rank,file) u32 shift = delta - ((delta>>31)&delta) * 9; u32 occ64 = ((occupiedBB & diamask[sq]) * 0x0202020202020202) >> 58; u32 occ12 = dense64To12[occ64][minsq]; u64 attack = (Dia0Attacks12x8[occ12][minsq] << shift) & diamask[sq];
It takes 12*8 = 96 bitboards per direction. 3KByte for all directions. The dense64To12 array might be shared for all directions with a size of 64*8 = 1/2KByte with chars, 2K with ints. For the rotated alternative we have to consider the 4*64 mask bitboards = 2KByte as well.

There are at least five alternatives of "rotated" lookups (not to mention the initial 512K with 8-bit occupied states), where memory size competes against more or less computation and additional indirections. For the pure lookup sizes 128K versus 17.5K (Sergej), 24K (Tony), 16K with zeroray and finally 3K by combining Tony's approach with zeroray.

Which is the fastest may vary from program to program for different processors and caches. Btw. why does this thread degenerate to a kind of monologue? More contrariety please

Happy coding,
Gerd

No need,

you explain my ideas very well

Did I have some more ideas ?

Tony

by **Gerd Isenberg** » 21 Mar 2006, 14:51

Tony van Roon-Werten wrote:No need,

you explain my ideas very well

Did I have some more ideas ?

Tony

yes of course, but most are secret - or i need a catalyzer.
What comes in mind for instance is quadbitboard stuff in conjunction with multiplexed SSE2-Kogge-Stone occluded fills, aka generating up to 15 fills with a two XMM-register generators. Another interesting idea is an analogue evaluation with very fast ?ops getting complex eval scores as output voltage in some pico seconds - for interactive tuning (during the game?) i can imagine a moog-synthesizer like panel or patch board with a lot of potentiometers and interelectrode capacitances where laying of hands may help if your king is in some danger ;-)

Gerd

by **Tony van Roon-Werten** » 21 Mar 2006, 15:19

Voice recognition and shouting to your machine seems easier and more fun.

Tony

by **Gerd Isenberg** » 21 Mar 2006, 15:33

Yes, why not - but results get a bit less deterministic ;-)

http://sprott.physics.wisc.edu/PICKOVER/pc/moogrc.html

Here you can see fotos of some analogue chess computer prototypes, including vocoder aka voice recognizer. The new mainframe model (lower left foto) consist the new moog king safety module (lower left cabinet) and the revolutionary analogue quiscence search (upper left cabinet) with it's superfast static exchange evaluators.

The next generation will even contain a "zero time" four ply full minimax-search in analogue hardware - so lightning fast, that the lack of a hashtable don't cares. As keyboards are usually intended to enter binary digits, musical chess players may also be able to interact with the apparatus on some higher abstraction level.

by **Gerd Isenberg** » 21 Mar 2006, 16:52

And finally the (corrected) Markoff compressor in conjunction with zeroray:

Code: Select all: i32 delta = file - rank; u32 minsq = rank + ((delta>>31)&delta); // min(rank,file) u32 shift = delta - ((delta>>31)&delta) * 9; u32 occ64 = ((occupiedBB & diamask[delta+7]) * magic[delta+7]) >> 58; u32 occ70 = dense512To70[occ64][minsq]; u64 attack = (seventyDia0Attacks[occ70] << shift) & diamask[delta+7];

70 Bitboards per direction, 2240 bytes in total.

Winboard Forum

Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Re: Compact Bitboard Attacks

Who is online