Winboard Forum

by **Onno Garms** » 23 Sep 2007, 19:57

Hello,

almost a year ago, Pradu and others suggested a new way to compute attack bitboards: magic multipliers. In the meantime it got quiet about this topic. As far as I remember the discussions, there was the suggestion to use precomputed move lists together with magic, but nobody posted how to do this.

Precomputed move lists obviously need enormous amounts of memory, so let's consider the easier task to count moves by magic (useful for eval).

The obvious way to do this is:
1. get attack bitboard by magic
2. remove own pieces from attack bitboard
3. do a popcount

Unfortunately, this is fairly slow. Are there better ways to count moves with bitboards (magic or non-magic)?

In above approach, if we replace the popcount by another magic lookup
- we have a different pattern to look up than in step one, so we do a new computation of an index and for the lookup we access memory at a diffent address
- the squares at the border of the board are relevant, so the masks have up to four more bits.

Can these problems be solved?

by **Edsel Apostol** » 24 Sep 2007, 03:26

Hi Onno,

The new version of Glaurung is already using a pre-computed table for computing mobility. It is using magic method for indexing. He is using the same index but accessing a different table for mobility.

I read from one of his posts that it is a little efficient that way but it is a little inaccurate.

Public release of Crafty also is doing this, though it is in rotated bitboards style.

I have tried it also using the kindergarten bitboards approach and based on my tests, I doubt if the speed-up compensates for the inaccuracy of computing the mobility. I am using now the one with popcount.

Edsel Apostol (Twisted Logic author)

by **Pradu** » 24 Sep 2007, 09:39

Edsel Apostol wrote:I have tried it also using the kindergarten bitboards approach and based on my tests, I doubt if the speed-up compensates for the inaccuracy of computing the mobility. I am using now the one with popcount.

I use a pop-count for mobility as well because like you said it'll be more accurate because you can modify the attacks bitboards and count different types of moves separately.

by **Grant Osborne** » 24 Sep 2007, 11:31

Hi

Just thinking out loud, pre-computed moves could be stored in an 'int' with each bit that is set representing an offset from the 'from' square dividing the 'int' into 4 sections for each direction.

e.g. 00000010 00001110 00000000 01111110 = 0x020E007E
minus 8 plus 8 minus 1 plus 1

Then maybe the rook moves, for example, could be extracted like this

Code: Select all: int moveList = rMagic(square); for (i = 0; i < 32; i = i + 8) { moves = (moveList >> i) & 255; switch (i >> 3) { case 0: for (j = 1; j < 8; j++) { if (moves & (1 << j)) storeMove(rook, from, from + j); else break; } break; case 1: for (j = 1; j < 8; j++) { if (moves & (1 << j)) storeMove(rook, from, from - j); else break; } break; case 2: for (j = 1; j < 8; j++) { if (moves & (1 << j)) storeMove(rook, from, from + (j << 3)); else break; } break; case 3: for (j = 1; j < 8; j++) { if (moves & (1 << j)) storeMove(rook, from, from - (j << 3)); else break; } break; } }

Would this not half the memory required, or am I missing something?

Grant

by **Gerd Isenberg** » 24 Sep 2007, 21:34

Hi Grant,

how do you handle blocked pieces are own or capture targets? Another condition?

What about one byte only for both directions of one ray?
Another factor two less memory...

01101000 01110111

Code: Select all: file = from & 7; rank = from >> 3; for (j = file + 1; j < 8; j++) { if (rankMoves & bit[j]) storeMove(rook, from, 8*rank + j); else break; } for (j = file - 1; j >= 0; j--) { if (rankMoves & bit[j]) storeMove(rook, from, 8*rank + j); else break; } for (j = rank + 1; j < 8; j++) { if (fileMoves & bit[j]) storeMove(rook, from, 8*j + file); else break; } for (j = rank - 1; j >= 0; j--) { if (fileMoves & bit[j]) storeMove(rook, from, 8*j + file); else break; }

But seriously, the classical mailbox approach traversing precalculated moves by nextsquare and nextdir pointers/indices looks cheaper with some less memory. There you have the condition of a blocked square to go to nextdir (instead of nextsq) and if the square is blocked, whether it is an opponent piece as capture target.

Bitboard movegen is about traversing disjoint sets, e.g. only capture moves for a fast cutoff or inside qsearch.

Gerd

by **Onno Garms** » 25 Sep 2007, 18:42

Edsel Apostol wrote:The new version of Glaurung is already using a pre-computed table for computing mobility. It is using magic method for indexing. He is using the same index but accessing a different table for mobility.

Interesting. Unfortunately I only have 1.2.1. The Glaurung website is down and this seems to last longer. Is the recent Glaurung available somewhere?

Otherwise, could you email me the sources to <mysurname><at><gmx><dot><German TLD>.
I think they are GPL'd, so this is legal.

Edit: I found Glaurung source, no need to email.
http://packages.qa.debian.org/g/glaurung.html

Edit2: For sliders, Glaurung seems to count only non-captruring moves. This can be done by looking up in a precomputed table of popcounts of the attack bitboards.

by **Tord Romstad** » 26 Sep 2007, 10:53

Onno Garms wrote:The Glaurung website is down and this seems to last longer.

That's right. It will be back soon.

Edit: I found Glaurung source, no need to email.
http://packages.qa.debian.org/g/glaurung.html

Great!

Edit2: For sliders, Glaurung seems to count only non-captruring moves.

This is not entirely correct: Glaurung counts attacked squares, empty or not. This is a very poor form of mobility evaluation, because it does not distinguish between squares occupied by friendly and enemy pieces. It is basically just a cheap surrogate for real mobility; the only reason I do it like this is because I haven't found a fast way to compute mobility with bitboards. Mobility evaluation seems to be one of the areas where bitboards are still behind other board representations.

Tord

by **Gerd Isenberg** » 26 Sep 2007, 18:17

Tord Romstad wrote:Mobility evaluation seems to be one of the areas where bitboards are still behind other board representations.

Hi Tord,

how do other board representations compute mobility more efficiently?

My favorite technique is still based on bit[64]*byte[64] simd-dot-product. bit[64] is an masked attack-set (e.g. no opponent pawn-attacks and no immobile own pawns) and byte[64] is a weight matrix or attack-square-table for various pieces. One may even use an array of precalculated weight-matrices, indexed by king-placements and even by some pawn-structure enums from pawn-hash. Not to mention fill-stuff to get trapped pieces or progressive mobility.

Gerd

by **Pradu** » 27 Sep 2007, 01:13

Tord Romstad wrote:Mobility evaluation seems to be one of the areas where bitboards are still behind other board representations.

Future processors will also have the SSE4 POPCNT(pdf-page 161) instruction. Apparently AMD's K10 will also have popcnt but I'm not too sure on this one.

by **Edsel Apostol** » 27 Sep 2007, 03:37

Hi Gerd,

Can you elaborate further with your method? I am searching for a faster way to evaluate mobility and it seems your implementation is always the best when it comes to anything related to bitboards.

This is a little out of topic, but I just want to insert it here as I don't want to create another thread. Is the routine "bswap" included in the C libraries? If not, how am I going to access it. I can create a simple routine in C equivalent to this bswap but I doubt if it would be fast.

It looks like this:

static const u64 m1 = 0x5555555555555555ULL;
static const u64 m2 = 0xAAAAAAAAAAAAAAAAULL;
static const u64 m3 = 0x3333333333333333ULL;
static const u64 m4 = 0xCCCCCCCCCCCCCCCCULL;
static const u64 m5 = 0x0F0F0F0F0F0F0F0FULL;
static const u64 m6 = 0xF0F0F0F0F0F0F0F0ULL;
static const u64 m7 = 0x00FF00FF00FF00FFULL;
static const u64 m8 = 0xFF00FF00FF00FF00ULL;
static const u64 m9 = 0x0000FFFF0000FFFFULL;
static const u64 m10 = 0xFFFF0000FFFF0000ULL;
static const u64 m11 = 0x00000000FFFFFFFFULL;
static const u64 m12 = 0xFFFFFFFF00000000ULL;

u64 bswap(u64 b){
b = ((b&m1) << 1) | ((b&m2) >> 1);
b = ((b&m3) << 2) | ((b&m4) >> 2);
b = ((b&m5) << 4) | ((b&m6) >> 4);
b = ((b&m7) << 8) | ((b&m8) >> 8);
b = ((b&m9) << 16) | ((b&m10) >> 16);
return ((b&m11) << 32) | ((b&m12) >> 32);
}

Edsel Apostol

by **Onno Garms** » 27 Sep 2007, 06:45

Edsel Apostol wrote:Is the routine "bswap" included in the C libraries? If not, how am I going to access it. I can create a simple routine in C equivalent to this bswap but I doubt if it would be fast.

In my Linux distribution, there is a function called bswap_64 available in the header byteswap.h. However, unlike your function, it swaps only the bytes (mirrors bitboard between row 4 and row 5). It does not swap the bits inside the bytes (rotate bitboard by 180 degree).

In Visual Studio, the equivalent to the Linux bswap_64 is called _byteswap_uint64.

by **Gerd Isenberg** » 27 Sep 2007, 08:17

Edsel Apostol wrote:Hi Gerd,
Can you elaborate further with your method? I am searching for a faster way to evaluate mobility and it seems your implementation is always the best when it comes to anything related to bitboards.

This is a little out of topic, but I just want to insert it here as I don't want to create another thread. Is the routine "bswap" included in the C libraries? If not, how am I going to access it. I can create a simple routine in C equivalent to this bswap but I doubt if it would be fast.

Hi Edsel,

Thanks, but the solution is not portable, as it uses SSE2 (MMX also possible). Parallel brute force, expanding 64-bits to 64-bytes in four xmm-registers. I guess similar is possible with other SIMD architectures like ALTIVEC.

Here the version using intrinsics:
http://64.68.157.89/forum/viewtopic.php ... 7dc#141852

90 degree rotated weights (swapped rows and columns) are a bit cheaper. Here an inlined asm version to get the idea. Expanding or extending bits to bytes. Copy one bitboard to eight in four xmm-registers. Mask the consecutive bits and compare it with all eight bits of a byte each. The dot product is finally done by anding {-1,0}-byte masks with the weights and finally to add all 64-bytes together. Weights are unsigned here - signed weights work similar except the final horizontal add.

Code: Select all: int dotProductRotated(u64 bb, u8 *rorWeights) { static const u64 CACHE_ALIGN masks[8] = { 0x0101010101010101, 0x0202020202020202, 0x0404040404040404, 0x0808080808080808, 0x1010101010101010, 0x2020202020202020, 0x4040404040404040, 0x8080808080808080, }; __asm { movq xmm0, [bb] lea edx, [masks] mov eax, [rorWeights] // extend bits to bytes punpcklqdq xmm0, xmm0 pxor xmm4, xmm4 ; zero movdqa xmm1, xmm0 movdqa xmm2, xmm0 movdqa xmm3, xmm0 pandn xmm0, [edx+0*16] pandn xmm1, [edx+1*16] pandn xmm2, [edx+2*16] pandn xmm3, [edx+3*16] pcmpeqb xmm0, xmm4 pcmpeqb xmm1, xmm4 pcmpeqb xmm2, xmm4 pcmpeqb xmm3, xmm4 // multiply by "and" with -1 or 0 pand xmm0, [eax+0*16] pand xmm1, [eax+1*16] pand xmm2, [eax+2*16] pand xmm3, [eax+3*16] // add all bytes (with saturation) paddusb xmm0, xmm1 paddusb xmm0, xmm2 paddusb xmm0, xmm3 psadbw xmm0, xmm4 ; horizontal add 2 * 8 byte pextrw edx, xmm0, 4 ; extract both intermediate sums to gp movd eax, xmm0 add eax, edx ; final add } }

Whether this methods pay off? Due to the flexibility there is a great chance to miss-use it with wrong tuned weights. Also considering attacks to own pieces (xrays) - must not that bad.

On bswap (flip bytes, not reverse bitboards) - yes there are inline asm "intrinsics" in appropriate linux headers as well. For other than x86/64 architectures - there is a parallel prefix shift version as well, but apparently to expensive. Is suggest a separte header with such conditional stuff...

Code: Select all: #ifdef ..VC __forceinline u64 flip(u64 b) {return _byteswap_uint64(b);} #else inline u64 flip(u64 b) {return bswap_64(b);} #endif u64 diagonalAttacks(u64 occ, u32 sq) { u64 forward, reverse; forward = occ & smsk[sq].diagonalMaskEx; reverse = flip(forward); forward -= smsk[sq].bitMask; reverse -= flip(smsk[sq].bitMask); forward ^= flip(reverse); forward &= smsk[sq].diagonalMaskEx; return forward; } u64 antiDiagAttacks(u64 occ, u32 sq) { u64 forward, reverse; forward = occ & smsk[sq].antidiagMaskEx; reverse = flip(forward); forward -= smsk[sq].bitMask; reverse -= flip(smsk[sq].bitMask); forward ^= flip(reverse); forward &= smsk[sq].antidiagMaskEx; return forward; } u64 bishopAttacks(u64 occ, u32 sq) { return diagonalAttacks (occ, sq) + antiDiagAttacks (occ, sq); }

Gerd

by **Tord Romstad** » 27 Sep 2007, 12:13

Gerd Isenberg wrote:Hi Tord,

how do other board representations compute mobility more efficiently?

With my old mailbox board representation, I simply looped over all attacked squares, while counting all empty squares and squares occupied by enemy pieces. At the end, I subtracted the number of attacked squares occupied by friendly pawns. This was quite fast. The best way I have found to compute this with bitboards is something like this:

Code: Select all: Bitboard b = pos.rook_attacks(sq); mob = count_1s(b & (pos.empty_squares() | pos.pieces_of_color(them))); mob -= count_1s(b & pos.pawns_of_color(us));

This is very slow, because of the need for sliding attack generation and to calls to count_1s().

My favorite technique is still based on bit[64]*byte[64] simd-dot-product. bit[64] is an masked attack-set (e.g. no opponent pawn-attacks and no immobile own pawns) and byte[64] is a weight matrix or attack-square-table for various pieces. One may even use an array of precalculated weight-matrices, indexed by king-placements and even by some pawn-structure enums from pawn-hash. Not to mention fill-stuff to get trapped pieces or progressive mobility.

For that sort of stuff, bitboards may indeed be advantageous -- but I am too stupid to make effective use of such complicated evaluation terms (I know this from experience; I used to do things like what you describe back when my program was 500 Elo points weaker than today). I wouldn't use a very complicated mobility evaluation even if I could get it with zero CPU time. Keeping it simple and straightforward works best for me.

Tord

by **Tord Romstad** » 27 Sep 2007, 12:21

Pradu wrote:
Tord Romstad wrote:Mobility evaluation seems to be one of the areas where bitboards are still behind other board representations.
Future processors will also have the SSE4 POPCNT(pdf-page 161) instruction.

Some future processors, maybe. I'm trying to write a general chess program, not just a chess program for a particular CPU or family of CPUs. Future CPUs are particularly uninteresting to me, because nobody uses them yet. If I wanted to optimize my program for a particular CPU, I would choose a CPU which was state of the art one or two years ago, because this is the type of CPU the average computer user is likely to own. Most people don't upgrade or replace their computers every time a new and faster CPU is released.

Tord

by **Gerd Isenberg** » 27 Sep 2007, 16:55

Tord Romstad wrote:With my old mailbox board representation, I simply looped over all attacked squares, while counting all empty squares and squares occupied by enemy pieces. At the end, I subtracted the number of attacked squares occupied by friendly pawns. This was quite fast. The best way I have found to compute this with bitboards is something like this:
Code: Select all
Bitboard b = pos.rook_attacks(sq); mob = count_1s(b & (pos.empty_squares() | pos.pieces_of_color(them))); mob -= count_1s(b & pos.pawns_of_color(us));

This is very slow, because of the need for sliding attack generation and to calls to count_1s().

I see - tiny loops versus getting attack-sets and popcount aka dot-product. And if the popcount is implemented loopwise - I agree about the efficiency. That is why a native popcnt instruction is so desired. In your sample - you really don't need two population counts, but only one with the "right" set (Edit: oups that is wrong of course, two disjount sets to count) - another idea is to count attacks of all bishops/knights with the odd/major trick.

This is almost 1st order stuff and one will have some 2nd order stuff considering interactions, candidates, passers and kingsafety etc. anyway. I remember from Aron Nimzowitsch's "My System" - that it is important to cover own strong squares or pawns and pieces multiple times - or to attack weak opponent squares and pawns, but to defend own weak pawns, except the base of a pawn-chain - yields in passive positions.

For that sort of stuff, bitboards may indeed be advantageous -- but I am too stupid to make effective use of such complicated evaluation terms (I know this from experience; I used to do things like what you describe back when my program was 500 Elo points weaker than today). I wouldn't use a very complicated mobility evaluation even if I could get it with zero CPU time. Keeping it simple and straightforward works best for me.

I don't think that you are too stupid - keeping things simple is a promising, pragmatical approach to get a smoothed, maintainable evaluation without too much noise and redundancies.

Gerd

by **Harald Johnsen** » 28 Sep 2007, 08:14

Do you really need the 2 popcount ? Here is what I am doing, is my code wrong ?

Code: Select all: // we will compute mobility thru our own pieces but not thru our pawns // and not thru king so trapped rook does not think it has some mobility bitboard xrayblocker = pcon->pieces[PAWN|side] | pcon->pieces[KING|side]; // - Rook - for (pc = pcon->pieces[ROOK|side]; pc; ) { sqr = PopLSB(&pc); moves = Rmagic(sqr, pcon->pieces[B_OCCU|xside]|xrayblocker) &~ xrayblocker; // dynamicVal.attacks |= moves; dynamicVal.c_move += popCount(moves); // dynamicVal.c_offense += popCount(moves & offenseMap); // dynamicVal.c_king += popCount(moves & kingSquare); }

HJ.

by **Gerd Isenberg** » 28 Sep 2007, 10:09

Harald Johnsen wrote:Do you really need the 2 popcount ?
HJ.

While most count the "safe" squares, not occupied by own pawns and pieces, Tord's idea is to even subtract attacked squares occupied by own pawns, to get an even better idea of trapped pieces or bad bishops. One scheme might be as wrong as the other, but nevertheless Tord's idea looks interesting - and one may eventually improve it. Considering the movability of the blocking pawns or whether they are isolated or base of a chain etc.

A rook behind an own passer or candidate is most likely good.
A rook on e1 behind an own rammed isolani e4 is weak and passive.
But if the rammed e4-pawn is not isolated but base of a chain e4:f5 or e4:d5, the e1-rook protecting the base is fine again. Whether this should considerd in a moblity term is another question.

Gerd

Winboard Forum

Magic and precomputation

Magic and precomputation

Re: Magic and precomputation

Re: Magic and precomputation

Re: Magic and precomputation

Re: Magic and precomputation

Re: Magic and precomputation

Re: Magic and precomputation

Re: Magic and precomputation

Re: Magic and precomputation

Re: Magic and precomputation

Re: Magic and precomputation

Re: Magic and precomputation

Re: Magic and precomputation

Re: Magic and precomputation

Re: Magic and precomputation

Re: Magic and precomputation

Re: Magic and precomputation

Who is online