Winboard Forum

by **Eduardo Waghabi** » 18 Jul 2005, 22:28

Hi there,

I was messing around with the TSCP inline code for the first bit set:

Code: Select all: __inline int first_one(u64 bits) { __asm { mov edx, 0 bsf eax, dword ptr bits jnz done mov edx, 32 bsf eax, dword ptr bits+4 jnz done mov eax, 32 done: add eax, edx } }

and trying to port it into GCC inline asm code so I could compile it under Cygwin's GCC. This is what I've got:

Code: Select all: int first_one(TBitBoard bitboard) { int idxBit = -1; asm(" bsfl %1, %0 \n\t" // bsf eax, dword ptr b " jnz done \n\t" // jnz done " movl $0x20, %2 \n\t" // mov edx, 32 " bsfl 4(%1), %0 \n\t" // bsf eax, dword ptr [b+4] " jnz done \n\t" // jnz done " movl $0x20, %0 \n\t" // mov eax, 32 "done: add %2 , %0 \n\t" // add edx, eax : "=r" (idxBit) // out reg: %0=idxBit : "r" (bitboard), "r" (0) // in regs: %1=bitboard, %2=0 : "cc" // I mess with conditions codes ); return idxBit; }

Quite obviously it didn't work. Now can anyone please tell me why?

Thanks!

by **Alessandro Scotti** » 18 Jul 2005, 22:55

This is what I use in Kiwi:

Code: Select all: inline int bitScanForward( Uint64 data ) { int bit; asm( " testl %%eax, %%eax \n\t" " jz 1f \n\t" " bsf %%eax, %0 \n\t" " jmp 2f \n\t" "1: \n\t" " movl $-1, %0 \n\t" " testl %%edx, %%edx \n\t" " jz 2f \n\t" " bsf %%edx, %0 \n\t" " addl $32, %0 \n\t" "2: \n\t" : "=&r" (bit) : "A" (data) ); return bit; }

This function returns -1 if no bit is set. (Eventually, I decided to compile without inline assembly though.)
Recently, Dann Corbit showed me a great "asm-free" implementation:

Code: Select all: /* >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Matt Taylor's implementation of the de Bruijn bitscan: >From a post on CCC: http://chessprogramming.org/cccsearch/ccc.php?art_id=306789 Subject : Re: There is huge potential to improve further Posted by : Matt Taylor on July 16, 2003 at 22:44:43 Which is indirectly a reference to this work: "Using de Bruijn Sequences to Index a 1 in a Computer Word" Charles E.Leiserso Harald Prokop Keith H, Randall MIT Laboratory for Computer Science, Cambridge, MA 02139, USA July 7, 1998 http://supertech.csail.mit.edu/papers/debruijn.ps <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< */ const int lsz64_tbl[64] = { 0, 31, 4, 33, 60, 15, 12, 34, 61, 25, 51, 10, 56, 20, 22, 35, 62, 30, 3, 54, 52, 24, 42, 19, 57, 29, 2, 44, 47, 28, 1, 36, 63, 32, 59, 5, 6, 50, 55, 7, 16, 53, 13, 41, 8, 43, 46, 17, 26, 58, 49, 14, 11, 40, 9, 45, 21, 48, 39, 23, 18, 38, 37, 27, }; int BitScan_MT_deBruijn (Bitboard bb) { const Bitboard lsb = (bb & -(long long) bb) - 1; const unsigned int foldedLSB = ((int) lsb) ^ ((int) (lsb >> 32)); return lsz64_tbl[foldedLSB * 0x78291ACF >> 26]; }

Note that this version won't work with empty (zero) bitboards, and I have a really hard time trying to understand *how* it works in all the remaining cases!

by **Gian-Carlo Pascutto** » 19 Jul 2005, 00:25

It's basically a perfect hash.

You transform the input to the same input but with only 1 bit set, then you map the 64 possible values to 1..64.

The construction of the hash with multiply+shift is some neat mathematical theory.

by **Dann Corbit** » 19 Jul 2005, 00:30

The postscript paper (upthread) explains it.

It is easy to construct a perfect hash.

It is not easy to create a perfect hash that is so cheap to compute.

by **Pradu** » 19 Jul 2005, 00:51

You probably want a fast C implementation of a population count function as well:

Code: Select all: /*iterativePopCount() *an iterative population count of 1 bits in a quadword *this population count is better than the regular population count if the *number of 1 bits is sparce (usually less than 5 bits) * *@param b - the quadword to be counted *@returns the number of 1 bits in b */ int iterativePopCount(U64 b) { int n; for (n = 0; b != 0; n++, b &= (b - 1)); return n; }

An explanation of the iterative population count is can be found here.

Basically b &= (b - 1) clears the first bit and it is trivial from there.

And here is a non-interative population count for densely populated quadwords from Hacker's Memory collection of programming tips:

Code: Select all: /*popCount() *a noniterative population count of 1 bits in a quadword * *@param b - the quadword to be counted *@returns the number of 1 bits in b */ #define m1 0x5555555555555555ULL #define m2 0x3333333333333333ULL int popCount(const U64 b) { int n; const U64 a = b - ((b >> 1) & m1); const U64 c = (a & m2) + ((a >> 2) & m2); n = ((int) c) + ((int) (c >> 32)); n = (n & 0x0F0F0F0F) + ((n >> 4) & 0x0F0F0F0F); n = (n & 0xFFFF) + (n >> 16); n = (n & 0xFF) + (n >> 8); return n; }

Both of these work fine for me.

by **Eduardo Waghabi** » 19 Jul 2005, 03:29

Many thanks for the feedback. The perfect hash trick is really neat.

As for my code, I eventually figured out I should pass &bitboard as opcode %1, as I was using the ptr operand (). I've changed the functions, to make it a little bit mine, so for who may care:

FirstBit:

Code: Select all: int BitMenosSignificativo(TBitBoard bitboard) { int idxBit = -1; asm(" bsfl (%1), %0 \n\t" " jnz ebsf \n\t" " movl $-0x01, %0 \n\t" " orl $0,4(%1) \n\t" " jz ebsf \n\t" " bsfl 4(%1), %0 \n\t" " addl $0x20, %0 \n\t" "ebsf: nop \n\t" :"=r" (idxBit):"r" (&bitboard)); return idxBit; }

LastBit:

Code: Select all: int BitMaisSignificativo(TBitBoard bitboard) { int idxBit = -1; asm(" bsrl 4(%1), %0 \n\t" " jnz ebsr1 \n\t" " movl $-0x01, %0 \n\t" " orl $0,(%1) \n\t" " jz ebsr \n\t" " bsrl (%1), %0 \n\t" " jnz ebsr \n\t" "ebsr1: addl $0x20, %0 \n\t" "ebsr: nop \n\t" :"=r" (idxBit):"r" (&bitboard)); return idxBit; }

Those last two return -1 if the bitboard is empty. And if you know for sure the bitboard is not empty, here are faster versions:

FirstBit:

Code: Select all: int BitMenosSigPreench(TBitBoard bitboard) { int idxBit = -1; asm(" bsfl (%1), %0 \n\t" " jnz efbsf \n\t" " bsfl 4(%1), %0 \n\t" " addl $0x20, %0 \n\t" "efbsf: nop \n\t" :"=r" (idxBit):"r" (&bitboard)); return idxBit; }

LastBit:

Code: Select all: int BitMaisSigPreench(TBitBoard bitboard) { int idxBit = -1; asm(" bsrl 4(%1), %0 \n\t" " jnz efbsr1 \n\t" " bsrl (%1), %0 \n\t" " jnz efbsr \n\t" "efbsr1: addl $0x20, %0 \n\t" "efbsr: nop \n\t" :"=r" (idxBit):"r" (&bitboard)); return idxBit; }

Hey, I might start to like this inline assembly stuff.

by **Pradu** » 19 Jul 2005, 03:39

Hey, I might start to like this inline assembly stuff.

It gets ugly and it is tedious to make the source portable. Atleast thats my view on it...

by **Reinhard Scharnagl** » 19 Jul 2005, 07:36

Hi Pradu,

the popCount() function is suboptimal - am I rigth ?

Reinhard.

by **Dann Corbit** » 19 Jul 2005, 21:59

Reinhard Scharnagl wrote:Hi Pradu,

the popCount() function is suboptimal - am I rigth ?

Reinhard.

Some alternatives:

Code: Select all: #include <assert.h> typedef unsigned long long bitboard; extern unsigned int t0_and_masks(bitboard x); extern unsigned int t1_and_masks(bitboard i); extern unsigned int t2_mult_masks(bitboard x); extern unsigned int t3_rmlsbsub(bitboard n); extern unsigned int t4_rmlsbmask(bitboard n); extern unsigned int t5_testlsb(bitboard n); extern unsigned int t6_testmsb(bitboard n); extern unsigned int t7_testsignandshift(bitboard n); extern unsigned int t8_testeachbit(bitboard n); extern unsigned int t9_testeachbit1shl(bitboard n); extern unsigned int tA_tableshift(bitboard n); extern unsigned int tB_tableuchar(bitboard n); extern unsigned int tC_tableshiftcast(bitboard n); extern unsigned int tD_itableshift(bitboard n); extern unsigned int tE_itableuchar(bitboard n); extern unsigned int tF_itableshiftcast(bitboard n); extern unsigned int tG_parallel(bitboard n); extern unsigned int tH_hamming(bitboard w); unsigned t0_and_masks(bitboard x) { assert(x); x = (x >> 1 & 0x5555555555555555) + (x & 0x5555555555555555); x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333); x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f; x = ((x >> 8) + x) & 0x00ff00ff00ff00ff; x = ((x >> 16) + x) & 0x0000ffff0000ffff; return (unsigned) ((x + (x >> 32)) & 0xff); } unsigned t1_and_masks(bitboard i) { unsigned j; assert(i); i = (i & 0x5555555555555555) + (i >> 1 & 0x5555555555555555); i = (i & 0x3333333333333333) + (i >> 2 & 0x3333333333333333); i = ((i >> 4) + i) & 0x0f0f0f0f0f0f0f0f; j = (unsigned) (i + (i >> 32)); j += j >> 16; return (j + (j >> 8)) & 0xff; } /* * This function counts the bits in a long long. * * It removes the lsb and counting the number of times round the loop. * The expression (n & -n) yields the lsb of a number, * but it only works on 2's compliment machines. */ unsigned t3_rmlsbsub(bitboard n) { unsigned count; assert(n); for (count = 0; n; n -= (n & -n)) count++; return count; } unsigned t4_rmlsbmask(bitboard n) { unsigned count; assert(n); for (count = 0; n; count++) n &= (n - 1); /* take away lsb */ return count; } /* * This function counts the bits in a long long. * * It works by shifting the number down and testing the bottom bit. */ unsigned t5_testlsb(bitboard n) { unsigned count; assert(n); for (count = 0; n; n >>= 1) count += (n & 1); return count; } /* * This function counts the bits in a long long. * * It works by shifting the number left and testing the top bit. * On many machines shift is expensive, so it uses a cheap addition instead. */ unsigned t6_testmsb(bitboard n) { unsigned count; assert(n); for (count = 0; n; n += n) if (n & ~(~(bitboard) 0 >> 1)) count++; return count; } unsigned t7_testsignandshift(bitboard n) { unsigned count; assert(n); for (count = 0; n; n <<= 1) if ((long long) n < 0) count++; return (count); } /* * This function counts the bits in a long long. * * It works by masking each bit. * This is the second most intuitively obvious method, * and is independent of the number of bits in the long long. */ unsigned t8_testeachbit(bitboard n) { unsigned count; bitboard mask; assert(n); count = 0; for (mask = 1; mask; mask += mask) if (n & mask) count++; return count; } /* * This function counts the bits in a long long. * * It works by masking each bit. * This is the most intuitively obvious method, * but how do you a priori know how many bits in the long long? * (except for ''sizeof(long long) * CHAR_BITS'' expression) */ unsigned t9_testeachbit1shl(bitboard n) { unsigned count; unsigned bit; assert(n); count = 0; for (bit = 0; bit < 64; ++bit) if (n & ((bitboard) 1 << bit)) count++; return count; } static const char bits_in_byte[256] = { 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8, }; unsigned tA_tableshift(bitboard n) { assert(n); return (bits_in_byte[n & 0xff] + bits_in_byte[(n >> 8) & 0xff] + bits_in_byte[(n >> 16) & 0xff] + bits_in_byte[(n >> 24) & 0xff] + bits_in_byte[(n >> 32) & 0xff] + bits_in_byte[(n >> 40) & 0xff] + bits_in_byte[(n >> 48) & 0xff] + bits_in_byte[n >> 56] ); } unsigned tB_tableuchar(bitboard n) { unsigned char *p = (unsigned char *) &n; assert(n); return (bits_in_byte[p[0]] + bits_in_byte[p[1]] + bits_in_byte[p[2]] + bits_in_byte[p[3]] + bits_in_byte[p[4]] + bits_in_byte[p[5]] + bits_in_byte[p[6]] + bits_in_byte[p[7]] ); } unsigned tC_tableshiftcast(bitboard n) { assert(n); return bits_in_byte[(unsigned char) n] + bits_in_byte[(unsigned char) (n >> 8)] + bits_in_byte[(unsigned char) (n >> 16)] + bits_in_byte[(unsigned char) (n >> 24)] + bits_in_byte[(unsigned char) (n >> 32)] + bits_in_byte[(unsigned char) (n >> 40)] + bits_in_byte[(unsigned char) (n >> 48)] + bits_in_byte[(unsigned char) (n >> 56)]; } unsigned tD_itableshift(bitboard n) { assert(n); return (bits_in_byte[n & 0xff] + bits_in_byte[(n >> 8) & 0xff] + bits_in_byte[(n >> 16) & 0xff] + bits_in_byte[(n >> 24) & 0xff] + bits_in_byte[(n >> 32) & 0xff] + bits_in_byte[(n >> 40) & 0xff] + bits_in_byte[(n >> 48) & 0xff] + bits_in_byte[(n >> 56)] ); } unsigned tE_itableuchar(bitboard n) { unsigned char *p = (unsigned char *) &n; assert(n); return (bits_in_byte[p[0]] + bits_in_byte[p[1]] + bits_in_byte[p[2]] + bits_in_byte[p[3]] + bits_in_byte[p[4]] + bits_in_byte[p[5]] + bits_in_byte[p[6]] + bits_in_byte[p[7]] ); } unsigned tF_itableshiftcast(bitboard n) { assert(n); return bits_in_byte[(unsigned char) n] + bits_in_byte[(unsigned char) (n >> 8)] + bits_in_byte[(unsigned char) (n >> 16)] + bits_in_byte[(unsigned char) (n >> 24)] + bits_in_byte[(unsigned char) (n >> 32)] + bits_in_byte[(unsigned char) (n >> 40)] + bits_in_byte[(unsigned char) (n >> 48)] + bits_in_byte[(unsigned char) (n >> 56)]; } unsigned tG_parallel(bitboard n) { assert(n); n = ((n) & (0xffffffffffffffff / ((1 << ((1 << (0)))) + 1))) + (((n) >> ((1 << (0)))) & (0xffffffffffffffff / ((1 << ((1 << (0)))) + 1))); n = ((n) & (0xffffffffffffffff / ((1 << ((1 << (1)))) + 1))) + (((n) >> ((1 << (1)))) & (0xffffffffffffffff / ((1 << ((1 << (1)))) + 1))); n = ((n) & (0xffffffffffffffff / ((1 << ((1 << (2)))) + 1))) + (((n) >> ((1 << (2)))) & (0xffffffffffffffff / ((1 << ((1 << (2)))) + 1))); n = ((n) & (0xffffffffffffffff / ((1 << ((1 << (3)))) + 1))) + (((n) >> ((1 << (3)))) & (0xffffffffffffffff / ((1 << ((1 << (3)))) + 1))); n = ((n) & (0xffffffffffffffff / ((1 << ((1 << (4)))) + 1))) + (((n) >> ((1 << (4)))) & (0xffffffffffffffff / ((1 << ((1 << (4)))) + 1))); n = ((n) & (0xffffffffffffffff / ((1 << ((1 << (5)))) + 1))) + (((n) >> ((1 << (5)))) & (0xffffffffffffffff / ((1 << ((1 << (5)))) + 1))); return (unsigned) n; } unsigned tH_hamming(bitboard w) { bitboard res; assert(w); res = (w & 0x5555555555555555) + ((w >> 1) & 0x5555555555555555); res = (res & 0x3333333333333333) + ((res >> 2) & 0x3333333333333333); res = (res & 0x0F0F0F0F0F0F0F0F) + ((res >> 4) & 0x0F0F0F0F0F0F0F0F); res = (res & 0x00FF00FF00FF00FF) + ((res >> 8) & 0x00FF00FF00FF00FF); res = (res & 0x0000FFFF0000FFFF) + ((res >> 16) & 0x0000FFFF0000FFFF); return (unsigned) ((res & 0x00000000FFFFFFFF) + ((res >> 32) & 0x00000000FFFFFFFF)); } #ifdef UNIT_TEST #include <stdio.h> #include <stdlib.h> #include <string.h> int main(void) { unsigned int ut0_and_masks = 0; unsigned int ut1_and_masks = 0; unsigned int ut3_rmlsbsub = 0; unsigned int ut4_rmlsbmask = 0; unsigned int ut5_testlsb = 0; unsigned int ut6_testmsb = 0; unsigned int ut7_testsignandshift = 0; unsigned int ut8_testeachbit = 0; unsigned int ut9_testeachbit1shl = 0; unsigned int utA_tableshift = 0; unsigned int utB_tableuchar = 0; unsigned int utC_tableshiftcast = 0; unsigned int utD_itableshift = 0; unsigned int utE_itableuchar = 0; unsigned int utF_itableshiftcast = 0; unsigned int utG_parallel = 0; unsigned int utH_hamming = 0; bitboard b; int i; for (i = 0; i < 10000000; i++) { #ifdef DENSE b = rand(); b <<= 32; b |= rand(); #else b = (1 << (rand() % 64)); b <<= 32; b |= (1 << (rand() % 64)); #endif ut0_and_masks = t0_and_masks(b); ut1_and_masks = t1_and_masks(b); ut3_rmlsbsub = t3_rmlsbsub(b); ut4_rmlsbmask = t4_rmlsbmask(b); ut5_testlsb = t5_testlsb(b); ut6_testmsb = t6_testmsb(b); ut7_testsignandshift = t7_testsignandshift(b); ut8_testeachbit = t8_testeachbit(b); ut9_testeachbit1shl = t9_testeachbit1shl(b); utA_tableshift = tA_tableshift(b); utB_tableuchar = tB_tableuchar(b); utC_tableshiftcast = tC_tableshiftcast(b); utD_itableshift = tD_itableshift(b); utE_itableuchar = tE_itableuchar(b); utF_itableshiftcast = tF_itableshiftcast(b); utG_parallel = tG_parallel(b); utH_hamming = tH_hamming(b); if ((ut0_and_masks != ut1_and_masks) || (ut0_and_masks != ut3_rmlsbsub) || (ut0_and_masks != ut4_rmlsbmask) || (ut0_and_masks != ut5_testlsb) || (ut0_and_masks != ut6_testmsb) || (ut0_and_masks != ut7_testsignandshift) || (ut0_and_masks != ut9_testeachbit1shl) || (ut0_and_masks != utA_tableshift) || (ut0_and_masks != utB_tableuchar) || (ut0_and_masks != utC_tableshiftcast) || (ut0_and_masks != utD_itableshift) || (ut0_and_masks != utE_itableuchar) || (ut0_and_masks != utF_itableshiftcast) || (ut0_and_masks != utG_parallel) || (ut0_and_masks != utH_hamming) ) { printf("0: %u 1: %u\n", ut0_and_masks, ut1_and_masks); printf("0: %u 3: %u\n", ut0_and_masks, ut3_rmlsbsub); printf("0: %u 4: %u\n", ut0_and_masks, ut4_rmlsbmask); printf("0: %u 5: %u\n", ut0_and_masks, ut5_testlsb); printf("0: %u 6: %u\n", ut0_and_masks, ut6_testmsb); printf("0: %u 7: %u\n", ut0_and_masks, ut7_testsignandshift); printf("0: %u 9: %u\n", ut0_and_masks, ut9_testeachbit1shl); printf("0: %u A: %u\n", ut0_and_masks, utA_tableshift); printf("0: %u B: %u\n", ut0_and_masks, utB_tableuchar); printf("0: %u C: %u\n", ut0_and_masks, utC_tableshiftcast); printf("0: %u D: %u\n", ut0_and_masks, utD_itableshift); printf("0: %u E: %u\n", ut0_and_masks, utE_itableuchar); printf("0: %u F: %u\n", ut0_and_masks, utF_itableshiftcast); printf("0: %u G: %u\n", ut0_and_masks, utG_parallel); printf("0: %u H: %u\n", ut0_and_masks, utH_hamming); } b = 0xffffffffffffffff; ut0_and_masks = t0_and_masks(b); ut1_and_masks = t1_and_masks(b); ut3_rmlsbsub = t3_rmlsbsub(b); ut4_rmlsbmask = t4_rmlsbmask(b); ut5_testlsb = t5_testlsb(b); ut6_testmsb = t6_testmsb(b); ut7_testsignandshift = t7_testsignandshift(b); ut8_testeachbit = t8_testeachbit(b); ut9_testeachbit1shl = t9_testeachbit1shl(b); utA_tableshift = tA_tableshift(b); utB_tableuchar = tB_tableuchar(b); utC_tableshiftcast = tC_tableshiftcast(b); utD_itableshift = tD_itableshift(b); utE_itableuchar = tE_itableuchar(b); utF_itableshiftcast = tF_itableshiftcast(b); utG_parallel = tG_parallel(b); utH_hamming = tH_hamming(b); if ((ut0_and_masks != ut1_and_masks) || (ut0_and_masks != ut3_rmlsbsub) || (ut0_and_masks != ut4_rmlsbmask) || (ut0_and_masks != ut5_testlsb) || (ut0_and_masks != ut6_testmsb) || (ut0_and_masks != ut7_testsignandshift) || (ut0_and_masks != ut9_testeachbit1shl) || (ut0_and_masks != utA_tableshift) || (ut0_and_masks != utB_tableuchar) || (ut0_and_masks != utC_tableshiftcast) || (ut0_and_masks != utD_itableshift) || (ut0_and_masks != utE_itableuchar) || (ut0_and_masks != utF_itableshiftcast) || (ut0_and_masks != utG_parallel) || (ut0_and_masks != utH_hamming) ) { printf("0: %u 1: %u\n", ut0_and_masks, ut1_and_masks); printf("0: %u 3: %u\n", ut0_and_masks, ut3_rmlsbsub); printf("0: %u 4: %u\n", ut0_and_masks, ut4_rmlsbmask); printf("0: %u 5: %u\n", ut0_and_masks, ut5_testlsb); printf("0: %u 6: %u\n", ut0_and_masks, ut6_testmsb); printf("0: %u 7: %u\n", ut0_and_masks, ut7_testsignandshift); printf("0: %u 9: %u\n", ut0_and_masks, ut9_testeachbit1shl); printf("0: %u A: %u\n", ut0_and_masks, utA_tableshift); printf("0: %u B: %u\n", ut0_and_masks, utB_tableuchar); printf("0: %u C: %u\n", ut0_and_masks, utC_tableshiftcast); printf("0: %u D: %u\n", ut0_and_masks, utD_itableshift); printf("0: %u E: %u\n", ut0_and_masks, utE_itableuchar); printf("0: %u F: %u\n", ut0_and_masks, utF_itableshiftcast); printf("0: %u G: %u\n", ut0_and_masks, utG_parallel); printf("0: %u H: %u\n", ut0_and_masks, utH_hamming); } b = 1; ut0_and_masks = t0_and_masks(b); ut1_and_masks = t1_and_masks(b); ut3_rmlsbsub = t3_rmlsbsub(b); ut4_rmlsbmask = t4_rmlsbmask(b); ut5_testlsb = t5_testlsb(b); ut6_testmsb = t6_testmsb(b); ut7_testsignandshift = t7_testsignandshift(b); ut8_testeachbit = t8_testeachbit(b); ut9_testeachbit1shl = t9_testeachbit1shl(b); utA_tableshift = tA_tableshift(b); utB_tableuchar = tB_tableuchar(b); utC_tableshiftcast = tC_tableshiftcast(b); utD_itableshift = tD_itableshift(b); utE_itableuchar = tE_itableuchar(b); utF_itableshiftcast = tF_itableshiftcast(b); utG_parallel = tG_parallel(b); utH_hamming = tH_hamming(b); if ((ut0_and_masks != ut1_and_masks) || (ut0_and_masks != ut3_rmlsbsub) || (ut0_and_masks != ut4_rmlsbmask) || (ut0_and_masks != ut5_testlsb) || (ut0_and_masks != ut6_testmsb) || (ut0_and_masks != ut7_testsignandshift) || (ut0_and_masks != ut9_testeachbit1shl) || (ut0_and_masks != utA_tableshift) || (ut0_and_masks != utB_tableuchar) || (ut0_and_masks != utC_tableshiftcast) || (ut0_and_masks != utD_itableshift) || (ut0_and_masks != utE_itableuchar) || (ut0_and_masks != utF_itableshiftcast) || (ut0_and_masks != utG_parallel) || (ut0_and_masks != utH_hamming) ) { printf("0: %u 1: %u\n", ut0_and_masks, ut1_and_masks); printf("0: %u 3: %u\n", ut0_and_masks, ut3_rmlsbsub); printf("0: %u 4: %u\n", ut0_and_masks, ut4_rmlsbmask); printf("0: %u 5: %u\n", ut0_and_masks, ut5_testlsb); printf("0: %u 6: %u\n", ut0_and_masks, ut6_testmsb); printf("0: %u 7: %u\n", ut0_and_masks, ut7_testsignandshift); printf("0: %u 9: %u\n", ut0_and_masks, ut9_testeachbit1shl); printf("0: %u A: %u\n", ut0_and_masks, utA_tableshift); printf("0: %u B: %u\n", ut0_and_masks, utB_tableuchar); printf("0: %u C: %u\n", ut0_and_masks, utC_tableshiftcast); printf("0: %u D: %u\n", ut0_and_masks, utD_itableshift); printf("0: %u E: %u\n", ut0_and_masks, utE_itableuchar); printf("0: %u F: %u\n", ut0_and_masks, utF_itableshiftcast); printf("0: %u G: %u\n", ut0_and_masks, utG_parallel); printf("0: %u H: %u\n", ut0_and_masks, utH_hamming); } } return 0; } #endif

by **Pradu** » 19 Jul 2005, 22:39

Reinhard Scharnagl wrote:Hi Pradu,

the popCount() function is suboptimal - am I rigth ?

Reinhard.

I'm not sure Reinhard, but I do know it was good enough to be used in DarkThought.

Quoting Ernst A. Heinz:
"It performs better than intuitive methods with lookup tables because the tables get either too large or need too many lookups."

I can't find any explaination for the non-iterative popCount() function anywhere and am just going by Heinz's word right now, unless someone or some paper I come across can prove to me otherwise.

by **Reinhard Scharnagl** » 19 Jul 2005, 23:13

Hi Pradu,

How about:

Code: Select all: /* Trial (R. Scharnagl, first idea, not optimized) */ /* (P.S.: now unsigned used at several places) */ #define msk1 0xEEEEEEEE #define msk2 0xCCCCCCCC #define msk3 0x88888888 #define msk4 0x0F0F0F0F int popCount(const unsigned long long b) { unsigned buf; int acc; buf = (unsigned)b; acc = buf - ((buf & msk1)>>1) - ((buf & msk2)>>2) - ((buf & msk3)>>3); buf = ((unsigned *)&b)[1]; /* Intel format */ acc += buf - ((buf & msk1)>>1) - ((buf & msk2)>>2) - ((buf & msk3)>>3); acc = (acc & msk4) + ((acc >> 4) & msk4); acc = (acc & 0xFFFF) + (acc >> 16); acc = (acc & 0xFF) + (acc >> 8); return acc; }

Reinhard.
P.S.:

Code: Select all: /* Trial (R. Scharnagl, first idea, not optimized) */ /* (might be now an endian independent form) */ #define msk1 0xEEEEEEEE #define msk2 0xCCCCCCCC #define msk3 0x88888888 #define msk4 0x0F0F0F0F int popCount(const unsigned long long b) { unsigned buf; int acc; buf = ((unsigned *)&b)[0]; acc = buf - ((buf & msk1)>>1) - ((buf & msk2)>>2) - ((buf & msk3)>>3); buf = ((unsigned *)&b)[1]; acc += buf - ((buf & msk1)>>1) - ((buf & msk2)>>2) - ((buf & msk3)>>3); acc = (acc & msk4) + ((acc >> 4) & msk4); acc = (acc & 0xFFFF) + (acc >> 16); acc = (acc & 0xFF) + (acc >> 8); return acc; }

by **Pradu** » 19 Jul 2005, 23:20

All right let me test it and post some results in a little while.

by **Pradu** » 19 Jul 2005, 23:57

It isn't as good.

I hope my testing procedure is good enough

Test Program:

Code: Select all: #include <stdio.h> #include <stdlib.h> #include <math.h> #include <sys/timeb.h> typedef unsigned long long U64; #define rand64() (((U64)rand())^(((U64)rand())<<15)^(((U64)rand())<<30)^(((U64)rand())<<45)^(((U64)rand())<<60)) #define TABLE_SIZE 10000000 U64 randtable[TABLE_SIZE]; int getms() { struct timeb timebuffer; ftime(&timebuffer); return (timebuffer.time * 1000) + timebuffer.millitm; } #define m1 0x5555555555555555ULL #define m2 0x3333333333333333ULL int popCount(const U64 b) { int n; const U64 a = b - ((b >> 1) & m1); const U64 c = (a & m2) + ((a >> 2) & m2); n = ((int) c) + ((int) (c >> 32)); n = (n & 0x0F0F0F0F) + ((n >> 4) & 0x0F0F0F0F); n = (n & 0xFFFF) + (n >> 16); n = (n & 0xFF) + (n >> 8); return n; } #define msk1 0xEEEEEEEE #define msk2 0xCCCCCCCC #define msk3 0x88888888 #define msk4 0x0F0F0F0F int popCount2(const U64 b) /* P.S.: might be unsigned */ { int buf; int acc; buf = (int)b; acc = buf - ((buf & msk1)>>1) - ((buf & msk2)>>2) - ((buf & msk3)>>3); buf = ((int *)&b)[1]; /* Intel format */ acc += buf - ((buf & msk1)>>1) - ((buf & msk2)>>2) - ((buf & msk3)>>3); acc = (acc & msk4) + ((acc >> 4) & msk4); acc = (acc & 0xFFFF) + (acc >> 16); acc = (acc & 0xFF) + (acc >> 8); return acc; } int main(int argc, char *argv[]) { int i, time; time=getms(); for(i=0;i<TABLE_SIZE;i++) randtable[i]=rand64(); printf("Initialized rand Table: %d ms\n",getms()-time); time=getms(); for(i=0;i<TABLE_SIZE;i++) popCount2(randtable[i]); printf("popCount2: %d ms\n",getms()-time); time=getms(); for(i=0;i<TABLE_SIZE;i++) popCount(randtable[i]); printf("popCount: %d ms\n",getms()-time); system("PAUSE"); return 0; }

Results on Pentium II 448 Mhz:

Code: Select all: Initialized rand Table: 7150 ms popCount2: 370 ms popCount: 51 ms

with popCount2 as:

Code: Select all: int popCount2(const U64 b) /* P.S.: might be unsigned */ { unsigned buf; int acc; buf = (int)b; acc = buf - ((buf & msk1)>>1) - ((buf & msk2)>>2) - ((buf & msk3)>>3); buf = ((int *)&b)[1]; /* Intel format */ acc += buf - ((buf & msk1)>>1) - ((buf & msk2)>>2) - ((buf & msk3)>>3); acc = (acc & msk4) + ((acc >> 4) & msk4); acc = (acc & 0xFFFF) + (acc >> 16); acc = (acc & 0xFF) + (acc >> 8); return acc; }

Results are:

Code: Select all: Initialized rand Table: 6709 ms popCount2: 321 ms popCount: 60 ms

by **Reinhard Scharnagl** » 20 Jul 2005, 00:02

Pradu,

thank you for testing,
could be wrong defined constants, try:

Code: Select all: #define msk1 0xEEEEEEEEUL #define msk2 0xCCCCCCCCUL #define msk3 0x88888888UL #define msk4 0x0F0F0F0FUL

Reinhard.

P.S. in debug mode I have got

Initialized rand Table: 2594 ms
popCount2: 1922 ms
popCount: 2609 ms

by **Pradu** » 20 Jul 2005, 00:16

With the unsigned constants the results havnen't changed much:

Code: Select all: Initialized rand Table: 6660 ms popCount2: 330 ms popCount: 50 ms Initialized rand Table: 6600 ms popCount2: 320 ms popCount: 60 ms Initialized rand Table: 7230 ms popCount2: 321 ms popCount: 50 ms

P.S. in debug mode I have got

Initialized rand Table: 2594 ms
popCount2: 1922 ms
popCount: 2609 ms

P.S. in debug mode I have got

Initialized rand Table: 2594 ms
popCount2: 1922 ms
popCount: 2609 ms

Quite odd -- I'm using GCC compiler with optimizations. I wonder why the popCount takes much longer in your compile. Could you post your source and exe on a server or email it to me.

by **Reinhard Scharnagl** » 20 Jul 2005, 00:19

Pradu,

in release mode I get only zero times ... ?

Reinhard.

by **Pradu** » 20 Jul 2005, 00:27

Reinhard Scharnagl wrote:Pradu,

in release mode I get only zero times ... ?

Reinhard.

I'm not sure how that happened. What compiler do you use and what processor?

This is what I used to compile:

Code: Select all: gcc.exe main.o -o "PopCountTest.exe" -L"F:/Dev-Cpp/lib" -lobjc

by **Dann Corbit** » 20 Jul 2005, 00:29

An important thing to think about is that almost all the time your bitboards will be very sparse. Even a full board with no chessmen removed is only half full of bits.

Your random function creates dense bitboards.

It would be good to experiment with both kinds.

by **Reinhard Scharnagl** » 20 Jul 2005, 00:33

Pradu,

it has been an effect of the optimizer in VStudio, thus I changed the test somehow:

Code: Select all: #include <stdio.h> #include <stdlib.h> #include <math.h> #include <sys/timeb.h> typedef unsigned long long U64; #define rand64() (((U64)rand())^(((U64)rand())<<15)^(((U64)rand())<<30)^(((U64)rand())<<45)^(((U64)rand())<<60)) #define TABLE_SIZE 10000000 U64 randtable[TABLE_SIZE]; int getms() { struct timeb timebuffer; ftime(&timebuffer); return (timebuffer.time * 1000) + timebuffer.millitm; } #define m1 0x5555555555555555ULL #define m2 0x3333333333333333ULL int popCount(const U64 b) { int n; const U64 a = b - ((b >> 1) & m1); const U64 c = (a & m2) + ((a >> 2) & m2); n = ((int) c) + ((int) (c >> 32)); n = (n & 0x0F0F0F0F) + ((n >> 4) & 0x0F0F0F0F); n = (n & 0xFFFF) + (n >> 16); n = (n & 0xFF) + (n >> 8); return n; } /* Trial (R. Scharnagl, first idea, not optimized) */ /* (might be now an endian independent form) */ #define msk1 0xEEEEEEEEUL #define msk2 0xCCCCCCCCUL #define msk3 0x88888888UL #define msk4 0x0F0F0F0FUL int popCount2(const U64 b) { unsigned buf; int acc; buf = (unsigned)b; acc = buf - ((buf & msk1)>>1) - ((buf & msk2)>>2) - ((buf & msk3)>>3); buf = (unsigned)(b>>32); acc += buf - ((buf & msk1)>>1) - ((buf & msk2)>>2) - ((buf & msk3)>>3); acc = (acc & msk4) + ((acc >> 4) & msk4); acc = (acc & 0xFFFF) + (acc >> 16); acc = (acc & 0xFF) + (acc >> 8); return acc; } int main(int argc, char *argv[]) { int i, loop, sum; volatile int time=getms(); for(i=0;i<TABLE_SIZE;i++) randtable[i]=rand64(); printf("Initialized rand Table: %d ms\n",getms()-time); time=getms(); sum= 0; for(loop = 25; --loop >=0; ) for(i=0;i<TABLE_SIZE;i++) sum += popCount2(randtable[i]); printf("popCount2: %d ms (%d)\n",getms()-time, sum); time=getms(); sum= 0; for(loop = 25; --loop >=0; ) for(i=0;i<TABLE_SIZE;i++) sum += popCount(randtable[i]); printf("popCount: %d ms (%d)\n",getms()-time, sum); system("PAUSE"); return 0; }

Code: Select all: Initialized rand Table: 1156 ms popCount2: 2797 ms (-589497692) popCount: 4312 ms (-589497692)

Reinhard

by **Pradu** » 20 Jul 2005, 00:44

popCount2 seems to be faster now:

Code: Select all: Initialized rand Table: 6539 ms popCount2: 25967 ms (-589506992) popCount: 26619 ms (-589506992)

The addition operation seems to take up a lot of processor time:

Code: Select all: #include <stdio.h> #include <stdlib.h> #include <math.h> #include <sys/timeb.h> typedef unsigned long long U64; #define rand64() (((U64)rand())^(((U64)rand())<<15)^(((U64)rand())<<30)^(((U64)rand())<<45)^(((U64)rand())<<60)) #define TABLE_SIZE 10000000 U64 randtable[TABLE_SIZE]; int getms() { struct timeb timebuffer; ftime(&timebuffer); return (timebuffer.time * 1000) + timebuffer.millitm; } #define m1 0x5555555555555555ULL #define m2 0x3333333333333333ULL int popCount(const U64 b) { int n; const U64 a = b - ((b >> 1) & m1); const U64 c = (a & m2) + ((a >> 2) & m2); n = ((int) c) + ((int) (c >> 32)); n = (n & 0x0F0F0F0F) + ((n >> 4) & 0x0F0F0F0F); n = (n & 0xFFFF) + (n >> 16); n = (n & 0xFF) + (n >> 8); return n; } /* Trial (R. Scharnagl, first idea, not optimized) */ /* (might be now an endian independent form) */ #define msk1 0xEEEEEEEEUL #define msk2 0xCCCCCCCCUL #define msk3 0x88888888UL #define msk4 0x0F0F0F0FUL int popCount2(const U64 b) { unsigned buf; int acc; buf = (unsigned)b; acc = buf - ((buf & msk1)>>1) - ((buf & msk2)>>2) - ((buf & msk3)>>3); buf = (unsigned)(b>>32); acc += buf - ((buf & msk1)>>1) - ((buf & msk2)>>2) - ((buf & msk3)>>3); acc = (acc & msk4) + ((acc >> 4) & msk4); acc = (acc & 0xFFFF) + (acc >> 16); acc = (acc & 0xFF) + (acc >> 8); return acc; } int main(int argc, char *argv[]) { int i, loop, sum; volatile int time=getms(); for(i=0;i<TABLE_SIZE;i++) randtable[i]=rand64(); printf("Initialized rand Table: %d ms\n",getms()-time); time=getms(); sum= 0; for(loop = 25; --loop >=0; ) for(i=0;i<TABLE_SIZE;i++) /*sum += */popCount2(randtable[i]); printf("popCount2: %d ms\n",getms()-time/*, sum*/); time=getms(); sum= 0; for(loop = 25; --loop >=0; ) for(i=0;i<TABLE_SIZE;i++) /*sum += */popCount(randtable[i]); printf("popCount: %d ms\n",getms()-time/*, sum*/); system("PAUSE"); return 0; }

Code: Select all: Initialized rand Table: 6690 ms popCount2: 1322 ms popCount: 1302 ms

The popCount2 seems to be as good as popCount with GCC, I'll have to try to compile it with Visual Studio 2003 C compiler when I get back to my dorm room with my much faster P4 2.8 Ghz computer :-)

.

Thanks Reinhard.

Winboard Forum

GCC Inline asm problem

GCC Inline asm problem

Re: GCC Inline asm problem

Re: GCC Inline asm problem

Re: GCC Inline asm problem

not ASM popCnt

Re: GCC Inline asm problem

Re: GCC Inline asm problem

Re: GCC Inline asm problem

Re: GCC Inline asm problem

Re: GCC Inline asm problem

Re: GCC Inline asm problem

Re: GCC Inline asm problem

Re: GCC Inline asm problem

Re: GCC Inline asm problem

Re: GCC Inline asm problem

Re: GCC Inline asm problem

Re: GCC Inline asm problem

Re: GCC Inline asm problem

Re: GCC Inline asm problem

Re: GCC Inline asm problem

Who is online