Winboard Forum

by **Gerd Isenberg** » 01 Aug 2005, 21:29

Dieter Bürßner wrote:
Gerd Isenberg wrote:Since leading zero count is also necessary to convert ints to floats/doubles, there are also some more or less portable tricks to interpret the binary representation of a double, namely the base-two exponent of a normalized mantissa.

Interesting thought. Especially in C99, it should be very portable for 32-bit integers. C99 has the ilogb() function (it has long been available on many systems). A good compiler should even inline ilogb(). I think it will not work for 64-bit integers, however. We can pretty much assume IEEE 754 floating-point representation (and this can even be checked), but we cannot assume that we have a floating-point type with enough precision for 64-bit numbers.

Cheers,
Dieter
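A minimal sketch of the ilogb() idea for 32-bit values, assuming IEEE 754 doubles with a 53-bit significand (so every 32-bit integer converts exactly). The function name msb32_ilogb is made up for illustration and does not appear in the thread.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Index (0..31) of the most significant set bit of a non-zero 32-bit value.
// The conversion to double is exact, so ilogb() returns the unbiased base-2
// exponent, which is the MSB index.
inline int msb32_ilogb(std::uint32_t x) {
    assert(x != 0);  // ilogb(0) yields FP_ILOGB0, which is not a bit index
    return std::ilogb(static_cast<double>(x));
}
```

The same call on a 64-bit value can be off by one, because a value with more than 53 significant bits may round up to the next power of two during the conversion.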

Hi Dieter,

didn't you once post some C code in CCC where a 64-bit LSB isolated by x & -x was converted to a double, while an unsigned char pointer was aliasing that double, interpreting the exponent (and sign bit)?

Since we are only interested in the MSB, rounding toward zero or down

AMD64 Architecture Programmer's Manual
Volume 1: Application Programming
Page 158 4.4.9 Floating-Point Rounding

affects only the lowest bit of the normalized mantissa without any overflow. I guess even a float conversion is exact enough to determine the MSB from the exponent that way.

But if I look for appropriate SSE or 3DNow! instructions for converting a 64-bit int, there seems to be only CVTSI2SD, which at 11 cycles (double dispatch) is not particularly fast, and converting to a scalar 4-byte float with CVTSI2SS even takes 14 cycles (vector path).

There are faster SIMD instructions, like CVTDQ2PS (5 cycles, double dispatch), which converts a vector of four 32-bit ints to four 32-bit floats, but I fear it is not worth it, also because of the hassle with the non-default rounding control (bits 14-13 of the MXCSR control and status register).

I once tried LSB with a 3DNow!/MMX version via PI2FD, which is only 4 cycles (direct path). Not that bad if you like to bitscan in MMX ;-)

Gerd

by **Anonymous** » 02 Aug 2005, 19:36

Hi Gerd,

> didn't you once post some C code in CCC where a 64-bit LSB isolated
> by x & -x was converted to a double, while an unsigned char pointer was
> aliasing that double, interpreting the exponent (and sign bit)?

Sounds quite possible. But this case is much easier. And it is even almost portable (when we assume IEEE floating point, which is the case for any modern system I know, and when we take care of the endianness). It can be coded in pure C.

> Since we are only interested in MSB some rounding toward zero or down

Correct. Indeed, with the rounding mode set to round down, it will work even in float precision. But there is no way to set the rounding mode in a "more or less portable [and efficient] way" (with C99, it is possible in principle). It is also clear that it would be very inefficient to set and reset the rounding mode before and after every call to the find_msb() function. So you would need to set the rounding mode once and not reset it afterwards. But could you be sure that the compiler will leave it that way? For example, a compiler might change the rounding mode when casting an integer to a float. Then it might reset the rounding mode to the default rounding mode (not to the rounding mode you had set). It might even be possible that some library function (or some compiler-generated code in general) depends on the rounding mode being set to the default, and might misbehave when it is not. Even if tests showed that it works, you couldn't really be sure that it will still work after some totally unrelated code changes.

On x86, you also need a branch to load an unsigned 64-bit integer into a floating-point register (there is only an instruction to load a signed 64-bit integer). So, all in all, it does not sound too attractive to me.

I cannot comment on the SIMD, 3dnow, ... tricks.

Cheers,
Dieter
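The C99 facility mentioned above can be sketched as follows, using the <cfenv> functions fegetround() and fesetround(). This is the slow set-and-restore-per-call form that the post warns about, shown only to illustrate the idea. The function name is hypothetical, the volatile temporaries are an assumption made here to keep the conversion at run time, and strictly conforming code would additionally need #pragma STDC FENV_ACCESS ON, which not every compiler supports.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>
#include <cstdint>

// MSB index of a non-zero 64-bit value via a float conversion that is
// performed with rounding toward zero, so the value can never round up
// to the next power of two.
inline int msb64_float_toward_zero(std::uint64_t bb) {
    assert(bb != 0);
    const int saved = std::fegetround();  // remember the caller's mode
    std::fesetround(FE_TOWARDZERO);
    volatile std::uint64_t in = bb;       // volatile: convert at run time
    volatile float f = static_cast<float>(in);
    std::fesetround(saved);               // restore the caller's mode
    return std::ilogb(f);
}
```

With the default round-to-nearest mode, an input such as 0x01FFFFFF would convert to 2^25 and report bit 25 instead of bit 24.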

by **Anonymous** » 02 Aug 2005, 19:45

Dann Corbit wrote:3 EXAMPLE An example of undefined behavior is the behavior on integer overflow."

But it is just an example.

From section "J.2 Undefined behavior" we do have this:
"- The value of the result of an integer arithmetic or conversion function cannot be represented (7.8.2.1, 7.8.2.2, 7.8.2.3, 7.8.2.4, 7.20.6.1, 7.20.6.2, 7.20.1)."

Thanks, Dann! This was exactly what I was looking for (and what I expected). It means that even

Code: Select all
long nodes;
int absearch() {
    nodes++;
    [...]
}

can result in undefined behaviour (when calculating enough nodes).

Regards,
Dieter

by **Dann Corbit** » 02 Aug 2005, 19:50

Dieter Bürßner wrote:
Dann Corbit wrote:3 EXAMPLE An example of undefined behavior is the behavior on integer overflow."

But it is just an example.

From section "J.2 Undefined behavior" we do have this:
"- The value of the result of an integer arithmetic or conversion function cannot be represented (7.8.2.1, 7.8.2.2, 7.8.2.3, 7.8.2.4, 7.20.6.1, 7.20.6.2, 7.20.1)."

Thanks, Dann! This was exactly what I was looking for (and what I expected). It means that even

Code: Select all
long nodes;
int absearch() {
    nodes++;
    [...]
}

can result in undefined behaviour (when calculating enough nodes).

Regards,
Dieter

Yes, and conversely:

Code: Select all
unsigned long nodes;
int absearch() {
    nodes++; /* This cannot cause undefined behavior */
    /*... */
}
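A small sketch of the difference discussed above: unsigned arithmetic wraps modulo 2^N and is always defined, while a signed counter has to be checked before the increment. Both helper names are invented for this illustration.

```cpp
#include <cassert>
#include <climits>

// Unsigned increment: wraps from ULONG_MAX to 0, which is well defined.
inline unsigned long next_unsigned(unsigned long n) {
    return n + 1;
}

// Signed increment with an explicit guard: returns false and leaves the
// counter untouched instead of overflowing (which would be undefined).
inline bool increment_checked(long &n) {
    if (n == LONG_MAX) return false;
    ++n;
    return true;
}
```

In a search routine the unsigned counter is the cheaper choice, since it needs no branch per node.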

by **Gerd Isenberg** » 02 Aug 2005, 20:49

Dieter Bürßner wrote:Hi Gerd,

> didn't you once post some C code in CCC where a 64-bit LSB isolated
> by x & -x was converted to a double, while an unsigned char pointer was
> aliasing that double, interpreting the exponent (and sign bit)?

Sounds quite possible. But this case is much easier. And it is even almost portable (when we assume IEEE floating point, which is the case for any modern system I know, and when we take care of the endianness). It can be coded in pure C.

> Since we are only interested in MSB some rounding toward zero or down

Correct. Indeed, with the rounding mode set to round down, it will work even in float precision. But there is no way to set the rounding mode in a "more or less portable [and efficient] way" (with C99, it is possible in principle). It is also clear that it would be very inefficient to set and reset the rounding mode before and after every call to the find_msb() function. So you would need to set the rounding mode once and not reset it afterwards. But could you be sure that the compiler will leave it that way? For example, a compiler might change the rounding mode when casting an integer to a float. Then it might reset the rounding mode to the default rounding mode (not to the rounding mode you had set). It might even be possible that some library function (or some compiler-generated code in general) depends on the rounding mode being set to the default, and might misbehave when it is not. Even if tests showed that it works, you couldn't really be sure that it will still work after some totally unrelated code changes.

On x86, you also need a branch to load an unsigned 64-bit integer into a floating-point register (there is only an instruction to load a signed 64-bit integer). So, all in all, it does not sound too attractive to me.

I cannot comment on the SIMD, 3dnow, ... tricks.

Cheers,
Dieter

Hi Dieter,

yes, to avoid further rounding issues one may clear the bit at MSB>>32.
This one works with MSC on my box, but I fear that on x87 platforms it is much too slow.

Gerd

Code: Select all
typedef unsigned __int64 BitBoard;

// return index 0..63 of MSB
// -1023 if passing zero
unsigned int bitScanReverse(BitBoard bb)
{
    union {
        double d;
        struct {
            unsigned int mantissal : 32;
            unsigned int mantissah : 20;
            unsigned int exponent : 11;
            unsigned int sign : 1;
        };
    } ud;
    ud.d = (double)(bb & ~(bb >> 32));
    return ud.exponent - 1023;
}

A signed conversion looks a bit faster, if one inspects the assembly.

Code: Select all
unsigned int bitScanReverse(BitBoard bb)
{
    union {
        double d;
        struct {
            unsigned int mantissal : 32;
            unsigned int mantissah : 20;
            unsigned int exponent : 11;
            unsigned int sign : 1;
        };
    } ud;
    ud.d = (double)(__int64)(bb & ~(bb >> 32));
    unsigned int idx = (ud.exponent - 1023) | (63*ud.sign);
    // printf ("0x%08x%08x %5d %25.4f\n", (unsigned int)(bb>>32), (unsigned int)bb, idx, ud.d );
    return idx;
}
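For readers without MSVC, the signed variant above can be restated without the anonymous bit-field struct. This is a sketch under the assumption of IEEE 754 doubles; it reads the bits with memcpy instead of a union, and the name bsr_signed is hypothetical. When bit 63 is set, the signed conversion yields a negative double, so the sign bit is 1 and OR-ing in 63 forces the result to 63 whatever the exponent says.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// MSB index (0..63) of a non-zero 64-bit value via a signed int -> double
// conversion, following the structure of the code above.
inline unsigned bsr_signed(std::uint64_t bb) {
    assert(bb != 0);
    bb &= ~(bb >> 32);  // clear low-word bits so the conversion cannot round up
    // Values >= 2^63 become negative here (implementation-defined before
    // C++20, but this is what common two's-complement compilers do).
    const double d = static_cast<double>(static_cast<std::int64_t>(bb));
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);  // assumes a 64-bit IEEE 754 double
    const unsigned exponent = static_cast<unsigned>((bits >> 52) & 0x7FFu);
    const unsigned sign = static_cast<unsigned>(bits >> 63);
    return (exponent - 1023u) | (63u * sign);
}
```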

by **Gerd Isenberg** » 02 Aug 2005, 23:03

This looks already competitive:

Code: Select all
typedef unsigned __int64 BitBoard;

union BB
{
    BB(BitBoard b) {bb=b;}
    double getSignedDouble() {return (double)(__int64)bb;}
    BitBoard bb;
    struct {
        unsigned int lo;
        unsigned int hi;
    };
};

unsigned int bitScanReverse(BB bb)
{
    union {
        double d;
        struct {
            unsigned int mantissal : 32;
            unsigned int mantissah : 20;
            unsigned int exponent : 11;
            unsigned int sign : 1;
        };
    } ud;
    bb.lo &= ~bb.hi;
    ud.d = bb.getSignedDouble();
    unsigned int idx = (ud.exponent - 1023) | (63*ud.sign);
    // printf ("0x%08x%08x %5d %25.4f\n", bb.hi, bb.lo, idx, ud.d );
    return idx;
}

The assembly, with 6,2 cycle fild and fstp instructions, does not look that bad:

Code: Select all
?bitScanReverse@@YAITBB@@@Z PROC NEAR ; bitScanReverse
; File C:\Source\bitScan\bitScan.cpp
; Line 34
00000 8b 44 24 08 mov eax, DWORD PTR _bb$[esp]
00004 8b 4c 24 04 mov ecx, DWORD PTR _bb$[esp-4]
00008 f7 d0 not eax
0000a 23 c8 and ecx, eax
0000c 89 4c 24 04 mov DWORD PTR _bb$[esp-4], ecx
; Line 35
00010 df 6c 24 04 fild QWORD PTR _bb$[esp-4]
00014 dd 5c 24 04 fstp QWORD PTR _ud$[esp-4]
; Line 38
00018 8b 4c 24 08 mov ecx, DWORD PTR _ud$[esp]
0001c 8b c1 mov eax, ecx
0001e c1 e9 1f shr ecx, 31 ; 0000001fH
00021 c1 e8 14 shr eax, 20 ; 00000014H
00024 8b d1 mov edx, ecx
00026 25 ff 07 00 00 and eax, 2047 ; 000007ffH
0002b c1 e2 06 shl edx, 6
0002e 2d ff 03 00 00 sub eax, 1023 ; 000003ffH
00033 2b d1 sub edx, ecx
00035 0b c2 or eax, edx
; Line 39
00037 c3 ret 0
?bitScanReverse@@YAITBB@@@Z ENDP ; bitScanReverse

by **Anonymous** » 02 Aug 2005, 23:14

Gerd Isenberg wrote:
Code: Select all
typedef unsigned __int64 BitBoard;

// return index 0..63 of MSB
// -1023 if passing zero
unsigned int bitScanReverse(BitBoard bb)
{
    union {
        double d;
        struct {
            unsigned int mantissal : 32;
            unsigned int mantissah : 20;
            unsigned int exponent : 11;
            unsigned int sign : 1;
        };
    } ud;
    ud.d = (double)(bb & ~(bb >> 32));
    return ud.exponent - 1023;
}

Wow, very smart, Gerd! It took me a while to figure out how this works. I wonder if any other readers of this thread (there will be few readers who follow this thread that deep, I guess) figured out the bb & ~(bb >> 32) quickly.

I did not try it, but it really looks, as if it gets rid of the rounding/accuracy issue.
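One way to see why bb & ~(bb >> 32) helps, as a sketch assuming IEEE 754 doubles and the default round-to-nearest mode: a conversion can only carry into the exponent when the top 54 bits below and including the MSB are all ones. If the MSB sits at position m >= 32, the mask clears bit m-32 of the low word, which lies inside that 54-bit window, so the carry cannot happen, while the MSB itself is left alone. The helper names below are invented for the comparison.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Unbiased exponent of a double, read from its bit pattern.
inline int exponent_of(double d) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);  // assumes a 64-bit IEEE 754 double
    return static_cast<int>((bits >> 52) & 0x7FFu) - 1023;
}

// Plain conversion: may round up to the next power of two.
inline int msb_naive(std::uint64_t bb) {
    return exponent_of(static_cast<double>(bb));
}

// Conversion after the masking step from the code above.
inline int msb_masked(std::uint64_t bb) {
    return exponent_of(static_cast<double>(bb & ~(bb >> 32)));
}
```

For example, 2^54 - 1 (0x003FFFFFFFFFFFFF) has its MSB at bit 53, but the plain conversion rounds it to 2^54 and reports 54; the masked version converts 0x003FFFFFFFC00000 exactly and reports 53.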

Cheers,
Dieter

PS. Personally, I'd prefer using the same method while aliasing an unsigned char to the double, instead of the union with the bit-fields. Practically, it probably will not make a huge difference. Aliasing the unsigned char should be more portable. As you probably know, in C nothing is guaranteed when you store into one member of a union and read out another. Neither is it well defined how bit-fields are laid out. Aliasing to unsigned char is better defined: for example, it is guaranteed that unsigned char has no padding bits and a pure binary representation. So the only portability issues would be to assume IEEE floating-point representation, assume CHAR_BIT == 8, and take care of the endianness (and possibly the alignment of the double). When using the bit-fields inside the union, you have to assume more. Certainly, using bit-fields as you did above will look clearer.
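The unsigned char variant described in this PS might look roughly like the sketch below. It assumes IEEE 754 doubles and CHAR_BIT == 8 (both checked or asserted where possible) and detects the byte order at run time by probing the representation of 1.0, whose top byte is 0x3F. The function name bsr_bytes is hypothetical.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

static_assert(CHAR_BIT == 8, "sketch assumes 8-bit bytes");
static_assert(sizeof(double) == 8, "sketch assumes a 64-bit double");

// MSB index (0..63) of a non-zero 64-bit value, reading the exponent of the
// converted double through an unsigned char pointer.
inline unsigned bsr_bytes(std::uint64_t bb) {
    assert(bb != 0);
    const double one = 1.0;  // bit pattern 0x3FF0000000000000
    const unsigned char *probe = reinterpret_cast<const unsigned char *>(&one);
    const bool little = (probe[7] == 0x3F);  // top byte is last on little-endian

    const double d = static_cast<double>(bb & ~(bb >> 32));
    const unsigned char *p = reinterpret_cast<const unsigned char *>(&d);
    const unsigned top = little ? p[7] : p[0];   // sign + exponent bits 10..4
    const unsigned next = little ? p[6] : p[1];  // exponent bits 3..0 + mantissa
    const unsigned exponent = ((top & 0x7Fu) << 4) | (next >> 4);
    return exponent - 1023u;
}
```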

PS2: Many C environments (not only on x86) have a long double type that has enough precision to just cast the 64-bit integer to long double and inspect the exponent. Older versions of MSVC did have such a type (for newer versions double and long double are the same); the GCC environments I know for x86 also support such a type. On x86 in this context, there is basically no penalty for using the 80-bit floating-point type.
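A sketch of the long double idea from PS2. Whether long double really has a 64-bit significand is checked with LDBL_MANT_DIG from <cfloat>; where it does not (for example when long double is the same as double), the sketch falls back to the masked double conversion from earlier in the thread. The function name is made up here.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>
#include <cstdint>

// MSB index (0..63) of a non-zero 64-bit value.
inline int msb64_long_double(std::uint64_t bb) {
    assert(bb != 0);
#if LDBL_MANT_DIG >= 64
    // Every 64-bit integer is exactly representable, so no masking is needed.
    return std::ilogb(static_cast<long double>(bb));
#else
    // long double is too narrow here: mask first so the conversion to double
    // cannot round up to the next power of two.
    return std::ilogb(static_cast<double>(bb & ~(bb >> 32)));
#endif
}
```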

by **Anonymous** » 02 Aug 2005, 23:39

Gerd Isenberg wrote:This looks already competitive:

Code: Select all
?bitScanReverse@@YAITBB@@@Z PROC NEAR ; bitScanReverse
; File C:\Source\bitScan\bitScan.cpp
; Line 34
00000 8b 44 24 08 mov eax, DWORD PTR _bb$[esp]
00004 8b 4c 24 04 mov ecx, DWORD PTR _bb$[esp-4]
00008 f7 d0 not eax
0000a 23 c8 and ecx, eax
0000c 89 4c 24 04 mov DWORD PTR _bb$[esp-4], ecx
; Line 35
00010 df 6c 24 04 fild QWORD PTR _bb$[esp-4]
00014 dd 5c 24 04 fstp QWORD PTR _ud$[esp-4]
; Line 38
00018 8b 4c 24 08 mov ecx, DWORD PTR _ud$[esp]
0001c 8b c1 mov eax, ecx
0001e c1 e9 1f shr ecx, 31 ; 0000001fH
00021 c1 e8 14 shr eax, 20 ; 00000014H
00024 8b d1 mov edx, ecx
00026 25 ff 07 00 00 and eax, 2047 ; 000007ffH
0002b c1 e2 06 shl edx, 6
0002e 2d ff 03 00 00 sub eax, 1023 ; 000003ffH
00033 2b d1 sub edx, ecx
00035 0b c2 or eax, edx
; Line 39
00037 c3 ret 0
?bitScanReverse@@YAITBB@@@Z ENDP ; bitScanReverse

Agreed, looks very competitive. One can even hope that (in the future) the floating-point instructions can be executed in parallel with some integer instructions (with the help of a smart compiler). On x86, there will probably be too few registers available, however.

I should mention that your "*63" trick is again very smart!

Regards,
Dieter

by **Anonymous** » 04 Aug 2005, 07:28

Here are some test results I obtained on an Athlon XP 2500+ (Barton) @ 1.9 GHz:

Code: Select all
debruijn32 method
clock cycles: 14.2
result: 32505856

bsf32 method
clock cycles: 11.2
result: 32505856

debruijn64 method
clock cycles: 24.8
result: 33030144

bsf64 method 1
clock cycles: 22.2
result: 33030144

bsf64 method 2
clock cycles: 22.7
result: 33030144

double_conversion_msb64 method
clock cycles: 41.0
result: 33030144
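The "result" values can be cross-checked by hand, given that the test tables in the program below hold every single-bit pattern in turn. With 8 MB of 32-bit entries there are 2097152 values cycling through bit indices 0..31, so the sum is 65536 * 496 = 32505856; with 64-bit entries there are 1048576 values cycling through 0..63, giving 16384 * 2016 = 33030144. A small sketch of that arithmetic, with a helper name that is not part of the posted program:

```cpp
#include <cassert>
#include <cstdint>

// Sum of bit indices when `entries` table slots cycle through the single-bit
// patterns 1<<0 .. 1<<(bits-1); entries must be a multiple of bits.
inline std::uint64_t expected_checksum(std::uint64_t entries, unsigned bits) {
    assert(bits != 0 && entries % bits == 0);
    const std::uint64_t per_cycle = std::uint64_t(bits) * (bits - 1) / 2;
    return (entries / bits) * per_cycle;
}
```

Both reported sums match, which suggests that all six methods returned the same indices on this data.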

with this code:

Code: Select all
#include <iostream>
#include <cstdlib> // system()
using namespace std;
#include <windows.h>

typedef unsigned int uint;

typedef char int8;
typedef short int16;
typedef int int32;
typedef long long int64;

typedef unsigned char uint8;
typedef unsigned short uint16;
typedef unsigned int uint32;
typedef unsigned long long uint64;

class win_timer {
public:
    win_timer();
    void start();
    void stop();
    void reset();
    double reading();
    bool running();
private:
    LARGE_INTEGER time;
    double frequency;
    bool rn;
};

win_timer::win_timer() {
    LARGE_INTEGER f;
    QueryPerformanceFrequency(&f);
    frequency = f.QuadPart;
    rn = false; // initialize before stop() and reset() read it
    stop();
    reset();
}

void win_timer::start() {
    if(!rn) {
        rn = true;
        LARGE_INTEGER f;
        QueryPerformanceCounter(&f);
        time.QuadPart = f.QuadPart - time.QuadPart;
    }
}

void win_timer::stop() {
    if(rn) {
        LARGE_INTEGER f;
        QueryPerformanceCounter(&f);
        time.QuadPart = f.QuadPart - time.QuadPart;
        rn = false;
    }
}

void win_timer::reset() {
    if(rn) {
        LARGE_INTEGER f;
        QueryPerformanceCounter(&f);
        time = f;
    } else {
        time.QuadPart = 0;
    }
}

double win_timer::reading() {
    if(rn) {
        LARGE_INTEGER f;
        QueryPerformanceCounter(&f);
        return (f.QuadPart - time.QuadPart)/frequency;
    } else {
        return time.QuadPart/frequency;
    }
}

bool win_timer::running() {
    return rn;
}

#define debruijn32 0x077cb531UL

int debruijn32_index[32];

void debruijn32_init() {
    for(int i=0; i<32; i++)
        debruijn32_index[(debruijn32 << i) >> 27] = i;
}

int debruijn32_lsb(uint32 x) {
    return debruijn32_index[((x & -x) * debruijn32) >> 27];
}

/* de Bruijn 64*/
const int lsz64_tbl[64] = {
    0, 31, 4, 33, 60, 15, 12, 34,
    61, 25, 51, 10, 56, 20, 22, 35,
    62, 30, 3, 54, 52, 24, 42, 19,
    57, 29, 2, 44, 47, 28, 1, 36,
    63, 32, 59, 5, 6, 50, 55, 7,
    16, 53, 13, 41, 8, 43, 46, 17,
    26, 58, 49, 14, 11, 40, 9, 45,
    21, 48, 39, 23, 18, 38, 37, 27,
};

int debruijn64_lsb(uint64 bb) {
    const uint64 lsb = (bb & -int64(bb)) - 1;
    const uint32 foldedLSB = int32(lsb) ^ int32(lsb >> 32);
    return lsz64_tbl[(foldedLSB * 0x78291ACF) >> 26];
}

typedef uint64 BitBoard;

// return index 0..63 of MSB
// -1023 if passing zero
unsigned int bitScanReverse(BitBoard bb)
{
    union {
        double d;
        struct {
            unsigned int mantissal : 32;
            unsigned int mantissah : 20;
            unsigned int exponent : 11;
            unsigned int sign : 1;
        };
    } ud;
    ud.d = (double)(bb & ~(bb >> 32));
    return ud.exponent - 1023;
}

#define CLOCKSPEED 1900000000 //my processor runs at this many hertz
#define TEST_DATA_SIZE (8*1024*1024/4) //makes an 8 meg table of dwords
#define TEST_DATA_64_SIZE (TEST_DATA_SIZE/2)//makes an 8 meg table of qwords

uint32 test_data[TEST_DATA_SIZE];
uint64 test_data_q[TEST_DATA_64_SIZE];

int main(int argc, char *argv[])
{
    debruijn32_init();

    // fill the table with single-bit patterns 1<<0 .. 1<<31, repeated
    // (&& instead of the original comma, which ignored the first test)
    for(int a=0; a<TEST_DATA_SIZE;)
        for(int b=0; a<TEST_DATA_SIZE && b<32; a++,b++)
            test_data[a] = uint32(1)<<b;

    uint acc = 0;//accumulator
    win_timer timer;
    timer.start();
    for(int x=0; x<TEST_DATA_SIZE; x++) {
        acc += debruijn32_lsb(test_data[x]);
    }
    timer.stop();
    cout<<"debruijn32 method"<<endl
        <<"clock cycles: "<<(timer.reading() / TEST_DATA_SIZE * CLOCKSPEED)<<endl
        <<"result: "<<acc<<endl<<endl;

    acc = 0;
    timer.reset();
    timer.start();
    for(int x=0; x<TEST_DATA_SIZE; x++) {
        asm(
            "movl _test_data(,%0,4),%%edx \n\t" //load test_data[x] into edx
            "bsfl %%edx,%%eax \n\t" //perform bsf (bit scan forward)
            "addl %%eax,%1 \n\t" //add result to accumulator
            :
            :"r"(x),"m"(acc)
            :"%eax","%edx"
        );
    }
    timer.stop();
    cout<<"bsf32 method"<<endl
        <<"clock cycles: "<<(timer.reading() / TEST_DATA_SIZE * CLOCKSPEED)<<endl
        <<"result: "<<acc<<endl<<endl;

    // same fill for the 64-bit table: 1<<0 .. 1<<63, repeated
    for(int a=0; a<TEST_DATA_64_SIZE;)
        for(int b=0; a<TEST_DATA_64_SIZE && b<64; a++,b++)
            test_data_q[a] = uint64(1)<<b;

    acc = 0;
    timer.reset();
    timer.start();
    for(int x=0; x<TEST_DATA_64_SIZE; x++) {
        acc += debruijn64_lsb(test_data_q[x]);
    }
    timer.stop();
    cout<<"debruijn64 method"<<endl
        <<"clock cycles: "<<(timer.reading() / TEST_DATA_64_SIZE * CLOCKSPEED)<<endl
        <<"result: "<<acc<<endl<<endl;

    acc = 0;
    timer.reset();
    timer.start();
    for(int x=0; x<TEST_DATA_64_SIZE; x++) {
        asm("xorl %%edx,%%edx \n\t" //zero edx
            "xorl %%ecx,%%ecx \n\t" //zero ecx
            "orl _test_data_q(,%0,8),%%edx \n\t" //copy lower dword of test_data_q[x] to edx and set flags
            "jnz skippy \n\t" //if(edx != 0) go to skippy
            "movl _test_data_q+4(,%0,8),%%edx \n\t" //else{ copy upper dword of test_data_q[x] to edx
            "movl $32,%%ecx \n\t" //move 32 into ecx, will add this into result later }
            "skippy: \n\t"
            "bsfl %%edx,%%eax \n\t" //perform the bsf
            "addl %%ecx,%%eax \n\t" //add in ecx
            "addl %%eax,%1 \n\t" //add result to accumulator
            :
            :"r"(x),"m"(acc)
            :"%eax","%ecx","%edx"
        );
    }
    timer.stop();
    cout<<"bsf64 method 1"<<endl
        <<"clock cycles: "<<(timer.reading() / TEST_DATA_64_SIZE * CLOCKSPEED)<<endl
        <<"result: "<<acc<<endl<<endl;

    acc = 0;
    timer.reset();
    timer.start();
    for(int x=0; x<TEST_DATA_64_SIZE; x++) {
        asm("bsfl _test_data_q+4(,%0,8),%%eax \n\t" //bsf upper dword of test_data_q[x] into eax
            "addl $32,%%eax \n\t" //add 32 to eax
            "bsfl _test_data_q(,%0,8),%%eax \n\t" //bsf lower dword of test_data_q[x] into eax. If(lower dword of test_data_q[x] == 0) eax is not modified.
            "addl %%eax,%1 \n\t" //add result to accumulator
            :
            :"r"(x),"m"(acc)
            :"%eax"
        );
    }
    timer.stop();
    cout<<"bsf64 method 2"<<endl
        <<"clock cycles: "<<(timer.reading() / TEST_DATA_64_SIZE * CLOCKSPEED)<<endl
        <<"result: "<<acc<<endl<<endl;

    acc = 0;
    timer.reset();
    timer.start();
    for(int x=0; x<TEST_DATA_64_SIZE; x++) {
        acc += bitScanReverse(test_data_q[x]);
    }
    timer.stop();
    cout<<"double_conversion_msb64 method"<<endl
        <<"clock cycles: "<<(timer.reading() / TEST_DATA_64_SIZE * CLOCKSPEED)<<endl
        <<"result: "<<acc<<endl<<endl;

    //char asdf;
    //cin>>asdf;
    system("PAUSE");
    return 0;
}

cheers

by **Gerd Isenberg** » 05 Aug 2005, 21:03

Ochazuke wrote:Here are some test results I obtained on an Athlon XP 2500+ (Barton) @ 1.9 GHz:

Code: Select all
debruijn32 method
clock cycles: 14.2
result: 32505856

bsf32 method
clock cycles: 11.2
result: 32505856

debruijn64 method
clock cycles: 24.8
result: 33030144

bsf64 method 1
clock cycles: 22.2
result: 33030144

bsf64 method 2
clock cycles: 22.7
result: 33030144

double_conversion_msb64 method
clock cycles: 41.0
result: 33030144

Hi Ochazuke,

funny, on my AMD64 box, in 32-bit mode, with my cycle-guessing framework for MSC6, I get for the unsigned double_conversion_msb64

cycles by rdtsc = 64
cycles by loop = 65.296
time in ns = 29.680
foo 1017533129

That is much more than your 41.
With the signed double conversion I get

cycles by rdtsc = 38
cycles by loop = 38.500
time in ns = 17.500
foo 1017533129

The "dirty" bsr one takes

cycles by rdtsc = 23
cycles by loop = 28.182
time in ns = 12.810
foo 1017533129
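The "cycles by loop" and "time in ns" lines above are two views of the same measurement: the loop time per call multiplied by the clock rate that the test program below assumes (MYGHZ, 2.2e9 Hz). A tiny sketch of that conversion, with a helper name that does not appear in the posted code:

```cpp
#include <cassert>

// Convert a per-call time in nanoseconds into clock cycles at `ghz` GHz.
inline double ns_to_cycles(double ns, double ghz) {
    assert(ghz > 0.0);
    return ns * ghz;  // 1 ns at 1 GHz is exactly one cycle
}
```

For instance, 29.680 ns at 2.2 GHz is 65.296 cycles, 17.500 ns is 38.5 cycles, and 12.810 ns is 28.182 cycles, matching the three loop figures. The rdtsc figures come from a separate measurement of single calls, so they need not agree with the loop averages.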

Cheers,
Gerd

Code: Select all
#include <stdio.h>
#include <time.h>

typedef unsigned __int64 BitBoard;

unsigned int cycles;
unsigned int cpuidRDTSCcycles;

// measures the cpuid/rdtsc overhead three times, then starts the timer
__forceinline void startRDTSC()
{
    __asm
    {
        xor eax, eax
        cpuid
        rdtsc
        mov [cpuidRDTSCcycles], eax
        xor eax, eax
        cpuid
        rdtsc
        sub eax, [cpuidRDTSCcycles]
        mov [cpuidRDTSCcycles], eax

        xor eax, eax
        cpuid
        rdtsc
        mov [cpuidRDTSCcycles], eax
        xor eax, eax
        cpuid
        rdtsc
        sub eax, [cpuidRDTSCcycles]
        mov [cpuidRDTSCcycles], eax

        xor eax, eax
        cpuid
        rdtsc
        mov [cpuidRDTSCcycles], eax
        xor eax, eax
        cpuid
        rdtsc
        sub eax, [cpuidRDTSCcycles]
        mov [cpuidRDTSCcycles], eax

        xor eax, eax
        cpuid
        rdtsc
        mov [cycles], eax
    }
}

// stops the timer and subtracts the measured cpuid/rdtsc overhead
__forceinline void stopRDTSC()
{
    __asm
    {
        xor eax, eax
        cpuid
        rdtsc
        sub eax, [cycles]
        sub eax, [cpuidRDTSCcycles]
        mov [cycles], eax
    }
}

// define one with 1 - the others with 0
//========================
#define UNSIGNED_DOUBLE 1
#define SIGNED_DOUBLE 0
#define BSR 0

#if UNSIGNED_DOUBLE + SIGNED_DOUBLE + BSR != 1
#error (Define one with 1 - the others with 0)
#endif

#if UNSIGNED_DOUBLE == 1
__forceinline unsigned int bitScanReverse(BitBoard bb)
{
    union {
        double d;
        struct {
            unsigned int mantissal : 32;
            unsigned int mantissah : 20;
            unsigned int exponent : 11;
            unsigned int sign : 1;
        };
    } ud;
    ud.d = (double)(bb & ~(bb >> 32));
    return ud.exponent - 1023;
}
#endif

#if SIGNED_DOUBLE == 1
union BB
{
    BB(BitBoard b) {bb=b;}
    double getSignedDouble() {return (double)(__int64)bb;}
    BitBoard bb;
    struct {
        unsigned int lo;
        unsigned int hi;
    };
};

__forceinline unsigned int bitScanReverse(BB bb)
{
    union {
        double d;
        struct {
            unsigned int mantissal : 32;
            unsigned int mantissah : 20;
            unsigned int exponent : 11;
            unsigned int sign : 1;
        };
    } ud;
    bb.lo &= ~bb.hi;
    ud.d = bb.getSignedDouble();
    unsigned int idx = (ud.exponent - 1023) | (63*ud.sign);
    return idx;
}
#endif

#if BSR == 1
__forceinline unsigned int bitScanReverse(BitBoard bb)
{
    __asm
    {
        bsr eax,[bb]
        bsr eax,[bb+4]
        setnz dl
        shl dl, 5
        add al, dl
    }
}
#endif

#define MAX_ITERATIONS 100000000 // 10**8
#define MYGHZ (2.2e9)

void bitScanTest()
{
    clock_t start, stop;
    int i, foo = 0;
    static BitBoard test[8] =
    {
        0x80000f000f000000,
        0x0000000f04400000,
        0x2020020002000200,
        0x100010010f010010,
        0x010000f011111000,
        0x0022222220202000,
        0x0004040404040040,
        0x000090009009f000
    };

    for ( i = 0; i < 8; i++)
    {
        startRDTSC();
        foo += bitScanReverse(test[i]);
        stopRDTSC();
    }
    printf("cycles by rdtsc = %d\n", cycles);

    start = clock();
    for (i = 0; i < MAX_ITERATIONS; ++i)
        foo += bitScanReverse(test[i&7]);
    stop = clock();
    printf("cycles by loop = %.3f\n", (float)(stop - start) / CLOCKS_PER_SEC * MYGHZ / MAX_ITERATIONS);
    printf("time in ns = %.3f\n", (float)(stop - start) / CLOCKS_PER_SEC * 1e9 / MAX_ITERATIONS);
    printf("foo %d\n", foo);
}

int main(int argc, char* argv[])
{
    bitScanTest();
    return 0;
}
