Best BitBoard LSB funktion?

Programming Topics (Computer Chess) and technical aspects as test techniques, book building, program tuning etc

Moderator: Andres Valverde

Re: Best BitBoard LSB funktion?

Postby Gerd Isenberg » 01 Aug 2005, 21:29

Dieter B?r?ner wrote:
Gerd Isenberg wrote:Since leading zero count is also necessary to convert ints to floats/double there are also some more or less portable tricks to interprete the binary representation of a double, the base two exponent of a normalized mantissa.


Interesting thought. Especially in C99, it should be very portable for 32-bit integers. C99 has the ilogb() function (it was available since long on many systems). ilogb() should even get inlined by a good compiler. I think, it will not work on 64 bit integers, however. We can pretty much assume IEEE 754 floating point representation (and it can even be checked), but we cannot assume, that we have a floating point type, that has enough accuracy for 64 bit numbers.

Cheers,
Dieter


Hi Dieter,

didn't you once post some c-code in CCC where 64-bit x &-x isolated LSB was converted to a double, while an unsigned char pointer was alialising that double, interpreting the exponent (and sign bit)?

Since we are only interested in MSB some rounding toward zero or down

AMD64 Architecture Programmer?s Manual
Volume 1: Application Programming
Page 158 4.4.9 Floating-Point Rounding


affects only the lowest bit of the normalzed mantissa without any overflow. I guess even a float convert is exact enough to determine the MSB by the exponent that way.

But if i look for appropriate sse or 3dnow instructions, converting 64-bit int, there seems only CVTSI2SD - with 11 cycles double dispatch not particular fast, and - converting to scalar 4 byte float with CVTSI2SS is even 14 cycles vector path.

There are faster SIMD instructions, like CVTDQ2PS (5 cycles double dispatch), converting vectors of four 32-bit ints to four 32-bit floats - but i fear it is not worth, also due to the hussle with none default rounding control Bits 14?13 of the MXCSR control and status register.

I once tried LSB by 3DNow/mmx-version via PI2FD which is only 4 cycles direct path - not that bad if you like to bitscan in mmx ;-)

Gerd
Gerd Isenberg
 
Posts: 285
Joined: 31 Jan 2005, 20:31
Location: Hattingen, Germany

Re: Best BitBoard LSB funktion?

Postby Anonymous » 02 Aug 2005, 19:36

Hi Gerd,

> didn't you once post some c-code in CCC where 64-bit x &-x isolated
> LSB was converted to a double, while an unsigned char pointer was
> alialising that double, interpreting the exponent (and sign bit)?

Sounds rather possible. But this case is much easier. And it is even almost portable (when we assume IEEE fp, which is the case for any modern system I know, and when we take care of the endianess). It can be coded in pure C.

> Since we are only interested in MSB some rounding toward zero or down

Correct. Indeed with rounding mode set to round down, it will work even in float precision. But there is no way, to set the rounding mode in a "more or less portable [and efficient] way". (with C99, it is possible in principle). It is also clear, that it would be very inefficient to set and reset the rounding mode before and after every call to the find_msb() function. So you would need to set the rounding mode once, and not reset it afterwards. But could you be sure, that the compiler will leave it that way? For example, a compiler might change the rounding mode when casting an integer to a float. Then it might reset the rounding mode again to the default rounding mode (not to the rounding mode, you had set). It might even be possible, that some library function (or some compiler generated code in general) depends on the rounding mode to be set to default, and might bug, when it is not. Even when tests would show, that it works, you couldn't really be sure, that it will still work after some totally unrelated code changes.

With x86, you also need a branch, to load an unsigned 64-bit integer to a floating point register (there is only an instruction to load a signed 64-bit integer). So, all in all, does not sound too attractive to me.

I cannot comment on the SIMD, 3dnow, ... tricks.

Cheers,
Dieter
Anonymous
 

Re: Best BitBoard LSB funktion?

Postby Anonymous » 02 Aug 2005, 19:45

Dann Corbit wrote:3 EXAMPLE An example of undefined behavior is the behavior on integer overflow."

But it is just an example.

From section "J.2 Undefined behavior" we do have this:
"? The value of the result of an integer arithmetic or conversion function cannot be represented (7.8.2.1, 7.8.2.2, 7.8.2.3, 7.8.2.4, 7.20.6.1, 7.20.6.2, 7.20.1)."



Thanks, Dann! This was exactly, what I was looking for (and what I expected). It means that even

Code: Select all
long nodes;

int absearch()
{
  nodes++;
  [...]
}


can yield in undefined behaviour (when calculating enough nodes).

Regards,
Dieter
Anonymous
 

Re: Best BitBoard LSB funktion?

Postby Dann Corbit » 02 Aug 2005, 19:50

Dieter B?r?ner wrote:
Dann Corbit wrote:3 EXAMPLE An example of undefined behavior is the behavior on integer overflow."

But it is just an example.

From section "J.2 Undefined behavior" we do have this:
"? The value of the result of an integer arithmetic or conversion function cannot be represented (7.8.2.1, 7.8.2.2, 7.8.2.3, 7.8.2.4, 7.20.6.1, 7.20.6.2, 7.20.1)."



Thanks, Dann! This was exactly, what I was looking for (and what I expected). It means that even

Code: Select all
long nodes;

int absearch()
{
  nodes++;
  [...]
}


can yield in undefined behaviour (when calculating enough nodes).

Regards,
Dieter


Yes. and conversely:

Code: Select all
unsigned long nodes;

int absearch()
{
  nodes++; /* This cannot cause undefined behavior */
  /*... */
}

Dann Corbit
 

Re: Best BitBoard LSB funktion?

Postby Gerd Isenberg » 02 Aug 2005, 20:49

Dieter B?r?ner wrote:Hi Gerd,

> didn't you once post some c-code in CCC where 64-bit x &-x isolated
> LSB was converted to a double, while an unsigned char pointer was
> alialising that double, interpreting the exponent (and sign bit)?

Sounds rather possible. But this case is much easier. And it is even almost portable (when we assume IEEE fp, which is the case for any modern system I know, and when we take care of the endianess). It can be coded in pure C.

> Since we are only interested in MSB some rounding toward zero or down

Correct. Indeed with rounding mode set to round down, it will work even in float precision. But there is no way, to set the rounding mode in a "more or less portable [and efficient] way". (with C99, it is possible in principle). It is also clear, that it would be very inefficient to set and reset the rounding mode before and after every call to the find_msb() function. So you would need to set the rounding mode once, and not reset it afterwards. But could you be sure, that the compiler will leave it that way? For example, a compiler might change the rounding mode when casting an integer to a float. Then it might reset the rounding mode again to the default rounding mode (not to the rounding mode, you had set). It might even be possible, that some library function (or some compiler generated code in general) depends on the rounding mode to be set to default, and might bug, when it is not. Even when tests would show, that it works, you couldn't really be sure, that it will still work after some totally unrelated code changes.

With x86, you also need a branch, to load an unsigned 64-bit integer to a floating point register (there is only an instruction to load a signed 64-bit integer). So, all in all, does not sound too attractive to me.

I cannot comment on the SIMD, 3dnow, ... tricks.

Cheers,
Dieter



Hi Dieter,

yes, to avoid further rounding issues one may reset MSB>>32.
This one works with msc for my box, but i fear on x87-platforms it is much too slow.

Gerd

Code: Select all
typedef unsigned __int64 BitBoard;

// return index 0..63 of MSB
//          -1023 if passing zero

unsigned int bitScanReverse(BitBoard bb)
{
   union {
      double d;
      struct {
         unsigned int mantissal : 32;      
         unsigned int mantissah : 20;      
         unsigned int exponent : 11;      
         unsigned int sign : 1;      
      };
   } ud;
   ud.d = (double)(bb & ~(bb >> 32));
   return ud.exponent - 1023;
}


a bit faster looks a signed conversion - if one inspects the assembly.

Code: Select all
unsigned int bitScanReverse(BitBoard bb)
{
   union {
      double d;
      struct {
         unsigned int mantissal : 32;      
         unsigned int mantissah : 20;      
         unsigned int exponent : 11;      
         unsigned int sign : 1;      
      };
   } ud;
   ud.d = (double)(__int64)(bb & ~(bb >> 32));
   unsigned int idx = (ud.exponent - 1023) | (63*ud.sign);
// printf ("0x%08x%08x %5d %25.4f\n", (unsigned int)(bb>>32), (unsigned int)bb, idx, ud.d );
   return idx;
}

Gerd Isenberg
 
Posts: 285
Joined: 31 Jan 2005, 20:31
Location: Hattingen, Germany

Re: Best BitBoard LSB funktion?

Postby Gerd Isenberg » 02 Aug 2005, 23:03

This looks already competitive:

Code: Select all
typedef unsigned __int64 BitBoard;

union BB
{
   BB(BitBoard b) {bb=b;}
   double getSignedDouble() {return (double)(__int64)bb;}

   BitBoard bb;
   struct {
      unsigned int lo;
      unsigned int hi;
   };
};

unsigned int bitScanReverse(BB bb)
{
   union {
      double d;
      struct {
         unsigned int mantissal : 32;      
         unsigned int mantissah : 20;      
         unsigned int exponent : 11;      
         unsigned int sign : 1;      
      };
   } ud;
   bb.lo &= ~bb.hi;
   ud.d = bb.getSignedDouble();
   unsigned int idx = (ud.exponent - 1023) | (63*ud.sign);
//   printf ("0x%08x%08x %5d %25.4f\n", bb.hi, bb.lo, idx, ud.d );
   return idx;
}


the assembly with 6,2 cycle fild and fstp instructions looks not that bad:

Code: Select all
?bitScanReverse@@YAITBB@@@Z PROC NEAR         ; bitScanReverse
; File C:\Source\bitScan\bitScan.cpp
; Line 34
  00000   8b 44 24 08    mov    eax, DWORD PTR _bb$[esp]
  00004   8b 4c 24 04    mov    ecx, DWORD PTR _bb$[esp-4]
  00008   f7 d0       not    eax
  0000a   23 c8       and    ecx, eax
  0000c   89 4c 24 04    mov    DWORD PTR _bb$[esp-4], ecx
; Line 35
  00010   df 6c 24 04    fild    QWORD PTR _bb$[esp-4]
  00014   dd 5c 24 04    fstp    QWORD PTR _ud$[esp-4]
; Line 38
  00018   8b 4c 24 08    mov    ecx, DWORD PTR _ud$[esp]
  0001c   8b c1       mov    eax, ecx
  0001e   c1 e9 1f    shr    ecx, 31         ; 0000001fH
  00021   c1 e8 14    shr    eax, 20         ; 00000014H
  00024   8b d1       mov    edx, ecx
  00026   25 ff 07 00 00    and    eax, 2047      ; 000007ffH
  0002b   c1 e2 06    shl    edx, 6
  0002e   2d ff 03 00 00    sub    eax, 1023      ; 000003ffH
  00033   2b d1       sub    edx, ecx
  00035   0b c2       or    eax, edx
; Line 39
  00037   c3       ret    0
?bitScanReverse@@YAITBB@@@Z ENDP         ; bitScanReverse
Gerd Isenberg
 
Posts: 285
Joined: 31 Jan 2005, 20:31
Location: Hattingen, Germany

Re: Best BitBoard LSB funktion?

Postby Anonymous » 02 Aug 2005, 23:14

Gerd Isenberg wrote:
Code: Select all
typedef unsigned __int64 BitBoard;

// return index 0..63 of MSB
//          -1023 if passing zero

unsigned int bitScanReverse(BitBoard bb)
{
   union {
      double d;
      struct {
         unsigned int mantissal : 32;      
         unsigned int mantissah : 20;      
         unsigned int exponent : 11;      
         unsigned int sign : 1;      
      };
   } ud;
   ud.d = (double)(bb & ~(bb >> 32));
   return ud.exponent - 1023;
}



Wow, very smart, Gerd! I took me a while, to figure out, how this works. I wonder if any other readers of this read (there will be few readers that follow this thread that deep, I guess) did figure out the bb & ~(bb >> 32) fast.

I did not try it, but it really looks, as if it gets rid of the rounding/accuracy issue.

Cheers,
Dieter

PS. Personally, I'd prefer using the same method, while aliasing unsigned char to the double, instead of the union with the bitfields. Practically, it probably will not make a huge difference. Aliasing the unsigned char should be more portable. As you probably know, in C nothing is guaranteed, when you put in one field into a union, and take out another field. Neither is it well defined, how bitfiels are layed out. Aliasing to unsigned char is better defined: for example, it is guaranteed, that unsigned char has no padding bits and a pure binary representation. So, the only portability issue would be, to assume IEEE fp representation, assume CHAR_BITS == 8, and take care of the endianess (and possibly the alignement of the double). When using the bitfields inside the union, you have to assume more. Certainly, using bitfields as you did above, will look clearer.

PS2: Many C environments (not only on x86) have a long double type, that has enough accuracy, to just cast the 64 bit integer to long double and inspect the exponent. Older versions of MSVC did have such a type (for newer versions double and long double are the same), also Gcc environments I know for x86 do support such a type. On x86 in this context, there is basically no penalty, to use 80 bit floating point type.
Anonymous
 

Re: Best BitBoard LSB funktion?

Postby Anonymous » 02 Aug 2005, 23:39

Gerd Isenberg wrote:This looks already competitive:

[code]
?bitScanReverse@@YAITBB@@@Z PROC NEAR ; bitScanReverse
; File C:\Source\bitScan\bitScan.cpp
; Line 34
00000 8b 44 24 08 mov eax, DWORD PTR _bb$[esp]
00004 8b 4c 24 04 mov ecx, DWORD PTR _bb$[esp-4]
00008 f7 d0 not eax
0000a 23 c8 and ecx, eax
0000c 89 4c 24 04 mov DWORD PTR _bb$[esp-4], ecx
; Line 35
00010 df 6c 24 04 fild QWORD PTR _bb$[esp-4]
00014 dd 5c 24 04 fstp QWORD PTR _ud$[esp-4]
; Line 38
00018 8b 4c 24 08 mov ecx, DWORD PTR _ud$[esp]
0001c 8b c1 mov eax, ecx
0001e c1 e9 1f shr ecx, 31 ; 0000001fH
00021 c1 e8 14 shr eax, 20 ; 00000014H
00024 8b d1 mov edx, ecx
00026 25 ff 07 00 00 and eax, 2047 ; 000007ffH
0002b c1 e2 06 shl edx, 6
0002e 2d ff 03 00 00 sub eax, 1023 ; 000003ffH
00033 2b d1 sub edx, ecx
00035 0b c2 or eax, edx
; Line 39
00037 c3 ret 0
?bitScanReverse@@YAITBB@@@Z ENDP ; bitScanReverse


Agreed, looks very competitive. One can even hope, that (in the future) the floating point instructions can be calculated in parallel to some integer instructions (with the help of some smart compiler). On x86, there will probably be too few registers available, however.

I should mention, that your "*63" trick again is very smart!

Regards,
Dieter
Anonymous
 

some test results

Postby Anonymous » 04 Aug 2005, 07:28

Here are some test results I conducted on an athlon xp 2500+ barton @ 1.9G:

Code: Select all
debruijn32 method
clock cycles: 14.2
result: 32505856

bsf32 method
clock cycles: 11.2
result: 32505856

debruijn64 method
clock cycles: 24.8
result: 33030144

bsf64 method 1
clock cycles: 22.2
result: 33030144

bsf64 method 2
clock cycles: 22.7
result: 33030144

double_conversion_msb64 method
clock cycles: 41.0
result: 33030144


with this code:

Code: Select all
#include <iostream>
using namespace std;
#include <windows.h>


typedef unsigned int uint;

typedef char        int8;
typedef short       int16;
typedef int         int32;
typedef long long   int64;

typedef unsigned char       uint8;
typedef unsigned short      uint16;
typedef unsigned int        uint32;
typedef unsigned long long  uint64;


class win_timer
{
   public:
      win_timer();
        void start();
      void stop();
      void reset();
      double reading();
      bool running();
      
   private:
        LARGE_INTEGER time;
        double frequency;
        bool rn;
};

win_timer::win_timer()
{
    LARGE_INTEGER f;
    QueryPerformanceFrequency(&f);
    frequency = f.QuadPart;
    stop();
    reset();
}

void win_timer::start()
{
    if(!rn)
    {
        rn = true;
        LARGE_INTEGER f;
        QueryPerformanceCounter(&f);
        time.QuadPart = f.QuadPart - time.QuadPart;
    }
}

void win_timer::stop()
{
    if(rn)
    {
        LARGE_INTEGER f;
        QueryPerformanceCounter(&f);
        time.QuadPart = f.QuadPart - time.QuadPart;
        rn = false;
    }
}

void win_timer::reset()
{
    if(rn)
    {
        LARGE_INTEGER f;
        QueryPerformanceCounter(&f);
        time = f;
    }
    else
    {
        time.QuadPart = 0;
    }
}

double win_timer::reading()
{
    if(rn)
    {
        LARGE_INTEGER f;
        QueryPerformanceCounter(&f);
        return (f.QuadPart - time.QuadPart)/frequency;
    }
    else
    {
        return time.QuadPart/frequency;
    }
}

bool win_timer::running()
{
    return rn;
}


#define debruijn32 0x077cb531UL
int debruijn32_index[32];

void debruijn32_init()
{
    for(int i=0; i<32; i++)
        debruijn32_index[(debruijn32 << i) >> 27] = i;
}

int debruijn32_lsb(uint32 x)
{
    return debruijn32_index[((x & -x) * debruijn32) >> 27];
}


/* de Bruijn 64*/
const int lsz64_tbl[64] =
{
    0, 31,  4, 33, 60, 15, 12, 34,
   61, 25, 51, 10, 56, 20, 22, 35,
   62, 30,  3, 54, 52, 24, 42, 19,
   57, 29,  2, 44, 47, 28,  1, 36,
   63, 32, 59,  5,  6, 50, 55,  7,
   16, 53, 13, 41,  8, 43, 46, 17,
   26, 58, 49, 14, 11, 40,  9, 45,
   21, 48, 39, 23, 18, 38, 37, 27,
};

int debruijn64_lsb(uint64 bb)
{
   const uint64 lsb = (bb & -int64(bb)) - 1;
   const uint32 foldedLSB = int32(lsb) ^ int32(lsb >> 32);
   return lsz64_tbl[(foldedLSB * 0x78291ACF) >> 26];
}


typedef uint64 BitBoard;

// return index 0..63 of MSB
//          -1023 if passing zero

unsigned int bitScanReverse(BitBoard bb)
{
   union {
      double d;
      struct {
         unsigned int mantissal : 32;     
         unsigned int mantissah : 20;     
         unsigned int exponent : 11;     
         unsigned int sign : 1;     
      };
   } ud;
   ud.d = (double)(bb & ~(bb >> 32));
   return ud.exponent - 1023;
}


#define CLOCKSPEED 1900000000 //my processor runs at this many hertz
#define TEST_DATA_SIZE (8*1024*1024/4) //makes an 8 meg table of dwords
#define TEST_DATA_64_SIZE (TEST_DATA_SIZE/2)//makes an 8 meg table of qwords

uint32 test_data[TEST_DATA_SIZE];
uint64 test_data_q[TEST_DATA_64_SIZE];


int main(int argc, char *argv[])
{
    debruijn32_init();
    for(int a=0; a<TEST_DATA_SIZE;)
        for(int b=0 ;a<TEST_DATA_SIZE,b<32; a++,b++)
            test_data[a] = uint32(1)<<b;
   
    uint acc = 0;//accumulator
    win_timer timer;
   
    timer.start();
    for(int x=0; x<TEST_DATA_SIZE; x++)
    {
        acc += debruijn32_lsb(test_data[x]);
    }
    timer.stop();
    cout<<"debruijn32 method"<<endl
        <<"clock cycles: "<<(timer.reading() / TEST_DATA_SIZE * CLOCKSPEED)<<endl
        <<"result: "<<acc<<endl<<endl;
   
    acc = 0;
    timer.reset();
    timer.start();
    for(int x=0; x<TEST_DATA_SIZE; x++)
    {
        asm(
            "movl _test_data(,%0,4),%%edx   \n\t" //load test_data[x] into edx
            "bsfl %%edx,%%eax               \n\t" //perform bsf (bit scan forward)
            "addl %%eax,%1                  \n\t" //add result to accumulator
            :
            :"r"(x),"m"(acc)
            :"%eax","%edx"
        );
    }
    timer.stop();
    cout<<"bsf32 method"<<endl
        <<"clock cycles: "<<(timer.reading() / TEST_DATA_SIZE * CLOCKSPEED)<<endl
        <<"result: "<<acc<<endl<<endl;
       
    for(int a=0; a<TEST_DATA_64_SIZE;)
        for(int b=0 ;a<TEST_DATA_64_SIZE,b<64; a++,b++)
            test_data_q[a] = uint64(1)<<b;
           
    acc = 0;
    timer.reset();
    timer.start();
    for(int x=0; x<TEST_DATA_64_SIZE; x++)
    {
        acc += debruijn64_lsb(test_data_q[x]);
    }
    timer.stop();
    cout<<"debruijn64 method"<<endl
        <<"clock cycles: "<<(timer.reading() / TEST_DATA_64_SIZE * CLOCKSPEED)<<endl
        <<"result: "<<acc<<endl<<endl;
       
    acc = 0;
    timer.reset();
    timer.start();
    for(int x=0; x<TEST_DATA_64_SIZE; x++)
    {
        asm("xorl %%edx,%%edx                       \n\t" //zero edx
            "xorl %%ecx,%%ecx                       \n\t" //zero ecx
            "orl _test_data_q(,%0,8),%%edx          \n\t" //copy lower dword of test_data_q[x] to edx and set flags
            "jnz skippy                             \n\t" //if(edx != 0) go to skippy
                "movl _test_data_q+4(,%0,8),%%edx   \n\t" //else{ copy upper dword of test_data_q[x] to edx
                "movl $32,%%ecx                     \n\t" //move 32 into ecx, will add this into result later }
            "skippy:                                \n\t"
            "bsfl %%edx,%%eax                       \n\t" //perform the bsf
            "addl %%ecx,%%eax                       \n\t" //add in ecx
            "addl %%eax,%1                          \n\t" //add result to accumulator
            :
            :"r"(x),"m"(acc)
            :"%eax","%ecx","%edx"
        );
    }
    timer.stop();
    cout<<"bsf64 method 1"<<endl
        <<"clock cycles: "<<(timer.reading() / TEST_DATA_64_SIZE * CLOCKSPEED)<<endl
        <<"result: "<<acc<<endl<<endl;
       
    acc = 0;
    timer.reset();
    timer.start();
    for(int x=0; x<TEST_DATA_64_SIZE; x++)
    {
        asm("bsfl _test_data_q+4(,%0,8),%%eax   \n\t" //bsf upper dword of test_data_q[x] into eax
            "addl $32,%%eax                     \n\t" //add 32 to eax
            "bsfl _test_data_q(,%0,8),%%eax     \n\t" //bsf lower dword of test_data_q[x] into eax. If(lower dword of test_data_q[x] == 0) eax is not modified.
            "addl %%eax,%1                      \n\t" //add result to accumulator
            :
            :"r"(x),"m"(acc)
            :"%eax"
        );
    }
    timer.stop();
    cout<<"bsf64 method 2"<<endl
        <<"clock cycles: "<<(timer.reading() / TEST_DATA_64_SIZE * CLOCKSPEED)<<endl
        <<"result: "<<acc<<endl<<endl;
       
    acc = 0;
    timer.reset();
    timer.start();
    for(int x=0; x<TEST_DATA_64_SIZE; x++)
    {
        acc += bitScanReverse(test_data_q[x]);
    }
    timer.stop();
    cout<<"double_conversion_msb64 method"<<endl
        <<"clock cycles: "<<(timer.reading() / TEST_DATA_64_SIZE * CLOCKSPEED)<<endl
        <<"result: "<<acc<<endl<<endl;

    //char asdf;
    //cin>>asdf;
    system("PAUSE");
    return 0;
}


cheers :mrgreen:
Anonymous
 

Re: some test results

Postby Gerd Isenberg » 05 Aug 2005, 21:03

[quote="Ochazuke"]Here are some test results I conducted on an athlon xp 2500+ barton @ 1.9G:

Code: Select all
debruijn32 method
clock cycles: 14.2
result: 32505856

bsf32 method
clock cycles: 11.2
result: 32505856

debruijn64 method
clock cycles: 24.8
result: 33030144

bsf64 method 1
clock cycles: 22.2
result: 33030144

bsf64 method 2
clock cycles: 22.7
result: 33030144

double_conversion_msb64 method
clock cycles: 41.0
result: 33030144


Hi Ochazuke,

funny, with my amd64 box, 32 bit mode, and my cycle guess framework for msc6 i get for the unsigned double_conversion_msb64

cycles by rdtsc = 64
cycles by loop = 65.296
time in ns = 29.680
foo 1017533129

Which is much more than your 41.
With the signed double conversion i get

cycles by rdtsc = 38
cycles by loop = 38.500
time in ns = 17.500
foo 1017533129

The "dirty" bsr one takes

cycles by rdtsc = 23
cycles by loop = 28.182
time in ns = 12.810
foo 1017533129

Cheers,
Gerd

Code: Select all
#include <stdio.h>
#include <time.h>

typedef unsigned __int64 BitBoard;

unsigned int cycles;
unsigned int cpuidRDTSCcycles;

__forceinline void startRDTSC()
{
   __asm
   {
      xor      eax, eax
      cpuid
      rdtsc
      mov [cpuidRDTSCcycles], eax
      xor      eax, eax
      cpuid
      rdtsc
      sub eax, [cpuidRDTSCcycles]
      mov [cpuidRDTSCcycles], eax

      xor      eax, eax
      cpuid
      rdtsc
      mov [cpuidRDTSCcycles], eax
      xor      eax, eax
      cpuid
      rdtsc
      sub eax, [cpuidRDTSCcycles]
      mov [cpuidRDTSCcycles], eax

      xor      eax, eax
      cpuid
      rdtsc
      mov [cpuidRDTSCcycles], eax
      xor      eax, eax
      cpuid
      rdtsc
      sub eax, [cpuidRDTSCcycles]
      mov [cpuidRDTSCcycles], eax

      xor      eax, eax
      cpuid
      rdtsc
      mov [cycles], eax
   }
}


__forceinline void stopRDTSC()
{
   __asm
   {
      xor      eax, eax
      cpuid
      rdtsc
      sub eax, [cycles]
      sub eax, [cpuidRDTSCcycles]
      mov [cycles], eax
   }
}

// define one with 1 - the others with 0
//========================
#define UNSIGNED_DOUBLE 1
#define SIGNED_DOUBLE 0
#define BSR 0

#if UNSIGNED_DOUBLE + SIGNED_DOUBLE + BSR != 1
#error (Define one with 1 - the others with 0)
#endif

#if UNSIGNED_DOUBLE == 1
__forceinline
unsigned int bitScanReverse(BitBoard bb)
{
   union {
      double d;
      struct {
         unsigned int mantissal : 32;     
         unsigned int mantissah : 20;     
         unsigned int exponent : 11;     
         unsigned int sign : 1;     
      };
   } ud;
   ud.d = (double)(bb & ~(bb >> 32));
   return ud.exponent - 1023;
}
#endif

#if SIGNED_DOUBLE == 1
union BB
{
   BB(BitBoard b) {bb=b;}
   double getSignedDouble() {return (double)(__int64)bb;}

   BitBoard bb;
   struct {
      unsigned int lo;
      unsigned int hi;
   };
};

__forceinline
unsigned int bitScanReverse(BB bb)
{
   union {
      double d;
      struct {
         unsigned int mantissal : 32;      
         unsigned int mantissah : 20;      
         unsigned int exponent : 11;      
         unsigned int sign : 1;      
      };
   } ud;
   bb.lo &= ~bb.hi;
   ud.d = bb.getSignedDouble();
   unsigned int idx = (ud.exponent - 1023) | (63*ud.sign);
   return idx;
}
#endif

#if BSR == 1
__forceinline
unsigned int bitScanReverse(BitBoard bb)
{
   __asm
   {
      bsr      eax,[bb]
      bsr      eax,[bb+4]
      setnz   dl
      shl      dl, 5
      add      al, dl
   }
}
#endif


#define MAX_ITERATIONS 100000000 // 10**8
#define MYGHZ (2.2e9)

void bitScanTest()
{
   clock_t start, stop;
   int i, foo = 0;

   static BitBoard test[8] = {
      0x80000f000f000000,
      0x0000000f04400000,
      0x2020020002000200,
      0x100010010f010010,
      0x010000f011111000,
      0x0022222220202000,
      0x0004040404040040,
      0x000090009009f000
   };
   for ( i = 0; i < 8; i++) {
      startRDTSC();
      foo += bitScanReverse(test[i]);
      stopRDTSC();
   }
   printf("cycles by rdtsc = %d\n", cycles);
   start = clock();
   for (i = 0; i < MAX_ITERATIONS; ++i)
      foo += bitScanReverse(test[i&7]);
   stop = clock();
   printf("cycles by loop  = %.3f\n", (float)(stop - start) / CLOCKS_PER_SEC * MYGHZ / MAX_ITERATIONS);
   printf("time in ns      = %.3f\n", (float)(stop - start) / CLOCKS_PER_SEC *   1e9 / MAX_ITERATIONS);
   printf("foo %d\n", foo);
}


int main(int argc, char* argv[])
{
   bitScanTest();
   return 0;
}

Gerd Isenberg
 
Posts: 285
Joined: 31 Jan 2005, 20:31
Location: Hattingen, Germany

Previous

Return to Programming and Technical Discussions

Who is online

Users browsing this forum: No registered users and 20 guests