Winboard Forum

by **Reinhard Scharnagl** » 22 Jul 2005, 08:55

Here comes a comparision to your last big table using procedure, which will be clearly in front:

Code: Select all: Initialized rand Table: 1235 ms Average population = 50.0% popCount2: 2171 ms (320017476) popCount: 1657 ms (320017476) Initialized rand Table: 2782 ms Average population = 25.0% popCount2: 2156 ms (160006818) popCount: 1484 ms (160006818) Initialized rand Table: 2609 ms Average population = 12.5% popCount2: 2172 ms (80002405) popCount: 1219 ms (80002405) Initialized rand Table: 2344 ms Average population = 6.3% popCount2: 2172 ms (40733263) popCount: 1062 ms (40733263) Initialized rand Table: 2187 ms Average population = 3.1% popCount2: 2172 ms (20209492) popCount: 985 ms (20209492) Initialized rand Table: 2125 ms Average population = 1.6% popCount2: 2156 ms (10452148) popCount: 969 ms (10452148)

Reinhard.

P.S.: the dependance to the stage of population seems to be a cpu caching effect. Thus it might be worse in an engine doing more than this test.

by **Leen Ammeraal** » 22 Jul 2005, 13:52

Thanks, Reinhard, for your tests with various bit densities.
This was particularly useful to me with regard to your routines
bitScan and bitScan2. On the basis of your earlier postings, I had
assumed that bitScan2 was always better than bitScan, and when
I found that bitScan2 made my engine (slightly) faster, I adopted it.
However, after reading your last postings and suspecting that my
engine mainly deals with sparsely populated bitboards,
it seemed worthwhile to me to test bitScan as
well, and this gave indeed another small improvement. To give
you an impression how little this improvement is, let me
give you the times on my computer to reach depth 13 for
wac163. Since my timing measurements always give different
results, I did five tests for each of the two bit scan routines.
Here are the times in ms, as reported by epd2wb:
With bitScan2: 25120, 25150, 25210, 25230, 25100;
with bitScan: 25030, 25060, 25120, 25040, 25040.
Apparently, most of the time my engine is not performing
bitscan operations, otherwise the difference would have
been more interesting. Nevertheless, I will now use your
routine bitScan instead of bitScan2, hoping that this way the ELO
rating will improve by about 1 point -:).
Leen

by **Reinhard Scharnagl** » 22 Jul 2005, 14:00

Hello Leen,

probably the method you seem to reference is not from me. It seems to be a de Bruijn related solution posted here some time ago.

Reinhard.

by **Leen Ammeraal** » 22 Jul 2005, 17:34

Reinhard Scharnagl wrote:Hello Leen,

probably the method you seem to reference is not from me. It seems to be a de Bruijn related solution posted here some time ago.

Reinhard.

Hello Reinhard,
Yes, I am referring to this function:

Code: Select all: /* de Bruijn */ const int lsz64_tbl[64] = { 0, 31, 4, 33, 60, 15, 12, 34, 61, 25, 51, 10, 56, 20, 22, 35, 62, 30, 3, 54, 52, 24, 42, 19, 57, 29, 2, 44, 47, 28, 1, 36, 63, 32, 59, 5, 6, 50, 55, 7, 16, 53, 13, 41, 8, 43, 46, 17, 26, 58, 49, 14, 11, 40, 9, 45, 21, 48, 39, 23, 18, 38, 37, 27, }; int bitScan (const U64 bb) { const U64 lsb = (bb & -(long long) bb) - 1; const unsigned int foldedLSB = ((int) lsb) ^ ((int) (lsb >> 32)); return lsz64_tbl[foldedLSB * 0x78291ACF >> 26]; }

I hope you don't mind that (in view of better performance for
sparsely populated bitboards) I prefer it even though you did not
write it yourself! I would probably have used your function bitScan2
if on average there had been more 1-bits in my bitboards.
Leen

by **Reinhard Scharnagl** » 22 Jul 2005, 17:46

Leen,

maybe the source is from http://my.fit.edu/~mtaylor/tzc.html, and the correct name could be M. Taylor (?) using De Bruijin (wrong spelled in code).

I have tried to apply that principle to splitted data, but it has been not that fast, but faster than bitScan2() overall. Nevertheless it could be useful at processors with no fast 32 Bit multiplication:

Code: Select all: /* BitBoard LSB number search */ /* (R. Scharnagl, 2005/Jul/22) */ /* (zero board will cause -64) */ const unsigned lsbRS_lo[32] = { 0, 15, 3, 17, 29, 10, 8, 18, 30, 14, 2, 24, 26, 13, 1, 19, 31, 16, 28, 4, 5, 23, 25, 6, 11, 27, 22, 9, 7, 21, 20, 12 }; const unsigned lsbRS_hi[32] = { 32, 47, 35, 49, 61, 42, 40, 50, 62, 46, 34, 56, 58, 45, 33, 51, 63, 48, 60, 36, 37, 55, 57, 38, 43, 59, 54, 41, 39, 53, 52, 44 }; int bitScan3(const U64 b) { unsigned short fold; int buf; if ((buf = (int)b) != 0) { buf &= -buf; --buf; fold = buf ^ (buf >> 16); fold *= 0x70a7; return lsbRS_lo[fold >> (16-5)]; } if ((buf = (int)(b>>32)) != 0) { buf &= -buf; --buf; fold = buf ^ (buf >> 16); fold *= 0x70a7; return lsbRS_hi[fold >> (16-5)]; } return -64; }

Reinhard.

by **Gerd Isenberg** » 22 Jul 2005, 18:08

Hi Reinhard, Leen and all others,

wow - a huge bitscan discussion, i can't believe ;-)

Some bitscan pioneers to mention:

Walter Faxon
Inventor (in the 80ties!) of the magic 64-bit bitscan with some xor of an empirical determined constant and folding down to a byte. There are zillions of variations with +=/-= folding, with a lookup size of ~160.

Eugene Nalimov
published the conditional reverse bitscan with 8- or 16-bit lookup in CCC.

Charles E- Leiserson, Harald Prokop and Keith H. Randelll from MIT
"Using de Bruijn Sequences to index a 1 in a Computer Word"
http://supertech.csail.mit.edu/papers/debruijn.pdf

The imho best bsf-replacement for x86-64, since mul is very fast here and bsf reg64,reg64 is still 9-cycles vector path.

Matt Taylor
invented the folded De Bruijn trick. This is imho the best replacement for bsf pairs on 32-bit CPUs with fast mul.

btw. Matt is regular poster in comp.lang.asm.x86.

Cheers,
Gerd

by **Leen Ammeraal** » 22 Jul 2005, 18:46

Reinhard Scharnagl wrote:Leen,

...using De Bruijin (wrong spelled in code).

Reinhard.

No, Reinhard, the name should be written De Bruijn, which is a rather
well-known Dutch name, so it was correct in the code, except that
it should have been "De" instead of "de" if initials are absent.
We write: "Professor De Bruijn", but "Professor N. G. de Bruijn",
but I am afraid only few Dutchmen know this rule, let alone
people from other countries.

It seems that both your new version bitScan3 and Gerd Isenberg's
message provide me with some more homework.
Leen

by **Gerd Isenberg** » 22 Jul 2005, 21:09

Leen Ammeraal wrote:
Reinhard Scharnagl wrote:Leen,

...using De Bruijin (wrong spelled in code).

Reinhard.

No, Reinhard, the name should be written De Bruijn, which is a rather
well-known Dutch name, so it was correct in the code, except that
it should have been "De" instead of "de" if initials are absent.
We write: "Professor De Bruijn", but "Professor N. G. de Bruijn",
but I am afraid only few Dutchmen know this rule, let alone
people from other countries.

It seems that both your new version bitScan3 and Gerd Isenberg's
message provide me with some more homework.
Leen

Hi Leen,

what about a 64-corner polygon swimming pool and Inge has to traverse De Bruijn graphs?

A De Bruijn sequence with binary digits (among other alphabets) for 64-bitscan purpose is a string of 64+5 bits or a 64-bit circular sequence, containing 64 overlapping unique six-bit strings. For convenience the circular sequence is implemented in a 64-bit word that way, that the five leading bits are wrapped to the "hidden" zero-bits "right" of the least significant bit 2**0. Or in other words, the five leading bits must be zero to make the sequence circular in a 64-bit word.

Now, when we multiply the De Bruijn sequence with a power of two, it is like shifting the De Bruijn sequence left. The upper six bits are the result - a unique six bit sequence. The shift amount by multiplying with a single bit varies from 0 (times one) until 63 (times 2**63). If we shift left by 63, 63 zeros come in from right to left and the remaining upper six bits represent either zero or 32.

The sequence must be odd, if we start with 0000001....1B,
The sequence must be even, if we start with 000001.....0B
(infact only a rotate left one of the former).

There are 67,108,864 or 0x04000000 64-bit De Bruijn sequences with six leading zeros. a (recursive) backtracking algorithm to generate or count all De Bruijn sequences is quite simple - or to traverse De Bruijn graphs.

Gerd

by **Leen Ammeraal** » 23 Jul 2005, 20:05

Thanks, Gerd. I had not realized that the Inge your refer to
has the same surname as N.G. de Bruijn.
The paper you referred to in your other message was also
reasonably readable.
Do you use assembly code for bitboard manipulation in your engine Isichess?
What is the rating of Isichess these days?
Regards,
Leen

by **Gerd Isenberg** » 23 Jul 2005, 22:12

Leen Ammeraal wrote:Thanks, Gerd. I had not realized that the Inge your refer to
has the same surname as N.G. de Bruijn.
The paper you referred to in your other message was also
reasonably readable.
Do you use assembly code for bitboard manipulation in your engine Isichess?
What is the rating of Isichess these days?
Regards,
Leen

Hi Leen,

i have one rotated version without any assembly, but Matt's bitscan with reset found bits. The mmx-version has inline assembly for mmx-fill-routines but not for traversing bitboards. Matt's routine is slightly faster on my current amd64 box in 32-bit mode than the bsf-inline asm bitscan i used before. One reason beside slow bsf vector path instructions - as already mentioned by Dann Corbit - inlining inline asm confuses msc optimizer.

Code: Select all: unsigned int bitScanAndReset(BitBoard &bb) { BitBoard b = bb ^ (bb - 1); bb = bb & (bb - 1); unsigned int fold = ((int) b) ^ ((int)(b>>32)); return lsz64_tbl[(fold * 0x78291ACF) >> (32-6)]; }

My "future" 64-bit approach don't bitscans at all, only x & -x bit isolation.
And hashing move-bitboards, e.g. the diff between from-to bitboard.
But i fear i need a sabatical to finish it due to my current job eats all my energy and time.

My "old" programs are not able to play automaticly - since mdi (simultanious play) had higher priority than winboard/uci as i started development in end of the 90ties by porting the mdi-gui-features of my old and commercial dos-program to windows.

So don't ask me about the rating. I like to play over the board tournaments and except last ipccc it wasn't so bad.

Gerd

by **Pradu** » 23 Jul 2005, 23:20

Gerd Isenberg wrote:My "future" 64-bit approach don't bitscans at all, only x & -x bit isolation.
Gerd

But wont putting moves in hash-table take up a LOT of memory?
Won't your move generation tables be massive without some sort of bitscan or log? (from-to) hashing will still be a big number unless I have misunderstood your statement.

by **Gerd Isenberg** » 24 Jul 2005, 09:14

Pradu wrote:
Gerd Isenberg wrote:My "future" 64-bit approach don't bitscans at all, only x & -x bit isolation.
Gerd

But wont putting moves in hash-table take up a LOT of memory?
Won't your move generation tables be massive without some sort of bitscan or log? (from-to) hashing will still be a big number unless I have misunderstood your statement.

Hi Pradu,
i mean fromBB-ToBB of a particular piece and direction is factor of a De Bruijn like multiplication, and to use the upper 6..9 bits as unique move index. It is a matter of choosing combinations of De Bruijn constants and some offsets for a compact move index range. Quite pawn-, knight-, king- and bishop-moves map directly into a range of 7680. Rook- and queen-moves require one additional lookup to fill the gaps in this 7680 area. 7680 is the size of history tables, where only quite moves are considered.

The total move index range is roughly the product of 7680 by all combinations of target squares, that is empty, x Pawn, x Knight, x Bishop, x Rook and x Queen. So the total table size for all moves is: 6*7680 = 46080 (0xB400). Otoh i have a compact, quad-bitboard based node-structure/class with sizeof 64 bytes (1 cacheline).

Domove is basically copy with xor/and-add of a 64-byte structure with a table slot indexed by moveidx := [0...0xB400]. I admit that the table is huge 45K*64 = 2880KByte, otoh moves are not that randomly during search so i still expect a lot of hits in L1-cache. I also have some tricks to hide the latency of that lookups and hopefully save memory bandwidth elsewhere.

The whole approach is based on sse2/3 instructions to work with bitboards SIMD wise.

Gerd

by **diepeveen** » 25 Jul 2005, 23:29

Reinhard Scharnagl wrote:Hi Leen,

concerning your question to different behavior of my shown routines I have worked out the dependancy of the stage of board population.

Code: Select all
Initialized rand Table: 1250 ms Average population = 50% bitScan2: 906 ms (9999221) bitScan: 1344 ms (9999221) Initialized rand Table: 2484 ms Average population = 25% bitScan2: 1078 ms (29997211) bitScan: 1344 ms (29997211) Initialized rand Table: 2485 ms Average population = 12% bitScan2: 1937 ms (69852941) bitScan: 1360 ms (69852941) Initialized rand Table: 2485 ms Average population = 6% bitScan2: 2953 ms (134895445) bitScan: 1344 ms (134895445) Initialized rand Table: 2484 ms Average population = 3% bitScan2: 3828 ms (185222843) bitScan: 1344 ms (185222843) Zero table bitScan2(0) = -64 bitScan(0) = 0

Code: Select all
Initialized rand Table: 1266 ms Average population = 50% popCount2: 2156 ms (320017476) popCount: 4328 ms (320017476) Initialized rand Table: 5485 ms Average population = 25% popCount2: 2172 ms (160006818) popCount: 4312 ms (160006818) Initialized rand Table: 5468 ms Average population = 12% popCount2: 2172 ms (80002405) popCount: 4313 ms (80002405) Initialized rand Table: 5469 ms Average population = 6% popCount2: 2172 ms (40733263) popCount: 4312 ms (40733263) Initialized rand Table: 5469 ms Average population = 3% popCount2: 2187 ms (20209492) popCount: 4328 ms (20209492) Zero table popCount2(0) = 0 popCount(0) = 0

It is to be seen, that bitScan2() is highly depending on population, whereas popCount2() isn't at all.

Reinhard.

Your Rand() generator is a tad slow by the way. See in the highend forums the ranrot postings i did. I had optimized to 64 bits
the ranrot generator which produces each 3.1 nanoseconds a new random number.

How many cycles is that function and at which processors did you test it?
My mainconcern is x86-64 by the way, this bitscan2 looks ugly. Could be a 100 cycles worst case.

Leen is mixing 8 bits code with 32 bits with 64 bits code. I dislike that principle too. I like the short way in which Leen expresses his code though. Small compact and generic.

Mixing 32 bits code with a little bit of 64 bits is already ugly enough.

A64 has 64KB datacache. It's unclear what the future will bring us.
If intel produces some new processor that's fast and has 8 cores,
it most likely will have like 16KB L1 datacache.

It's unlikely that lookups to L2 will get faster. Right now it's 13 cycles for Opteron, but that will get also worse then they clock'em higher.

Code that needs therefore small arrays is preferred.

Avoiding branches is a must.

A small test i did confirmed that with Diep. A fast beancounter (with checks in qsearch of course) i wrote with default diep functions (a makemove/unmakemove that's doing incremental attacktables inside it which slows down things) is running at 1.6 mln nps (K7 2.1Ghz) and default diep runs at 85k nps.

I want to speed it up a little.

In any case when i run it at an opteron processor (865) then it's suddenly the case that GCC generates real ugly looking code, but Diep gets 107k nps, whereas the fast beancounter hardly gets faster. Of course there is a Mhz difference.

K7 is 2.127Ghz and the opteron cpu at 1.8Ghz

But the idea is clear. I must make my arrays tinier whereever i can.

Note that the opteron has a huge space for codesize in L2 which matters a lot for the speed too.

But i was a bit amazed to see that todays processors it's hard to search real fast, whereas in far past i could rite a fast beancounter that searched at 500 cycles per node (at P5). At K7 i'm not even getting near 1000 cycles a node so far (when doing checks in qsearch). Very bad.

So all those lookup arrays of 16KB, when you benchmark them in a stand alone routine.

In reality you use 10 of such type arrays, and that just slows you down in the future.

We need branchless code with small arrays. Executing more instructions is NO problem at all, as long as they are not related and not demanding the RISC principle.

The popcount2 looks very clean code. It allows multiple instructions per cycle. Processors of today can do 3. That will get 4 in future i assume.
Cell can do even more than that.

by **diepeveen** » 25 Jul 2005, 23:41

Reinhard Scharnagl wrote:Leen,

maybe the source is from http://my.fit.edu/~mtaylor/tzc.html, and the correct name could be M. Taylor (?) using De Bruijin (wrong spelled in code).

I have tried to apply that principle to splitted data, but it has been not that fast, but faster than bitScan2() overall. Nevertheless it could be useful at processors with no fast 32 Bit multiplication:
Code: Select all
/* BitBoard LSB number search */ /* (R. Scharnagl, 2005/Jul/22) */ /* (zero board will cause -64) */ const unsigned lsbRS_lo[32] = { 0, 15, 3, 17, 29, 10, 8, 18, 30, 14, 2, 24, 26, 13, 1, 19, 31, 16, 28, 4, 5, 23, 25, 6, 11, 27, 22, 9, 7, 21, 20, 12 }; const unsigned lsbRS_hi[32] = { 32, 47, 35, 49, 61, 42, 40, 50, 62, 46, 34, 56, 58, 45, 33, 51, 63, 48, 60, 36, 37, 55, 57, 38, 43, 59, 54, 41, 39, 53, 52, 44 }; int bitScan3(const U64 b) { unsigned short fold; int buf; if ((buf = (int)b) != 0) { buf &= -buf; --buf; fold = buf ^ (buf >> 16); fold *= 0x70a7; return lsbRS_lo[fold >> (16-5)]; } if ((buf = (int)(b>>32)) != 0) { buf &= -buf; --buf; fold = buf ^ (buf >> 16); fold *= 0x70a7; return lsbRS_hi[fold >> (16-5)]; } return -64; }

Reinhard.

So you could get in theory 2 mispredictions. Let's say 1 misprediction on average. That's like 30 cycles or so, let's not count the rest.
1 compare adds at least 1 cycle. a 32 bits multiply that is 1 cycle.
And more importantly all instructions can get executed only at 1 instruction at a time.

More interesting is what Leen quotes, however without testing it,
is it going to run bugfree?

int bitScan (const U64 bb)
{
const U64 lsb = (bb & -(long long) bb) - 1;
const unsigned int foldedLSB = ((int) lsb) ^ ((int) (lsb >> 32));
return lsz64_tbl[foldedLSB * 0x78291ACF >> 26];
}

This is like 10 cycles when inlined, but why are the casts to signed int and signed long long needed?

I bet some compilers are going to have problems with that.

Of course in past i never followed Brown discussions at the internet,
so the generic proof that it works at all processors and for all compilers i missed a little.

But let me ask you this. All all processors you know and in the future, is this code going to run bugfree, yes also at highend processors?

Swear it on your mother!

Let's say with 64 bits integers...

by **diepeveen** » 25 Jul 2005, 23:50

better random generator. Please note that i'm using also other timing function. Not ftime but the real wall clock of course with gettimeofday.

To quote the linux manual about ftime: "this function is obsolete"

/* define parameters (R1 and R2 must be smaller than the integer size): */
#define KK 17
#define JJ 10
#define R1 5
#define R2 3

/* global variables Ranrot */
BITBOARD randbuffer[KK+3] = { /* history buffer filled with some random numbers */
0x92930cb295f24dab,0x0d2f2c860b685215,0x4ef7b8f8e76ccae7,0x03519154af3ec239,0x195e36fe715fad23,
0x86f2729c24a590ad,0x9ff2414a69e4b5ef,0x631205a6bf456141,0x6de386f196bc1b7b,0x5db2d651a7bdf825,
0x0d2f2c86c1de75b7,0x5f72ed908858a9c9,0xfb2629812da87693,0xf3088fedb657f9dd,0x00d47d10ffdc8a9f,
0xd9e323088121da71,0x801600328b823ecb,0x93c300e4885d05f5,0x096d1f3b4e20cd47,0x43d64ed75a9ad5d9
/*0xa05a7755512c0c03,0x960880d9ea857ccd,0x7d9c520a4cc1d30f,0x73b1eb7d8891a8a1,0x116e3fc3a6b7aadb*/
};
int r_p1, r_p2; /* indexes into history buffer */

/* global variables RASML */
BITBOARD *hashtable[MAXPROCESSES],nentries,globaldummy=0;
GlobalTree *tree;
int ProcessNumber,
cpus; // number of processes for this test
#if UNIX
int shm_tree,shm_hash[MAXPROCESSES];
#endif
char rasmexename[2048];

/******************************************************** AgF 1999-03-03 *
* Random Number generator 'RANROT' type B *
* by Agner Fog *
* *
* This is a lagged-Fibonacci type of random number generator with *
* rotation of bits. The algorithm is: *
* X[n] = ((X[n-j] rotl r1) + (X[n-k] rotl r2)) modulo 2^b *
* *
* The last k values of X are stored in a circular buffer named *
* randbuffer. *
* *
* This version works with any integer size: 16, 32, 64 bits etc. *
* The integers must be unsigned. The resolution depends on the integer *
* size. *
* *
* Note that the function RanrotAInit must be called before the first *
* call to RanrotA or iRanrotA *
* *
* The theory of the RANROT type of generators is described at *
* www.agner.org/random/ranrot.htm *
* *
*************************************************************************/

FORCEINLINE BITBOARD rotl(BITBOARD x,int r) {return(x<<r)|(x>>(64-r));}

/* returns a random number of 64 bits unsigned */
FORCEINLINE BITBOARD RanrotA(void) {
/* generate next random number */
BITBOARD x = randbuffer[r_p1] = rotl(randbuffer[r_p2],R1) + rotl(randbuffer[r_p1], R2);
/* rotate list pointers */
if( --r_p1 < 0)
r_p1 = KK - 1;
if( --r_p2 < 0 )
r_p2 = KK - 1;
return x;
}

/* this function initializes the random number generator. */
void RanrotAInit(void) {
int i;

/* one can fill the randbuffer here with possible other values here */
randbuffer[0] = 0x92930cb295f24000 | (BITBOARD)ProcessNumber;
randbuffer[1] = 0x0d2f2c860b000215 | ((BITBOARD)ProcessNumber<<12);

/* initialize pointers to circular buffer */
r_p1 = 0;
r_p2 = JJ;

/* randomize */
for( i = 0; i < 300; i++ )
(void)RanrotA();
}

/* Now the RASML code */
char *To64(BITBOARD x) {
static char buf[256];
char *sb;

sb = &buf[0];
#if UNIX
sprintf(buf,"%llu",x);
#else
sprintf(buf,"%I64u",x);
#endif
return sb;
}

int GetClock(void) {
/* The accuracy is measured in millisecondes. The used function is very accurate according
* to the NT team, way more accurate nowadays than mentionned in the MSDN manual. The accuracy
* for linux or unix we can only guess. Too many experts there.
*/
#if UNIX
struct timeval timeval;
struct timezone timezone;
gettimeofday(&timeval, &timezone);
return((int)(timeval.tv_sec*1000+(timeval.tv_usec/1000)));
#else
return((int)GetTickCount());
#endif
}

by **Dann Corbit** » 26 Jul 2005, 01:15

diepeveen wrote:{snip}
More interesting is what Leen quotes, however without testing it,
is it going to run bugfree?
Code: Select all
int bitScan (const U64 bb) { const U64 lsb = (bb & -(long long) bb) - 1; const unsigned int foldedLSB = ((int) lsb) ^ ((int) (lsb >> 32)); return lsz64_tbl[foldedLSB * 0x78291ACF >> 26]; }

This is like 10 cycles when inlined, but why are the casts to signed int and signed long long needed?

The cast to signed long long is just to eliminate a compiler warning on some compilers along the lines of "negation of unsigned value is still unsigned" but it is harmless.

The two casts to int are actually not quite correct. They should be casts to unsigned int. I suspect that a dumb compiler could actually do an unnecessary operation because of it. They also ASSUME that int is 32 bits.

To make it fully portable {on systems where unsigned int is at least 32 bits but possibly 64 bits or more} we could do this:

Code: Select all: assert(UINT_MAX >= 0xffffffff); const unsigned int foldedLSB = (unsigned) ((lsb & 0xffffffff) ^ (lsb >> 32));

Now, if you are even more paranoid about a case where long long is bigger than 64 bits and also you may have bits above bit number 64 set (which really ruins the whole concept) then all bets are off.

But let me ask you this. All all processors you know and in the future, is this code going to run bugfree, yes also at highend processors?

Swear it on your mother!

Let's say with 64 bits integers...

Predicting the future is always dangerous. The standard can change over time. We should at least have a deprecation warning, of course.

by **Leen Ammeraal** » 26 Jul 2005, 07:41

Dann Corbit wrote:...
They also ASSUME that int is 32 bits.
...
We should at least have a deprecation warning, of course.
...

Hi Dann,
I like what your wrote about unsafe assumptions.
What about some simple code like that shown below?
I have not published my engine's code, so it is not necessary
to make it as compiler-independent as possible, but I would
very much appreciate getting an error message when
one day I switch to a compiler for which type char does
no longer consist of exactly 8 bits, or the types short
and int are no longer 2 and 4 bytes, respectively.
(Since, with VC++ 6.0, 64-bit integers are denoted
by type __int64, I did not include a test for 64-bit
ints in the code below.)

Code: Select all: #include <limits.h> #if (UCHAR_MAX != 0xFF) #error(UCHAR_MAX != 0xFF); #endif #if (USHRT_MAX != 0xFFFF) #error(USHRT_MAX != 0xFFFF); #endif #if (UINT_MAX != 0xFFFFFFFF) #error(UINT_MAX != 0xFFFFFFFF); #endif

Leen

by **Dann Corbit** » 26 Jul 2005, 18:16

Leen Ammeraal wrote:
Dann Corbit wrote:...
They also ASSUME that int is 32 bits.
...
We should at least have a deprecation warning, of course.
...

Hi Dann,
I like what your wrote about unsafe assumptions.
What about some simple code like that shown below?
I have not published my engine's code, so it is not necessary
to make it as compiler-independent as possible, but I would
very much appreciate getting an error message when
one day I switch to a compiler for which type char does
no longer consist of exactly 8 bits, or the types short
and int are no longer 2 and 4 bytes, respectively.
(Since, with VC++ 6.0, 64-bit integers are denoted
by type __int64, I did not include a test for 64-bit
ints in the code below.)

Code: Select all
#include <limits.h> #if (UCHAR_MAX != 0xFF) #error(UCHAR_MAX != 0xFF); #endif #if (USHRT_MAX != 0xFFFF) #error(USHRT_MAX != 0xFFFF); #endif #if (UINT_MAX != 0xFFFFFFFF) #error(UINT_MAX != 0xFFFFFFFF); #endif

Leen

If your code makes assumptions about the size of objects, then you must do something like that (or better yet, assertions). Defensive programming is a very, very good idea. And asserts and precompiler checks do not cost you anything in runtime efficiency.

There are systems where those assumptions will not hold true. At some point, I expect that int, and unsigned int will be 64 bits (which is actually what they should have been in the first place). It is probably better to use masks instead of assuming that the casts will trim off the unwanted bits. There is at least one conforming compiler that I know of that has a 32 bit char type (for a graphics chip).

I did not benchmark it yet, but I guess that these are just as fast as the versions that assume integral sizes, and they do not rely upon the assumption that int and unsigned are exactly 32 bits. They will also work for systems with 64 bit int and unsigned types. And if is is too small to hold the folded half of a 64 bit integer, you will get an assert when compiling in debug mode.

Code: Select all: #include <assert.h> #include <limits.h> /* >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Matt Taylor's implementation of the de Bruijn bitscan: From a post on CCC: http://chessprogramming.org/cccsearch/ccc.php?art_id=306789 Subject : Re: There is huge potential to improve further Posted by : Matt Taylor on July 16, 2003 at 22:44:43 Which is indirectly a reference to this work: "Using de Bruijn Sequences to Index a 1 in a Computer Word" Charles E.Leiserso Harald Prokop Keith H, Randall MIT Laboratory for Computer Science, Cambridge, MA 02139, USA July 7, 1998 http://supertech.csail.mit.edu/papers/debruijn.ps The reset option is an obvious and useful addition implemented by Gerd Isenberg <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< */ const int lsz64_tbl[64] = { 0, 31, 4, 33, 60, 15, 12, 34, 61, 25, 51, 10, 56, 20, 22, 35, 62, 30, 3, 54, 52, 24, 42, 19, 57, 29, 2, 44, 47, 28, 1, 36, 63, 32, 59, 5, 6, 50, 55, 7, 16, 53, 13, 41, 8, 43, 46, 17, 26, 58, 49, 14, 11, 40, 9, 45, 21, 48, 39, 23, 18, 38, 37, 27, }; int bitScan(const bitboard & bb) { assert(UINT_MAX >= 0xffffffff); const bitboard lsb = (bb & -(long long) bb) - 1; const unsigned int foldedLSB = (unsigned) ((lsb & 0xffffffff) ^ (lsb >> 32)); return lsz64_tbl[foldedLSB * 0x78291ACF >> 26]; } unsigned int bitScanAndReset(bitboard & bb) { assert(UINT_MAX >= 0xffffffff); bitboard b = bb ^ (bb - 1); bb &= (bb - 1); unsigned int fold = (unsigned) ((b & 0xffffffff) ^ (b >> 32)); return lsz64_tbl[(fold * 0x78291ACF) >> (32 - 6)]; } /*popCount() *a noniterative population count of 1 bits in a quadword * *@param b - the quadword to be counted *@returns the number of 1 bits in b */ #define m1 0x5555555555555555ULL #define m2 0x3333333333333333ULL unsigned popCount(bitboard b) { unsigned n; const bitboard a = b - ((b >> 1) & m1); const bitboard c = (a & m2) + ((a >> 2) & m2); n = (unsigned) ((c & 0xffffffff) + (c >> 32)); n = (n & 0x0F0F0F0F) + ((n >> 4) & 0x0F0F0F0F); n = (n & 0xFFFF) + (n >> 16); n = (n & 0xFF) + (n >> 8); return n; }

For conforming compilers, you can also use integral types which have "hardwired" bit sizes.
http://www.opengroup.org/onlinepubs/009 ... int.h.html

by **Gerd Isenberg** » 29 Jul 2005, 20:39

Leen Ammeraal wrote:
Hi Dann,
I like what your wrote about unsafe assumptions.
What about some simple code like that shown below?
I have not published my engine's code, so it is not necessary
to make it as compiler-independent as possible, but I would
very much appreciate getting an error message when
one day I switch to a compiler for which type char does
no longer consist of exactly 8 bits, or the types short
and int are no longer 2 and 4 bytes, respectively.
(Since, with VC++ 6.0, 64-bit integers are denoted
by type __int64, I did not include a test for 64-bit
ints in the code below.)

Code: Select all
#include <limits.h> #if (UCHAR_MAX != 0xFF) #error(UCHAR_MAX != 0xFF); #endif #if (USHRT_MAX != 0xFFFF) #error(USHRT_MAX != 0xFFFF); #endif #if (UINT_MAX != 0xFFFFFFFF) #error(UINT_MAX != 0xFFFFFFFF); #endif

Leen

Hi Leen,

i also suggest to make sure two's complement works ;-)

#if (~(USHRT_MAX+UCHAR_MAX) != ~USHRT_MAX-UCHAR_MAX)
#error(no two's-complement);
#endif

Gerd

by **Leen Ammeraal** » 30 Jul 2005, 07:24

Gerd Isenberg wrote:
i also suggest to make sure two's complement works ;-)

#if (~(USHRT_MAX+UCHAR_MAX) != ~USHRT_MAX-UCHAR_MAX)
#error(no two's-complement);
#endif

Gerd

Thanks Gerd! As usual, you gave a very clever solution, which
takes me a lot of time to understand.
Would the following simpler solution not be just as good?

Code: Select all: #if ((-1 & 0xFFFF) != 0xFFFF) #error(no two's-complement); #endif

Remember, -1 is 111...111 in two's complement and
111...110 in one's complement. As you can see, the above
code tests whether the least significant 16 bits of -1 are all equal to 1.
Leen

Winboard Forum

Best BitBoard LSB funktion?

Re: Best BitBoard LSB funktion?

Re: Best BitBoard LSB funktion?

Re: Best BitBoard LSB funktion?

Re: Best BitBoard LSB funktion?

Re: Best BitBoard LSB funktion?

Re: Best BitBoard LSB funktion?

Re: Best BitBoard LSB funktion?

Re: Best BitBoard LSB funktion?

Re: Best BitBoard LSB funktion?

Re: Best BitBoard LSB funktion?

No bitscans

Re: No bitscans

Re: Best BitBoard LSB funktion?

Re: Best BitBoard LSB funktion?

Re: Best BitBoard LSB funktion?

Re: Best BitBoard LSB funktion?

Re: Best BitBoard LSB funktion?

Re: Best BitBoard LSB funktion?

Re: Best BitBoard LSB funktion?

Re: Best BitBoard LSB funktion?

Who is online