Winboard Forum

by **Gerd Isenberg** » 27 Aug 2006, 20:47

Pradu wrote:I found a bug in the PEFRECT_HASH approach, fixing.

There are still bugs with perfect hash (inlined). Semicolon missing and my test-program indicates broken result. Btw. better use unsigned int square parameter for the getters, otherwise i have some movsx inside the 64-bit code...

by **diepeveen** » 27 Aug 2006, 21:00

Gerd Isenberg wrote:
Pradu wrote:I found a bug in the PEFRECT_HASH approach, fixing.

There are still bugs with perfect hash (inlined). Semicolon missing and my test-program indicates broken result. Btw. better use unsigned int square parameter for the getters, otherwise i have some movsx inside the 64-bit code...

Hi Gerd,

What compiler generates the MOVSX?

Just downloaded intel c++ 9.1 evaluation here now. Let's see how that one is performing onto some codes here and for Diep.

In general there is no difference between signed and unsigned integer speeds at all, despite some people past years reporting. Just a wrong way in how they compiled it usual.

Vincent

by **Gerd Isenberg** » 27 Aug 2006, 21:12

diepeveen wrote:Hi Gerd,

What compiler generates the MOVSX?

Just downloaded intel c++ 9.1 evaluation here now. Let's see how that one is performing onto some codes here and for Diep.

In general there is no difference between signed and unsigned integer speeds at all, despite some people past years reporting. Just a wrong way in how they compiled it usual.

Vincent

Hi Vincent,

i think compiler have to consider possible negative indices here.
The msvc2005 compiler has

Code: Select all: movsxd r8, edx

to get the square inside the 64-bit index register. Instead of

Code: Select all: mov r8d, edx

with unsigned. Not a big deal if copying the register anyway (which is not necessary, here btw.). But sign extension is simply not necessary.

Gerd

by **Pradu** » 27 Aug 2006, 21:45

Gerd Isenberg wrote:There are still bugs with perfect hash (inlined). Semicolon missing and my test-program indicates broken result. Btw. better use unsigned int square parameter for the getters, otherwise i have some movsx inside the 64-bit code...

Fixed now. It seems there were some typecasting problems in the initializer...

It is modified to to unsigned int now.

by **Gerd Isenberg** » 27 Aug 2006, 22:07

Pradu wrote:
Gerd Isenberg wrote:There are still bugs with perfect hash (inlined). Semicolon missing and my test-program indicates broken result. Btw. better use unsigned int square parameter for the getters, otherwise i have some movsx inside the 64-bit code...

Fixed now. It seems there were some typecasting problems in the initializer...

It is modified to to unsigned int now.

Ok works now - it is a lot slower in the snoob-test.
8,3 ns for bishop-attacks.
13,3 ns for rook-attacks.
16.6 ns for queen-attacks.

with unsigned i ment the Bmacig and Rmagic prototypes.

Gerd

by **Pradu** » 27 Aug 2006, 22:21

Gerd Isenberg wrote:Ok works now - it is a lot slower in the snoob-test.
8,3 ns for bishop-attacks.
13,3 ns for rook-attacks.
16.6 ns for queen-attacks.
with unsigned i ment the Bmacig and Rmagic prototypes.
Gerd

Slower? Bad luck

. Yes I changed the inlining prototypes.

by **Gerd Isenberg** » 27 Aug 2006, 22:34

Yes, but the nice thing thanks to your code is that you can try one of three options immediatly. PERFECT_HASH has four reads, but two expensive one which suffer from L1-cache pollution, while MINIMIZE_MAGIC has five reads, but only one expensive on the huge array.

by **Andreas Guettinger** » 28 Aug 2006, 00:27

Ok, I repeated the test with WAC 30, because I thought the starting position is probably not the best. This time I only compiled for 64-bit to save time.

Code: Select all: FEN: 1r3r2/4q1kp/b1pp2p1/5p2/pPn1N3/6P1/P3PPBP/2QRR1K1 w - - 0 1 (WAC 30) Power Mac G5 2.5 Ghz 64-bit: A B C Normal: 17.10 s 4.00 s 10.85 s Rotated: 13.64 s 3.14 s 13.04 s Magic: 10.53 s 3.08 s 9.87 s Minimize: 10.51 s 2.94 s 9.82 s Perfect: 10.80 s 3.07 s 10.01 s Explanations: A) Time to perft 5 of Anima in WAC 30 B) Time to gererate 80000000 moves C) Time to Generate+make+undo 80000000 moves

Depending on the position Pradus magic bitboards are often twice as fast as conventional bitboards and as fast as my rotated bitboards, plus the makemove is faster of the nonrotated or magic bitboard resulting in a net speed gain. The minimized and perfect made not a lot of difference here, maybe because Anima itself is not very optimized and rather inefficient.
I will test the same code on the athlon64.

by **Andreas Guettinger** » 28 Aug 2006, 01:04

Ok, the same on the amd64:

FEN: 1r3r2/4q1kp/b1pp2p1/5p2/pPn1N3/6P1/P3PPBP/2QRR1K1 w - - 0 1
(WAC 30)

Code: Select all: Athlon 64 3400+ 64-bit: A B C Normal: 13.70 s 3.93 s 6.97 s Rotated: 8.46 s 2.62 s 6.83 s Magic: 7.62 s 3.30 s 6.36 s Minimize: 8.25 s 2.87 s 6.07 s Perfect: 8.66 s 2.83 s 6.10 s

There is not much difference in makemove between rotated and nonrotated bitboards on the Athlon64. I'm a bit baffeled that doing

Code: Select all: #ifdef ROTATEBB ppos.occupied_rl90 ^= (set_mask_rl90[cfrom] | set_mask_rl90[cto]); ppos.occupied_rl45 ^= (set_mask_rl45[cfrom] | set_mask_rl45[cto]); ppos.occupied_rr45 ^= (set_mask_rr45[cfrom] | set_mask_rr45[cto]); #endif

two, three times in makemove leads to a speed loss on the powerpc...

by **Pradu** » 28 Aug 2006, 01:11

I've got a question Andreas. By generate moves do you mean generate the move bitboard, or generate the move bitboard and bitscan through it to make an array of moves? This might be why the timings are so close together -- bitscaning takes a good deal of time.

by **Gerd Isenberg** » 28 Aug 2006, 07:07

Pradu wrote:Nice to hear it is doing as good or better than the rotated approach in both 32-bits and 64-bits! Magic bitboard generators are also a lot more versitile than rotated bitboad generators. Take for example the issue of X-ray attacks.

Code: Select all
attacks=Rmoves(square,occupancy); xrayattacks=Rmoves(square,(occupancy&~attacks)) & ~attacks

It isnt possible to do something like this in rotated.

Good point. This an important feature of antirotated approaches, since we are independent on the current node structure and can pass any actual occupied sets to the attack getters.

Gerd

by **Tord Romstad** » 28 Aug 2006, 09:47

Andreas Guettinger wrote:
Code: Select all
#ifdef ROTATEBB ppos.occupied_rl90 ^= (set_mask_rl90[cfrom] | set_mask_rl90[cto]); ppos.occupied_rl45 ^= (set_mask_rl45[cfrom] | set_mask_rl45[cto]); ppos.occupied_rr45 ^= (set_mask_rr45[cfrom] | set_mask_rr45[cto]); #endif

The use of separate variables for occupied, occupied_rl90, occupied_rl45 and occupied_rr45 has always puzzled me. What is the advantage of using:

Code: Select all: bitboard_t occupied, occupied_rl90, occupied_rl45, occupied_rl45;

instead of the more natural:

Code: Select all: bitboard_t occupied[4];

?

Using a single array makes the code more compact and generic (especially when scanning for attacks in just a single direction, like when adding X-ray attacks in the static exchange evaluator), and based on my (frequently incorrect) intuition it should also be a bit faster. For instance, instead of your code:

Code: Select all: #ifdef ROTATEBB ppos.occupied_rl90 ^= (set_mask_rl90[cfrom] | set_mask_rl90[cto]); ppos.occupied_rl45 ^= (set_mask_rl45[cfrom] | set_mask_rl45[cto]); ppos.occupied_rr45 ^= (set_mask_rr45[cfrom] | set_mask_rr45[cto]); #endif

One could use:

Code: Select all: for(direction = 0; direction < 4; direction++) ppos.occupied[direction] ^= (set_mask[cfrom][direction] | set_mask[cto][direction]);

This looks faster to me, because setmask[cfrom][0], ..., setmask[cfrom][3] will occupy adjacent locations in memory, while set_mask_rl90[cfrom], ..., setmask_rr45[cfrom] will not.

Similarly (and probably more importantly), in many of the cases when you generate sliding attacks, you want to generate sliding attacks in more than one direction (e.g. when generating the moves for a sliding piece). With separate variables for each direction, you have to use something like:

Code: Select all: #define HorizontalAttacksBB(pos, sq) \ HorizontalAttacks[sq][(((pos)->occupied>>HorizontalShift[sq])&255)] #define VerticalAttacksBB(pos, sq) \ VerticalAttacks[sq][(((pos)->occupied_rl90>>VerticalShift[sq])&255)] #define DiagonalAttacksBB(pos, sq) \ DiagonalAttacks[sq][(((pos)->occupied_rl45>>DiagonalShift[sq])&255)] #define XDiagonalAttacksBB(pos, sq) \ XDiagonalAttacks[sq][(((pos)->occupied_rr45>>XDiagonalShift[sq])&255)] #define BishopAttackBB(pos, sq) \ (DiagonalAttacksBB(pos, sq) | XDiagonalAttacksBB(pos, sq)) #define RookAttackBB(pos, sq) \ (HorizontalAttacksBB(pos, sq) | VerticalAttacksBB(pos, sq)) #define QueenAttackBB(pos, sq) \ (BishopAttackBB(pos, sq) | RookAttackBB(pos, sq))

When using an array, you would instead have:

Code: Select all: bitboard_t SlidingAttacksBBArray[64][4][256]; uint8 Shift[64][4]; #define SlidingAttacksBB(pos, sq, direction) \ SlidingAttackBBArray[sq][direction][((pos)->occupied[direction]>>Shift[direction])&255] #define BishopAttackBB(pos, sq) \ (SlidingAttacksBB(pos, sq, DIAG_DIR) | SlidingAttacksBB(pos, sq, XDIAG_DIR)) #define RookAttackBB(pos, sq) \ (SlidingAttacksBB(pos, sq, HORIZ_DIR) | SlidingAttacksBB(pos, sq, VERT_DIR)) #define QueenAttackBB(pos, sq) \ (BishopAttackBB(pos, sq) | RookAttackBB(pos, sq))

Note that because the square index is located in the leftmost dimension, the sliding attacks in all directions from a single square are located close together in memory.

Obviously I must be missing something, because almost everybody else who uses rotated bitboards uses the clumsy-looking non-array approach. I would therefore appreciate if someone could explain to me: What's wrong with using a single occupied[4] instead of occupied, occupied_rl90, occupied_rl45 and occupied_rr45?

Tord

by **Andreas Guettinger** » 28 Aug 2006, 09:52

Pradu wrote:I've got a question Andreas. By generate moves do you mean generate the move bitboard, or generate the move bitboard and bitscan through it to make an array of moves? This might be why the timings are so close together -- bitscaning takes a good deal of time.

The latter, I call the move generator just x times in a row.

Ahem, I thought that is what you wanted?

by **Andreas Guettinger** » 28 Aug 2006, 10:05

Tord Romstad wrote:Obviously I must be missing something, because almost everybody else who uses rotated bitboards uses the clumsy-looking non-array approach. I would therefore appreciate if someone could explain to me: What's wrong with using a single occupied[4] instead of occupied, occupied_rl90, occupied_rl45 and occupied_rr45?

Tord

I suppose there is nothing wrong, it's just a matter of nomenclatur.
Like some prefer pieces[COLOR][n] and others whitepieces[n] and blackpieces[n]. I usually don't very much like array with more than one dimension, they give me the creeps.

I would be surprised if the speed difference is more than a mu. I always feared the more dimensions I add to an array the slower it get's, but I'm not a proffesional programmer, I'm a biologist.

Edit: There is another reason, a more historical one, I realized, at least im my case. As probably many others I wrote first a conventional bitboard move generator before the rotated one. That's why my code looks like that:

Code: Select all: ppos.occupied ^= (set_mask[cfrom] | set_mask[cto]); #ifdef ROTATEBB ppos.occupied_rl90 ^= (set_mask_rl90[cfrom] | set_mask_rl90[cto]); ppos.occupied_rl45 ^= (set_mask_rl45[cfrom] | set_mask_rl45[cto]); ppos.occupied_rr45 ^= (set_mask_rr45[cfrom] | set_mask_rr45[cto]); #endif

I never optimized that.

by **Zach Wegner** » 28 Aug 2006, 10:23

Tord Romstad wrote:Obviously I must be missing something, because almost everybody else who uses rotated bitboards uses the clumsy-looking non-array approach. I would therefore appreciate if someone could explain to me: What's wrong with using a single occupied[4] instead of occupied, occupied_rl90, occupied_rl45 and occupied_rr45?

Tord

Nothing, in fact, I'm surprised I've never seen it or thought of it before (I think). The only reason I can think of nobody using it is the fact that Crafty doesn't

. I no longer use rotated bitboards, but I can restructure my attack arrays like that. Thanks for the tip.

Zach

by **Gerd Isenberg** » 28 Aug 2006, 10:42

Tord Romstad wrote:Obviously I must be missing something, because almost everybody else who uses rotated bitboards uses the clumsy-looking non-array approach. I would therefore appreciate if someone could explain to me: What's wrong with using a single occupied[4] instead of occupied, occupied_rl90, occupied_rl45 and occupied_rr45?

Tord

Hi Tord,

the arrays are fine and more general and ensure that the members are adjacent in memory. Even with explicit unrolling, one may use constant direction- indizes. The loop needs of course one additional register - also the repetition condition is tiny overhead. That is, beside copying from crafty and elsewhere, the reason while most use explizit scalars with fixed semantics. Most compilers will unroll the loop anyway - so your method seems preferable and should not produce any overhead.

With antirotated approaches otoh you can avoid updating rotated at all, to decouple the sliding attack getter from the current node- or position structure ;-)

Gerd

by **Tord Romstad** » 28 Aug 2006, 12:40

Andreas Guettinger wrote:I suppose there is nothing wrong, it's just a matter of nomenclatur.
Like some prefer pieces[COLOR][n] and others whitepieces[n] and blackpieces[n]. I usually don't very much like array with more than one dimension, they give me the creeps.

For me, it's the other way around: I have nothing against multidimensional arrays, but I detest having separate variables and code chunks for white and black. More generally, I always try to avoid what I call "cut and paste code": Multiple code chunks which are almost identical, with just a few constants and variable names changed, and perhaps '>' replaced by '<' at some places. To me, such code is always a sure sign that I am missing some important abstraction.

I would be surprised if the speed difference is more than a mu.

I agree, it is certainly nothing big.

I always feared the more dimensions I add to an array the slower it get's,

I don't think this is necessarily the case, as long as everything except the leftmost dimension has a size equal to a power of 2, and the indices to the rightmost dimensions are the ones which change most frequently in the code. In other words,

Code: Select all: for(i = 0; i < m; i++) for(j = 0; j < n; j++) do_something(someArray[i][j]);

is probably faster than

Code: Select all: for(j = 0; j < n; j++) for(i = 0; i < m; i++) do_something(someArray[i][j]);

but I'm not a proffesional programmer, I'm a biologist.

I'm also not a programmer, but a mathematician (with a strong interest in biology, BTW), so your guess is probably as good as mine.

Admittedly I am a reasonably competent Lisp programmer, but as is often said: Lisp programmers know the value of everything, but the cost of nothing. :wink:

Tord

by **Tord Romstad** » 28 Aug 2006, 12:47

Hello Gerd,

Thanks for the comments!

Gerd Isenberg wrote:Hi Tord,

the arrays are fine and more general and ensure that the members are adjacent in memory. Even with explicit unrolling, one may use constant direction- indizes. The loop needs of course one additional register - also the repetition condition is tiny overhead. That is, beside copying from crafty and elsewhere, the reason while most use explizit scalars with fixed semantics. Most compilers will unroll the loop anyway - so your method seems preferable and should not produce any overhead.

That's what I thought, too. Because the loop bodies are usually tiny, the compiler will probably unroll almost all of them.

With antirotated approaches otoh you can avoid updating rotated at all, to decouple the sliding attack getter from the current node- or position structure

Yes, non-rotated approaches certainly have some rather enticing advantages. I'm using rotated bitboards now because they make it relatively easy to write compact and generic code, but they can often be frustratingly slow (at least on my 32-bit CPU). I should probably have a closer look at all the new non-rotated techniqes and see if I can find something I can use. The problem is that I want to use exactly the same code for each sliding direction, that I prefer not to use "magic" in any form, and that I don't want to have a single line of somebody else's code in my program.

Tord

by **Pradu** » 28 Aug 2006, 17:59

Tord Romstad wrote:Yes, non-rotated approaches certainly have some rather enticing advantages. I'm using rotated bitboards now because they make it relatively easy to write compact and generic code, but they can often be frustratingly slow (at least on my 32-bit CPU). I should probably have a closer look at all the new non-rotated techniqes and see if I can find something I can use. The problem is that I want to use exactly the same code for each sliding direction, that I prefer not to use "magic" in any form, and that I don't want to have a single line of somebody else's code in my program.

Tord

I posted a quick explanation on how magic move generation works here if for some reason you want to write it yourself:
http://www.vpittlik.org/wbforum/viewtopic.php?t=5256

by **Tord Romstad** » 28 Aug 2006, 20:11

Hello Pradu!

Pradu wrote:I posted a quick explanation on how magic move generation works here if for some reason you want to write it yourself:
http://www.vpittlik.org/wbforum/viewtopic.php?t=5256

Yes, I understand how the technique works. It is quite neat, and it is possible that I will be able to find a way to use it. I just have to somehow eliminate the "magic" part of it first. More specifically, I need to find a simple, straightforward and fast algorithm for computing all the required numbers; something that can be computed in a tiny fraction of a second during program initialisation. I can't use precomputed constant arrays containing numbers found by minutes or hours of computations and brute force search.

It looks like a non-trivial problem, but I hope it will turn out to be solvable. Too much else to do right now, though.

Tord

Winboard Forum

Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Who is online