Fastest Magic Move Bitboard Generator ready to use

Programming Topics (Computer Chess) and technical aspects as test techniques, book building, program tuning etc

Moderator: Andres Valverde

Fastest Magic Move Bitboard Generator ready to use

Postby Pradu » 25 Aug 2006, 16:37

Downoad the magic move bitboard generator here:

www.prism.gatech.edu/~gtg365v/Buzz

Because I don't have the necessary hardware (I only have a P4), I would appriciate it if someone could test how fast it is on 32/64 bit processors and comparisions in speed to other move bitboard generators. The code is made in such way that it can be integrated into any existing bitboard chess program very easily.
User avatar
Pradu
 
Posts: 343
Joined: 12 Jan 2005, 19:17
Location: Chandler, Arizona, USA

Re: Fastest Magic Move Bitboard Generator ready to use

Postby Gerd Isenberg » 25 Aug 2006, 22:15

Pradu wrote:Downoad the magic move bitboard generator here:

www.prism.gatech.edu/~gtg365v/Buzz

Because I don't have the necessary hardware (I only have a P4), I would appriciate it if someone could test how fast it is on 32/64 bit processors and comparisions in speed to other move bitboard generators. The code is made in such way that it can be integrated into any existing bitboard chess program very easily.


Thanks for the source, very convenient to try, if the bitboard mapping fits ;-)

Your approach reduces the size considerable - specially for the bishops. I don't like the additional indirection with the pointer array, plus variable shift array, so much. Whether the two additional accesses, even if L1, pay off, is the question. If you pick some part from your 64KByte L1 cache, from ~1MB or 4MB tables, probably doesn't matter so much. There are eventually L2/L3-cache and page issues. Guess Lasse's approach with larger dense, but homogenious arrays, and constant shift is faster.

A very quick, dump and "unfair" snooby-loop test, against my dense approach, where i only need 1.5KByte against your ~1MB:

Time1 is my 1.5KByte approach, time2 your one, all inlined, mvc2005:

Code: Select all
Rook:
time1 in sec =  3.266 runs 7624512*64 ~ 6.693 ns
time2 in sec =  2.922 runs 7624512*64 ~ 5.988 ns

Bishop:
time1 in sec =  3.703 runs 7624512*64 ~ 7.589 ns
time2 in sec =  2.031 runs 7624512*64 ~ 4.162 ns

Queen:
time1 in sec =  5.875 runs 7624512*64 ~12.040 ns
time2 in sec =  4.109 runs 7624512*64 ~ 8.421 ns


Code: Select all
int main(int argc, char* argv[])
{
   u64 atta=0, first = 0x1f, x, y, cnt;
   clock_t start, stop;

   init(); // my stuff
   initmagicmoves();
   start = clock();
   for (x = first, cnt=1; (y = snoob(x)) > x; x = y, cnt++)
   for (int sq = 0; sq < 64; sq++)
      atta ^= rookAttacks(y, (u32)sq);
   stop = clock();
   printf("time1 in sec = %6.3f runs %lld*64 ~%6.3f ns\n", (float)(stop-start)/CLOCKS_PER_SEC, cnt, ((float)(stop-start)/64*1000000.0) / cnt);   

   start = clock();
   for (x = first, cnt=1; (y = snoob(x)) > x; x = y, cnt++)
   for (int sq = 0; sq < 64; sq++)
      atta ^= PraduRookAttacks(y, sq);
   stop = clock();
   printf("time2 in sec = %6.3f runs %lld*64 ~%6.3f ns\n", (float)(stop-start)/CLOCKS_PER_SEC, cnt, ((float)(stop-start)/64*1000000.0) / cnt);   

   if ( atta != 0 ) {
      printf("oups, something is broken\n");
   }
   return 0;
}

Code: Select all
; Function compile flags: /Ogtpy
occ$ = 8
sq$ = 16
?PraduRookAttacks@@YA_K_KI@Z PROC         ; PraduRookAttacks, COMDAT
  00000   8b c2       mov    eax, edx
  00002   4c 8d 05 00 00
   00 00       lea    r8, OFFSET FLAT:__ImageBase
  00009   49 8b 94 c0 00
   00 00 00    mov    rdx, QWORD PTR ?magicmoves_r_mask@@3QB_KB[r8+rax*8]
  00011   48 23 d1    and    rdx, rcx
  00014   41 0f b6 8c 80
   00 00 00 00    movzx    ecx, BYTE PTR ?magicmoves_r_shift@@3QBHB[r8+rax*4]
  0001d   49 0f af 94 c0
   00 00 00 00    imul    rdx, QWORD PTR ?magicmoves_r_magics@@3QB_KB[r8+rax*8]
  00026   49 8b 84 c0 00
   00 00 00    mov    rax, QWORD PTR ?magicmoves_r_indecies@@3PAPEB_KA[r8+rax*8]
  0002e   48 d3 ea    shr    rdx, cl
  00031   48 8b 04 d0    mov    rax, QWORD PTR [rax+rdx*8]
  00035   c3       ret    0
?PraduRookAttacks@@YA_K_KI@Z ENDP         ; PraduRookAttacks

Code: Select all
; Function compile flags: /Ogtpy
occ$ = 16
sq$ = 24
?rookAttacks@@YA_K_KI@Z PROC            ; rookAttacks, COMDAT
  00000   40 53       push    rbx
  00002   44 8b c2    mov    r8d, edx
  00005   44 8b da    mov    r11d, edx
  00008   4c 8b d1    mov    r10, rcx
  0000b   8b ca       mov    ecx, edx
  0000d   49 8b d2    mov    rdx, r10
  00010   48 b8 01 01 01
   01 01 01 01 01    mov    rax, 0101010101010101H
  0001a   83 e1 07    and    ecx, 7
  0001d   41 83 e3 f8    and    r11d, -8
  00021   48 8d 1d 00 00
   00 00       lea    rbx, OFFSET FLAT:?firstRankAttacks@@3PAY07EA
  00028   48 d3 ea    shr    rdx, cl
  0002b   80 f1 07    xor    cl, 7
  0002e   48 23 d0    and    rdx, rax
  00031   48 b8 00 04 08
   10 20 40 80 00    mov    rax, 0080402010080400H
  0003b   48 0f af d0    imul    rdx, rax
  0003f   49 8b c0    mov    rax, r8
  00042   48 83 f0 38    xor    rax, 56
  00046   48 c1 e8 03    shr    rax, 3
  0004a   48 c1 ea 3a    shr    rdx, 58
  0004e   48 8d 04 d0    lea    rax, QWORD PTR [rax+rdx*8]
  00052   41 8a d0    mov    dl, r8b
  00055   44 0f b6 0c 18    movzx    r9d, BYTE PTR [rax+rbx]
  0005a   48 b8 01 02 04
   08 10 20 40 80    mov    rax, 8040201008040201H
  00064   83 e2 07    and    edx, 7
  00067   4c 0f af c8    imul    r9, rax
  0006b   48 b8 80 80 80
   80 80 80 80 80    mov    rax, 8080808080808080H
  00075   4c 23 c8    and    r9, rax
  00078   49 d3 e9    shr    r9, cl
  0007b   41 8d 4b 01    lea    ecx, DWORD PTR [r11+1]
  0007f   49 d3 ea    shr    r10, cl
  00082   41 8b cb    mov    ecx, r11d
  00085   41 83 e2 3f    and    r10d, 63
  00089   4e 8d 04 d2    lea    r8, QWORD PTR [rdx+r10*8]
  0008d   41 0f b6 04 18    movzx    eax, BYTE PTR [r8+rbx]
  00092   48 d3 e0    shl    rax, cl
  00095   49 0b c1    or    rax, r9
  00098   5b       pop    rbx
  00099   c3       ret    0
?rookAttacks@@YA_K_KI@Z ENDP            ; rookAttacks

Gerd
Gerd Isenberg
 
Posts: 285
Joined: 31 Jan 2005, 20:31
Location: Hattingen, Germany

Re: Fastest Magic Move Bitboard Generator ready to use

Postby Uri Blass » 25 Aug 2006, 22:58

Pradu wrote:Downoad the magic move bitboard generator here:

www.prism.gatech.edu/~gtg365v/Buzz

Because I don't have the necessary hardware (I only have a P4), I would appriciate it if someone could test how fast it is on 32/64 bit processors and comparisions in speed to other move bitboard generators. The code is made in such way that it can be integrated into any existing bitboard chess program very easily.


"You can use this to generate move bitboards for your own chess engine if you wish"

In other words people can start their own chess programs from a foreign move geneator.

I dislike that idea and I wonder if it is not going to be considered as a clone in tournaments.

If we let people to use that move generator in tournaments then tomorrow somebody is going to write some good evaluation function and write:
"You can use this to generate evaluation for your own chess engine if you wish"

After it continues in this way the total result is going to be that people are going to have non original chess engines that are considered as their engines.

Uri
User avatar
Uri Blass
 
Posts: 727
Joined: 09 Oct 2004, 05:59
Location: Tel-Aviv

Re: Fastest Magic Move Bitboard Generator ready to use

Postby Uri Blass » 25 Aug 2006, 23:25

I downloaded the code and I cannot check the speed of it in a task like calculating perft.

I do not see code to initialize position based on fen or even only to initialize the opening position and I do not know if there is a code to generate knight or pawn moves(I did not try to understand the code but the word knight is not mentioned there)

It does not seem to me a move generator and maybe it is only a part of a move generator.

Note that I have not a bitboard program in the first place but
I was interested in checking speed.

If people need to add some other bitboard code in order to use your generator then they check the speed of a combination of your code with another code and not the speed of your code.

Uri
User avatar
Uri Blass
 
Posts: 727
Joined: 09 Oct 2004, 05:59
Location: Tel-Aviv

Re: Fastest Magic Move Bitboard Generator ready to use

Postby Gerd Isenberg » 25 Aug 2006, 23:45

I agree that sharing ideas or pseudocode has another quality than sharing source code, conveniently to try. I also don't like the open source idea too much - and have problems with concrete frameworks, programs and even perft with unclear license conditions...

So far Pradu's code is only about to get rook- and bishop-attack-sets by square index of the slider and the occupied bitboard. There are the classical rotated approaches, Lasse's approach and zillion others doing the same. All well known. I think this is still an atomic function, somehow on the same level like bitscan and popcount or Kogge-Stone fill routines. This is also a kind of competition to get the fastest sldiding attack getter, and the most dense tables...

Gerd
Gerd Isenberg
 
Posts: 285
Joined: 31 Jan 2005, 20:31
Location: Hattingen, Germany

Re: Fastest Magic Move Bitboard Generator ready to use

Postby Pradu » 26 Aug 2006, 00:11

Gerd Isenberg wrote:Your approach reduces the size considerable - specially for the bishops. I don't like the additional indirection with the pointer array, plus variable shift array, so much. Whether the two additional accesses, even if L1, pay off, is the question. If you pick some part from your 64KByte L1 cache, from ~1MB or 4MB tables, probably doesn't matter so much. There are eventually L2/L3-cache and page issues. Guess Lasse's approach with larger dense, but homogenious arrays, and constant shift is faster.


Sorry Gerd, I do not understand the assembler that well; but I've taken your written suggestions and produced another version of magicmoves (currently uploaded). In this one you can define MINIMIZE_MAGIC in the header file and produce 841kb databases. Otherwise 2304kb databases will be used. By the way, did you use a 64 bit compile or a 32 bit compile for your tests? It would be nice to see a comparison between these and between MINIMIZE_MAGIC and no MINIMIZE_MAGIC and to see if the increase (mabe even decrease, I'm not sure) in speed is truly worth the extra size.

Gerd Isenberg wrote:So far Pradu's code is only about to get rook- and bishop-attack-sets by square index of the slider and the occupied bitboard.


And it intends to remain that way :mrgreen: .

Uri wrote:It does not seem to me a move generator and maybe it is only a part of a move generator.
It's the main (and common) part of the move generator for bitboards. It isn't actually the move generator.

Tests in a complete bitboard engine would be better than perft I guess.
User avatar
Pradu
 
Posts: 343
Joined: 12 Jan 2005, 19:17
Location: Chandler, Arizona, USA

Re: Fastest Magic Move Bitboard Generator ready to use

Postby Dann Corbit » 26 Aug 2006, 03:15

Uri Blass wrote:
Pradu wrote:Downoad the magic move bitboard generator here:

www.prism.gatech.edu/~gtg365v/Buzz

Because I don't have the necessary hardware (I only have a P4), I would appriciate it if someone could test how fast it is on 32/64 bit processors and comparisions in speed to other move bitboard generators. The code is made in such way that it can be integrated into any existing bitboard chess program very easily.


"You can use this to generate move bitboards for your own chess engine if you wish"

In other words people can start their own chess programs from a foreign move geneator.

I dislike that idea and I wonder if it is not going to be considered as a clone in tournaments.

If we let people to use that move generator in tournaments then tomorrow somebody is going to write some good evaluation function and write:
"You can use this to generate evaluation for your own chess engine if you wish"

After it continues in this way the total result is going to be that people are going to have non original chess engines that are considered as their engines.

Uri


But where did we get alpha-beta?

How about pvs? The history reductions idea from Fruit and Glaurung?

I don't see any harm in it. But I also see the other side where some people get really mad about it (e.g. Smirf author breathes fire about showing other people your code).

There seems to be a groundswell in computer chess that you have to write everything from scratch, but we always borrow ideas at least.

I also agree that there is greater value in reading someone's paper and/or code and understanding the idea then implementing that idea instead of using their code. But some bits of writing a chess engine (e.g the Winboard interface or UCI part) are just tiring crap, that a patient monkey could finish. Now, there are grey lines where if you borrow too much stuff, the engine is not your own.

But can you say that every idea in your engine is totally original? Anyone who says that is not truthful. Or, if they are truthful, they will have a crappy engine unless they are Knuth or Sedgewick or some other freak of nature.

IMO-YMMV.
Dann Corbit
 

Re: Fastest Magic Move Bitboard Generator ready to use

Postby Dann Corbit » 26 Aug 2006, 03:19

Gerd Isenberg wrote:I agree that sharing ideas or pseudocode has another quality than sharing source code, conveniently to try. I also don't like the open source idea too much - and have problems with concrete frameworks, programs and even perft with unclear license conditions...

So far Pradu's code is only about to get rook- and bishop-attack-sets by square index of the slider and the occupied bitboard. There are the classical rotated approaches, Lasse's approach and zillion others doing the same. All well known. I think this is still an atomic function, somehow on the same level like bitscan and popcount or Kogge-Stone fill routines. This is also a kind of competition to get the fastest sldiding attack getter, and the most dense tables...

Gerd


If we stop sharing ideas, then the competition has become too intense to be of the great value that gentle competition has.

Suppose that Wirth had hidden the ideas for quicksort, or Shannon never told us about his chess ideas. Eventually, we will probably discover them anyway, but it will retard us 20 years, especially if everyone else is doing the same thing. Is it really worth it to write a faster program than someone else and set the computer science field back so that we can cross the finish line ahead of the others?

Just a viewpoint.

P.S.
You are a person who openly shares his ideas all the time, so no complaints here. I am just shouting into the air.
Dann Corbit
 

Re: Fastest Magic Move Bitboard Generator ready to use

Postby Pradu » 26 Aug 2006, 06:40

I've uploaded another modification to magicmoves. This one will express the macros as inlined functions if USE_INLINING is defined in the header file. Still looking forward to performances vs other move generators (and versus variations of itself) on 64-bit computers with 32/64 bit compiles. 8-)
User avatar
Pradu
 
Posts: 343
Joined: 12 Jan 2005, 19:17
Location: Chandler, Arizona, USA

Re: Fastest Magic Move Bitboard Generator ready to use

Postby Uri Blass » 26 Aug 2006, 08:40

Dann Corbit wrote:
Uri Blass wrote:
Pradu wrote:Downoad the magic move bitboard generator here:

www.prism.gatech.edu/~gtg365v/Buzz

Because I don't have the necessary hardware (I only have a P4), I would appriciate it if someone could test how fast it is on 32/64 bit processors and comparisions in speed to other move bitboard generators. The code is made in such way that it can be integrated into any existing bitboard chess program very easily.


"You can use this to generate move bitboards for your own chess engine if you wish"

In other words people can start their own chess programs from a foreign move geneator.

I dislike that idea and I wonder if it is not going to be considered as a clone in tournaments.

If we let people to use that move generator in tournaments then tomorrow somebody is going to write some good evaluation function and write:
"You can use this to generate evaluation for your own chess engine if you wish"

After it continues in this way the total result is going to be that people are going to have non original chess engines that are considered as their engines.

Uri


But where did we get alpha-beta?

How about pvs? The history reductions idea from Fruit and Glaurung?

I don't see any harm in it. But I also see the other side where some people get really mad about it (e.g. Smirf author breathes fire about showing other people your code).

There seems to be a groundswell in computer chess that you have to write everything from scratch, but we always borrow ideas at least.

I also agree that there is greater value in reading someone's paper and/or code and understanding the idea then implementing that idea instead of using their code. But some bits of writing a chess engine (e.g the Winboard interface or UCI part) are just tiring crap, that a patient monkey could finish. Now, there are grey lines where if you borrow too much stuff, the engine is not your own.

But can you say that every idea in your engine is totally original? Anyone who says that is not truthful. Or, if they are truthful, they will have a crappy engine unless they are Knuth or Sedgewick or some other freak of nature.

IMO-YMMV.


sharing idea is not a problem and the problem is sharing chess playing code except code that is about nalimov tablebases.

In the case of history reductions
I had history reductions based on history counters before fruit and glaurung.

I later added pruning based on ideas in fruit code but I had bugs in my code and I am not sure if this made movei better.

later I decided to delete history counters because it made the code too complicated so today I have only late move reductions that are not based on history counters even if it did not make movei stronger.

I agree that not everything should be written from scratch
I think that there is no problem with using code that is not chess code(like winboard interface code or generating random numbers)

I also think that it may be possible to use some small functions
that Gerd posted in the past but full move generator seems to me something that should not be used as copy and paste in another engine and I initially did not understand that the code that was posted is not close to be full move generator and the title gave me wrong impression.

Uri
User avatar
Uri Blass
 
Posts: 727
Joined: 09 Oct 2004, 05:59
Location: Tel-Aviv

Re: Fastest Magic Move Bitboard Generator ready to use

Postby Pradu » 26 Aug 2006, 08:49

Uri wrote:the title gave me wrong impression


Yes, I guess Magic _Move Bitboard_ Generator and Magic _Bitboard Move_ Generator could be a little confusing. :)
User avatar
Pradu
 
Posts: 343
Joined: 12 Jan 2005, 19:17
Location: Chandler, Arizona, USA

Re: Fastest Magic Move Bitboard Generator ready to use

Postby Gerd Isenberg » 26 Aug 2006, 09:29

Pradu wrote:
Sorry Gerd, I do not understand the assembler that well; but I've taken your written suggestions and produced another version of magicmoves (currently uploaded). In this one you can define MINIMIZE_MAGIC in the header file and produce 841kb databases. Otherwise 2304kb databases will be used. By the way, did you use a 64 bit compile or a 32 bit compile for your tests? It would be nice to see a comparison between these and between MINIMIZE_MAGIC and no MINIMIZE_MAGIC and to see if the increase (mabe even decrease, I'm not sure) in speed is truly worth the extra size.


Ok, this is 64-bit assembly, generated by msvc2005 c++ compiler. I think 32-bit mode doesn't make much sense with 64-bit multiplication - a call with three 32-bit imuls. For 32-bit mode one has to find some folding tricks to get the occupied state by mul * magig32 >> shift32.

If you see some x86 assembly with r8..r15 or r8d...r15d and rxx instead of exx, it is 64-bit mode. All address or index registers are the 64-bit registers, while for arithmetical/logical operations the target register may still be the 32-bit registers, since 32-bit (int) is still the default openrand size.

In 64-bit mode, function parameter (up to three for msvc) are passed via registers, rcx, rdx, r8? or something (fastcall). Return is per convention via rax (64 bit) or eax (32-bit) (even byte/char return is possible via partial low byte register al).

Your rook getter is called with occupied in register rcx and square in edx (32-bit is fine here). I'll give some further remarks on the assembly of your rook getter:

Code: Select all
   mov    eax, edx
; copy square from edx into register eax
; a 32-bit move or operation on a 32-bit regsister implicitly clears the
; upper 32-bit of the "shadowed" 64-bit register rax, which is further used
; to index the arrays with various scaling (rax*4, rax*8) for int or U64 arrays.


   lea    r8, OFFSET FLAT:__ImageBase
; hmm..., in 64-bit mode we need a base register to access globals!

   mov    rdx, QWORD PTR ?magicmoves_r_mask@@3QB_KB[r8+rax*8]
; get the magicmoves_r_mask[BASE + 8*sq] into rdx
; the index scaling rax*8 is required because we need to index 8-byte items


   and    rdx, rcx
; rdx := rdx & rcx -> mask := mask & occupied

   movzx    ecx, BYTE PTR ?magicmoves_r_shift@@3QBHB[r8+rax*4]
; get the shift amount, magicmoves_r_shift[BASE + 4*sq]
; here the scaling is 4 due to 4-byte integer size
; the strange thing is that the compiler takes a byte-move with zero extension to
; 32-bit (movzx), rather than the shorter and faster mov ecx, DWORD PTR ..
; variable shift on x86 is only possible with the byte register cl, which
; is lower part of ecx or even rcx.

   imul    rdx, QWORD PTR ?magicmoves_r_magics@@3QB_KB[r8+rax*8]
; the 64-bit multiplication rdx := rdx * memory-operand
; the masked occupied in rdx is multiplied with
; magicmoves_r_magics[BASE + 8*square]. The product is stored in
; source1/target rdx-register again, so that the initial mask, no longer needed,
; is overwritten.

   mov    rax, QWORD PTR ?magicmoves_r_indecies@@3PAPEB_KA[r8+rax*8]
; get the pointer to the array base into by 8-byte scaled square (rax*8) into rax,
; so square formerly in rax is overwitten

   shr    rdx, cl
; the 64-bit right shift of the product, to get the occupied index

   mov    rax, QWORD PTR [rax+rdx*8]
; get the final attack-set *(ptr + occpied), again scaling of 8

   ret    0
; return to the caller

I will report on not MINIMIZE_MAGIC later...

Gerd
Gerd Isenberg
 
Posts: 285
Joined: 31 Jan 2005, 20:31
Location: Hattingen, Germany

Re: Fastest Magic Move Bitboard Generator ready to use

Postby Gerd Isenberg » 26 Aug 2006, 16:13

Congrats Pradu, seems your MINIMIZE_MAGIC is really a lot faster, not to say fastest, specially for the rooks in this snooby framework.
Puhh - i was dead wrong with my guess the homogenious arrays may faster, as the pure assembly below might suggest.
I did not expect that extreme slowdown with the huge arrays. They totally kill the rook/queen performance - at least with this unrealistic occupied pattern, snoobing 5-bit around and traversing squares from 0..63. Of course, exchanging square and occupied-state dimensions in the databases would leave a much better cache performance here, since sq is traversed in the inner loop. So i tried to relax the square variance by "anding" the square parameter of the attack-getter with 1,3,7,15,and 31 to see how memory behaves. For the sake of completeness i missused kogge-stone attacks (gp-registers) with additional 1<<sq:

All times in ns per call (+overhead), this is an AMD 64X2 4800+ 2.4GHz:

Code: Select all
Rook attack        sq&1   sq&3   sq&7   sq&15 sq&31   sq
KoggeStone        15.177 15.177 15.177 15.177 15.177 15.177
DenseFirstRank     6.660  7.269  7.046  7.078  7.046  6.660
!MINIMIZE_MAGIC    4.740  5.476  6.115  6.498  8.068 18.700
 MINIMIZE_MAGIC    4.420  4.867  5.379  5.699  5.730  6.084   
Bishop attack
KoggeStone        16.460 16.460 16.460 16.460 16.460 16.460
DenseFirstRank     7.908  7.429  7.398  7.429  7.396  7.589
!MINIMIZE_MAGIC    4.740  4.547  4.482  4.484  4.996  5.252
 MINIMIZE_MAGIC    4.420  4.484  4.451  4.388  4.388  4.162
Queen attack
KoggeStone        30.133 30.133 30.133 30.133 30.133 30.133
DenseFirstRank    11.527 10.566 10.952 10.952 10.952 11.945
!MINIMIZE_MAGIC    6.437  8.390  7.972  8.837 12.745 32.148
 MINIMIZE_MAGIC    6.084  6.660  7.365  7.173  8.101  8.486


Again, in a concrete chess program with real positions, things may behave completely different... But it somehow demonstrates, how volatile huge memory may behave, and what may happen if there are a lot of other memory traffic around and L1 starts to pollute...

Here the generated 64-bit assembly of Lasse's rook-getter with three instead of five memory accesses. The shl 12 is necessary to scale the square index since scale factors in the address operand are only possible with *1,2,4,8.

Code: Select all
occ$ = 8
sq$ = 16
?LasseRookAttacks@@YA_K_KI@Z PROC         ; LasseRookAttacks, COMDAT
  00000   44 8b c2               mov    r8d, edx
  00003   48 8d 15 00 00 00 00      lea    rdx, OFFSET FLAT:__ImageBase
  0000a   4a 8b 84 c2 00 00 00 00      mov    rax, QWORD PTR ?magicmoves_r_mask@@3QB_KB[rdx+r8*8]
  00012   48 23 c1               and    rax, rcx
  00015   4a 0f af 84 c2 00 00 00 00   imul    rax, QWORD PTR ?magicmoves_r_magics@@3QB_KB[rdx+r8*8]
  0001e   49 c1 e0 0c                shl    r8, 12
  00022   48 c1 e8 34                shr    rax, 52
  00026   49 03 c0                add    rax, r8
  00029   48 8b 84 c2 00 00 00 00      mov    rax, QWORD PTR ?magicmovesrdb@@3PAY0BAAA@_KA[rdx+rax*8]
  00031   c3                     ret    0
?LasseRookAttacks@@YA_K_KI@Z ENDP         ; LasseRookAttacks

This is the current snooby, which indicates that 9-cycle vector path bsf is slightly superior than deBruijn mul with lookup - less register pressure and of course shorter code.
Code: Select all
#include <intrin.h>

u64 snoob (u64 x) {
   unsigned long lidx;
   _BitScanForward64(&lidx, x);
   u64 ls1b = x & -(i64)x;
   u64 ripp = x + ls1b;
   u64 ones = x ^ ripp;
   u64 sone = ((ones >> 2) >> lidx);
   return ripp + sone;
}

x$ = 8
?snoob@@YA_K_K@Z PROC      ; snoob, COMDAT
  00000   4c 8b c1    mov    r8, rcx
  00003   48 0f bc d1    bsf    rdx, rcx
  00007   49 f7 d8    neg    r8
  0000a   4c 23 c1    and    r8, rcx
  0000d   4c 03 c1    add    r8, rcx
  00010   49 8b c0    mov    rax, r8
  00013   48 33 c1    xor    rax, rcx
  00016   8b ca       mov    ecx, edx
  00018   48 c1 e8 02    shr    rax, 2
  0001c   48 d3 e8    shr    rax, cl
  0001f   49 03 c0    add    rax, r8
  00022   c3       ret    0
?snoob@@YA_K_K@Z ENDP      ; snoob

Gerd
Gerd Isenberg
 
Posts: 285
Joined: 31 Jan 2005, 20:31
Location: Hattingen, Germany

Re: Fastest Magic Move Bitboard Generator ready to use

Postby Pradu » 26 Aug 2006, 18:40

Very interesting; mabe a perfect hash by indirection will be even faster! The perfect hash will take ~50kb for the moves and the indexing table will take ~100kb with chars, ~200kb with shorts, ~400 kb using ints for MINIMIZE_MAGIC or ~250kb with chars, ~500kb with shorts, ~1 MB using ints for no MINIMIZE_MAGIC. So a minimal move database (rookMoves[4900] bishopMoves[1428] can be made into as small as 150kb. I'll start implementing this and lets see if it turns out to be even faster. But I guess we cannot infer performance in a real engine by just using this test. Nevertheless, I'll leave all possibilities open in magicmoves.
User avatar
Pradu
 
Posts: 343
Joined: 12 Jan 2005, 19:17
Location: Chandler, Arizona, USA

Re: Fastest Magic Move Bitboard Generator ready to use

Postby Gerd Isenberg » 26 Aug 2006, 20:57

Pradu wrote:Very interesting; mabe a perfect hash by indirection will be even faster! The perfect hash will take ~50kb for the moves and the indexing table will take ~100kb with chars, ~200kb with shorts, ~400 kb using ints for MINIMIZE_MAGIC or ~250kb with chars, ~500kb with shorts, ~1 MB using ints for no MINIMIZE_MAGIC. So a minimal move database (rookMoves[4900] bishopMoves[1428] can be made into as small as 150kb. I'll start implementing this and lets see if it turns out to be even faster. But I guess we cannot infer performance in a real engine by just using this test. Nevertheless, I'll leave all possibilities open in magicmoves.

Yes, that might be interesting and even required to eventually index precalculated move-lists. I mean not passing occupied boards, but rook/bishop attack (sub)-sets to safe bitscan loops. But probably that becomes easier with disjoint file, rank and diagonals directions. What about hashing pawn push and pawn capture targets of a pair of two ranks in one run? And the up to eight knight and king targets? To make a complete branch- and loopless move-generator? ;-)
Gerd Isenberg
 
Posts: 285
Joined: 31 Jan 2005, 20:31
Location: Hattingen, Germany

Re: Fastest Magic Move Bitboard Generator ready to use

Postby Pradu » 27 Aug 2006, 02:32

Gerd Isenberg wrote:Yes, that might be interesting and even required to eventually index precalculated move-lists. I mean not passing occupied boards, but rook/bishop attack (sub)-sets to safe bitscan loops. But probably that becomes easier with disjoint file, rank and diagonals directions. What about hashing pawn push and pawn capture targets of a pair of two ranks in one run? And the up to eight knight and king targets? To make a complete branch- and loopless move-generator? ;-)


It will take up a good deal of memory for a branch- and loopless move-generator :). Anyways char's can't be used as indecies, too small :(. I'm using shorts. I've posted a perfect-hashing database generator but it dosen't work in conjunction with MINIMIZE_MAGICS yet so it is taking up ~550k. In this one you must have MINIMIZE_MAGICS commented out and the following defined in the header file:
#define PERFECT_MAGIC_HASH unsigned short

I'll work on compressing the indexing database to bring the total size of the databases down to ~250k.
User avatar
Pradu
 
Posts: 343
Joined: 12 Jan 2005, 19:17
Location: Chandler, Arizona, USA

Re: Fastest Magic Move Bitboard Generator ready to use

Postby Andreas Guettinger » 27 Aug 2006, 12:56

Ok, I tried this out in my bitboard move generator. I didn't try the MINIMIZE_MAGIC part because I don't understand what it does.

Test computer was a Apple Power Mac G5 2.5 Ghz.
32-bit compile: -Wall -O2
64-bit compile: -Wall -O2 -mcpu=G5 -mtune=G5 -fast
Test engine was AnimaX.

Code: Select all
32-bit:
-------
              A           B           C
            
Normal:     34.63 s     7.63 s      10.91 s         
Rotated:    20.42 s     3.72 s       9.63 s
Magic:      16.35 s     3.48 s       7.02 s


64-bit:
-------
               A          B           C
            
Normal:     19.87 s     4.41 s       7.23 s         
Rotated:    14.03 s     2.47 s       6.65 s
Magic:      11.25 s     2.46 s       5.25 s

Explanations:

A) Time to perft 6 of Anima
B) Time to gererate 80000000 moves
C) Time to Generate+make+undo 80000000 moves


Interestingly the speed of the magicmoves generator on my machine was about the same than my rotated bitboard move generator, but the makemove undomove was much faster do to not having to update the rotated occupied squares bitmaps. All in all for perft there was a considerable speed gain.

Note that the perft function of Anima doesn't use any tricks or hashtables and does all the make/undo moves even at the leave nodes. It also generates captures and noncaptures seperately for every node and concatenates the two move lists together, in short, it's not designed for speed.
regards,
Andy
Andreas Guettinger
 
Posts: 20
Joined: 26 Aug 2006, 23:37
Location: Bern, Switzerland

Re: Fastest Magic Move Bitboard Generator ready to use

Postby Gerd Isenberg » 27 Aug 2006, 14:25

Andreas Guettinger wrote:Ok, I tried this out in my bitboard move generator. I didn't try the MINIMIZE_MAGIC part because I don't understand what it does.

You don't have this in the first line of magicmoves.h?
Code: Select all
#define MINIMIZE_MAGIG

Then you didn't use the dense 841Kb lookup but the huge 2304kb lookup? MINIMIZE_MAGIC uses almost different factors and shift-amount per square.

Not defining MINIMIZE_MAGIC is like Lasse Hansen's approach, each square, to get a 12 bit address for all possible (12..14) rook-occupied states:
Code: Select all
pattern := occupied & rookmask [sq]
address := (pattern * rookMagig[sq]) >> (64-12)
attacks := rookAttacks[sq][address]

or with a one-dimensional array
Code: Select all
u64 rookAttacks[64*4096]
sqBase  := sq<<12
attacks := rookAttacks[sqBase + address]

Pradu took advantage of the fact, that a lot of squares require less than 12-bit adressess to get rook attacks for all relevant occupied bits and was able to dense the 2MB to 800KByte for rooks, and each square-lookup has a different shift amount and an own slot with almost different size in the one-dimensional array:
Code: Select all
pattern := occupied & rookmask [sq];
address := (pattern * rookMagig[sq]) >> rookShift[sq] // 9..12 bit address
sqBase  := rooksOffset[sq]  // was sq<<12 on Lasse's approach
attacks := rookAttacks[sqBase + address]
I was skeptical in the beginning that this two additional memory accesses, to dense 2MByte to 800KByte for rooks or 256KByte to 41KByte for bishops pays off - but it seems i was wrong as my dumb test-framework suggests. I guess also tlb pollution becomes an issue with a 2MB table - tlb maps virtual addresses of 4KByte pages to physical addresses. So if you simply #define MINIMIZE_MAGIC in the header file and test again...

Gerd
Gerd Isenberg
 
Posts: 285
Joined: 31 Jan 2005, 20:31
Location: Hattingen, Germany

Re: Fastest Magic Move Bitboard Generator ready to use

Postby Andreas Guettinger » 27 Aug 2006, 15:23

Gerd Isenberg wrote:So if you simply #define MINIMIZE_MAGIC in the header file and test again...

Gerd


Ahem, I didn't get the point that to uncomment the define is enough. Anyway, I uncommented #define MINIMIZE_MAGIC, make clean, and make again but the result didn't change much. Maybe in a full search there is a difference.

Code: Select all
32-bit:
-------
              A           B           C
            
Normal:     34.63 s     7.63 s      10.91 s         
Rotated:    20.42 s     3.72 s       9.63 s
Magic:      16.35 s     3.48 s       7.02 s
Minimize:   16.86 s     3.61 s       7.09 s

64-bit:
-------
               A          B           C
            
Normal:     19.87 s     4.41 s       7.23 s         
Rotated:    14.03 s     2.47 s       6.65 s
Magic:      11.25 s     2.46 s       5.25 s
Minimize:   11.27 s     2.43 s       5.25 s


Explanations:

A) Time to perft 6 of Anima
B) Time to gererate 80000000 moves
C) Time to Generate+make+undo 80000000 moves
regards,
Andy
Andreas Guettinger
 
Posts: 20
Joined: 26 Aug 2006, 23:37
Location: Bern, Switzerland

Re: Fastest Magic Move Bitboard Generator ready to use

Postby Pradu » 27 Aug 2006, 19:23

Nice to hear it is doing as good or better than the rotated approach in both 32-bits and 64-bits! Magic bitboard generators are also a lot more versitile than rotated bitboad generators. Take for example the issue of X-ray attacks.

Code: Select all
attacks=Rmoves(square,occupancy);
xrayattacks=Rmoves(square,(occupancy&~attacks)) & ~attacks


It isnt possible to do something like this in rotated.

I found a bug in the PEFRECT_HASH approach, fixing.
User avatar
Pradu
 
Posts: 343
Joined: 12 Jan 2005, 19:17
Location: Chandler, Arizona, USA

Next

Return to Programming and Technical Discussions

Who is online

Users browsing this forum: No registered users and 6 guests