Winboard Forum

by **Pradu** » 25 Aug 2006, 16:37

Downoad the magic move bitboard generator here:

www.prism.gatech.edu/~gtg365v/Buzz

Because I don't have the necessary hardware (I only have a P4), I would appriciate it if someone could test how fast it is on 32/64 bit processors and comparisions in speed to other move bitboard generators. The code is made in such way that it can be integrated into any existing bitboard chess program very easily.

by **Gerd Isenberg** » 25 Aug 2006, 22:15

Pradu wrote:Downoad the magic move bitboard generator here:

www.prism.gatech.edu/~gtg365v/Buzz

Because I don't have the necessary hardware (I only have a P4), I would appriciate it if someone could test how fast it is on 32/64 bit processors and comparisions in speed to other move bitboard generators. The code is made in such way that it can be integrated into any existing bitboard chess program very easily.

Thanks for the source, very convenient to try, if the bitboard mapping fits ;-)

Your approach reduces the size considerable - specially for the bishops. I don't like the additional indirection with the pointer array, plus variable shift array, so much. Whether the two additional accesses, even if L1, pay off, is the question. If you pick some part from your 64KByte L1 cache, from ~1MB or 4MB tables, probably doesn't matter so much. There are eventually L2/L3-cache and page issues. Guess Lasse's approach with larger dense, but homogenious arrays, and constant shift is faster.

A very quick, dump and "unfair" snooby-loop test, against my dense approach, where i only need 1.5KByte against your ~1MB:

Time1 is my 1.5KByte approach, time2 your one, all inlined, mvc2005:

Code: Select all: Rook: time1 in sec = 3.266 runs 7624512*64 ~ 6.693 ns time2 in sec = 2.922 runs 7624512*64 ~ 5.988 ns Bishop: time1 in sec = 3.703 runs 7624512*64 ~ 7.589 ns time2 in sec = 2.031 runs 7624512*64 ~ 4.162 ns Queen: time1 in sec = 5.875 runs 7624512*64 ~12.040 ns time2 in sec = 4.109 runs 7624512*64 ~ 8.421 ns

Code: Select all: int main(int argc, char* argv[]) { u64 atta=0, first = 0x1f, x, y, cnt; clock_t start, stop; init(); // my stuff initmagicmoves(); start = clock(); for (x = first, cnt=1; (y = snoob(x)) > x; x = y, cnt++) for (int sq = 0; sq < 64; sq++) atta ^= rookAttacks(y, (u32)sq); stop = clock(); printf("time1 in sec = %6.3f runs %lld*64 ~%6.3f ns\n", (float)(stop-start)/CLOCKS_PER_SEC, cnt, ((float)(stop-start)/64*1000000.0) / cnt); start = clock(); for (x = first, cnt=1; (y = snoob(x)) > x; x = y, cnt++) for (int sq = 0; sq < 64; sq++) atta ^= PraduRookAttacks(y, sq); stop = clock(); printf("time2 in sec = %6.3f runs %lld*64 ~%6.3f ns\n", (float)(stop-start)/CLOCKS_PER_SEC, cnt, ((float)(stop-start)/64*1000000.0) / cnt); if ( atta != 0 ) { printf("oups, something is broken\n"); } return 0; }

Code: Select all: ; Function compile flags: /Ogtpy occ$ = 8 sq$ = 16 ?PraduRookAttacks@@YA_K_KI@Z PROC ; PraduRookAttacks, COMDAT 00000 8b c2 mov eax, edx 00002 4c 8d 05 00 00 00 00 lea r8, OFFSET FLAT:__ImageBase 00009 49 8b 94 c0 00 00 00 00 mov rdx, QWORD PTR ?magicmoves_r_mask@@3QB_KB[r8+rax*8] 00011 48 23 d1 and rdx, rcx 00014 41 0f b6 8c 80 00 00 00 00 movzx ecx, BYTE PTR ?magicmoves_r_shift@@3QBHB[r8+rax*4] 0001d 49 0f af 94 c0 00 00 00 00 imul rdx, QWORD PTR ?magicmoves_r_magics@@3QB_KB[r8+rax*8] 00026 49 8b 84 c0 00 00 00 00 mov rax, QWORD PTR ?magicmoves_r_indecies@@3PAPEB_KA[r8+rax*8] 0002e 48 d3 ea shr rdx, cl 00031 48 8b 04 d0 mov rax, QWORD PTR [rax+rdx*8] 00035 c3 ret 0 ?PraduRookAttacks@@YA_K_KI@Z ENDP ; PraduRookAttacks

Code: Select all: ; Function compile flags: /Ogtpy occ$ = 16 sq$ = 24 ?rookAttacks@@YA_K_KI@Z PROC ; rookAttacks, COMDAT 00000 40 53 push rbx 00002 44 8b c2 mov r8d, edx 00005 44 8b da mov r11d, edx 00008 4c 8b d1 mov r10, rcx 0000b 8b ca mov ecx, edx 0000d 49 8b d2 mov rdx, r10 00010 48 b8 01 01 01 01 01 01 01 01 mov rax, 0101010101010101H 0001a 83 e1 07 and ecx, 7 0001d 41 83 e3 f8 and r11d, -8 00021 48 8d 1d 00 00 00 00 lea rbx, OFFSET FLAT:?firstRankAttacks@@3PAY07EA 00028 48 d3 ea shr rdx, cl 0002b 80 f1 07 xor cl, 7 0002e 48 23 d0 and rdx, rax 00031 48 b8 00 04 08 10 20 40 80 00 mov rax, 0080402010080400H 0003b 48 0f af d0 imul rdx, rax 0003f 49 8b c0 mov rax, r8 00042 48 83 f0 38 xor rax, 56 00046 48 c1 e8 03 shr rax, 3 0004a 48 c1 ea 3a shr rdx, 58 0004e 48 8d 04 d0 lea rax, QWORD PTR [rax+rdx*8] 00052 41 8a d0 mov dl, r8b 00055 44 0f b6 0c 18 movzx r9d, BYTE PTR [rax+rbx] 0005a 48 b8 01 02 04 08 10 20 40 80 mov rax, 8040201008040201H 00064 83 e2 07 and edx, 7 00067 4c 0f af c8 imul r9, rax 0006b 48 b8 80 80 80 80 80 80 80 80 mov rax, 8080808080808080H 00075 4c 23 c8 and r9, rax 00078 49 d3 e9 shr r9, cl 0007b 41 8d 4b 01 lea ecx, DWORD PTR [r11+1] 0007f 49 d3 ea shr r10, cl 00082 41 8b cb mov ecx, r11d 00085 41 83 e2 3f and r10d, 63 00089 4e 8d 04 d2 lea r8, QWORD PTR [rdx+r10*8] 0008d 41 0f b6 04 18 movzx eax, BYTE PTR [r8+rbx] 00092 48 d3 e0 shl rax, cl 00095 49 0b c1 or rax, r9 00098 5b pop rbx 00099 c3 ret 0 ?rookAttacks@@YA_K_KI@Z ENDP ; rookAttacks

Gerd

by **Uri Blass** » 25 Aug 2006, 22:58

Pradu wrote:Downoad the magic move bitboard generator here:

www.prism.gatech.edu/~gtg365v/Buzz

Because I don't have the necessary hardware (I only have a P4), I would appriciate it if someone could test how fast it is on 32/64 bit processors and comparisions in speed to other move bitboard generators. The code is made in such way that it can be integrated into any existing bitboard chess program very easily.

"You can use this to generate move bitboards for your own chess engine if you wish"

In other words people can start their own chess programs from a foreign move geneator.

I dislike that idea and I wonder if it is not going to be considered as a clone in tournaments.

If we let people to use that move generator in tournaments then tomorrow somebody is going to write some good evaluation function and write:
"You can use this to generate evaluation for your own chess engine if you wish"

After it continues in this way the total result is going to be that people are going to have non original chess engines that are considered as their engines.

Uri

by **Uri Blass** » 25 Aug 2006, 23:25

I downloaded the code and I cannot check the speed of it in a task like calculating perft.

I do not see code to initialize position based on fen or even only to initialize the opening position and I do not know if there is a code to generate knight or pawn moves(I did not try to understand the code but the word knight is not mentioned there)

It does not seem to me a move generator and maybe it is only a part of a move generator.

Note that I have not a bitboard program in the first place but
I was interested in checking speed.

If people need to add some other bitboard code in order to use your generator then they check the speed of a combination of your code with another code and not the speed of your code.

Uri

by **Gerd Isenberg** » 25 Aug 2006, 23:45

I agree that sharing ideas or pseudocode has another quality than sharing source code, conveniently to try. I also don't like the open source idea too much - and have problems with concrete frameworks, programs and even perft with unclear license conditions...

So far Pradu's code is only about to get rook- and bishop-attack-sets by square index of the slider and the occupied bitboard. There are the classical rotated approaches, Lasse's approach and zillion others doing the same. All well known. I think this is still an atomic function, somehow on the same level like bitscan and popcount or Kogge-Stone fill routines. This is also a kind of competition to get the fastest sldiding attack getter, and the most dense tables...

Gerd

by **Pradu** » 26 Aug 2006, 00:11

Gerd Isenberg wrote:Your approach reduces the size considerable - specially for the bishops. I don't like the additional indirection with the pointer array, plus variable shift array, so much. Whether the two additional accesses, even if L1, pay off, is the question. If you pick some part from your 64KByte L1 cache, from ~1MB or 4MB tables, probably doesn't matter so much. There are eventually L2/L3-cache and page issues. Guess Lasse's approach with larger dense, but homogenious arrays, and constant shift is faster.

Sorry Gerd, I do not understand the assembler that well; but I've taken your written suggestions and produced another version of magicmoves (currently uploaded). In this one you can define MINIMIZE_MAGIC in the header file and produce 841kb databases. Otherwise 2304kb databases will be used. By the way, did you use a 64 bit compile or a 32 bit compile for your tests? It would be nice to see a comparison between these and between MINIMIZE_MAGIC and no MINIMIZE_MAGIC and to see if the increase (mabe even decrease, I'm not sure) in speed is truly worth the extra size.

Gerd Isenberg wrote:So far Pradu's code is only about to get rook- and bishop-attack-sets by square index of the slider and the occupied bitboard.

And it intends to remain that way :mrgreen:

.

Uri wrote:It does not seem to me a move generator and maybe it is only a part of a move generator.

It's the main (and common) part of the move generator for bitboards. It isn't actually the move generator.

Tests in a complete bitboard engine would be better than perft I guess.

by **Dann Corbit** » 26 Aug 2006, 03:15

Uri Blass wrote:
Pradu wrote:Downoad the magic move bitboard generator here:

www.prism.gatech.edu/~gtg365v/Buzz

Because I don't have the necessary hardware (I only have a P4), I would appriciate it if someone could test how fast it is on 32/64 bit processors and comparisions in speed to other move bitboard generators. The code is made in such way that it can be integrated into any existing bitboard chess program very easily.

"You can use this to generate move bitboards for your own chess engine if you wish"

In other words people can start their own chess programs from a foreign move geneator.

I dislike that idea and I wonder if it is not going to be considered as a clone in tournaments.

If we let people to use that move generator in tournaments then tomorrow somebody is going to write some good evaluation function and write:
"You can use this to generate evaluation for your own chess engine if you wish"

After it continues in this way the total result is going to be that people are going to have non original chess engines that are considered as their engines.

Uri

But where did we get alpha-beta?

How about pvs? The history reductions idea from Fruit and Glaurung?

I don't see any harm in it. But I also see the other side where some people get really mad about it (e.g. Smirf author breathes fire about showing other people your code).

There seems to be a groundswell in computer chess that you have to write everything from scratch, but we always borrow ideas at least.

I also agree that there is greater value in reading someone's paper and/or code and understanding the idea then implementing that idea instead of using their code. But some bits of writing a chess engine (e.g the Winboard interface or UCI part) are just tiring crap, that a patient monkey could finish. Now, there are grey lines where if you borrow too much stuff, the engine is not your own.

But can you say that every idea in your engine is totally original? Anyone who says that is not truthful. Or, if they are truthful, they will have a crappy engine unless they are Knuth or Sedgewick or some other freak of nature.

IMO-YMMV.

by **Dann Corbit** » 26 Aug 2006, 03:19

Gerd Isenberg wrote:I agree that sharing ideas or pseudocode has another quality than sharing source code, conveniently to try. I also don't like the open source idea too much - and have problems with concrete frameworks, programs and even perft with unclear license conditions...

So far Pradu's code is only about to get rook- and bishop-attack-sets by square index of the slider and the occupied bitboard. There are the classical rotated approaches, Lasse's approach and zillion others doing the same. All well known. I think this is still an atomic function, somehow on the same level like bitscan and popcount or Kogge-Stone fill routines. This is also a kind of competition to get the fastest sldiding attack getter, and the most dense tables...

Gerd

If we stop sharing ideas, then the competition has become too intense to be of the great value that gentle competition has.

Suppose that Wirth had hidden the ideas for quicksort, or Shannon never told us about his chess ideas. Eventually, we will probably discover them anyway, but it will retard us 20 years, especially if everyone else is doing the same thing. Is it really worth it to write a faster program than someone else and set the computer science field back so that we can cross the finish line ahead of the others?

Just a viewpoint.

P.S.
You are a person who openly shares his ideas all the time, so no complaints here. I am just shouting into the air.

by **Pradu** » 26 Aug 2006, 06:40

I've uploaded another modification to magicmoves. This one will express the macros as inlined functions if USE_INLINING is defined in the header file. Still looking forward to performances vs other move generators (and versus variations of itself) on 64-bit computers with 32/64 bit compiles. 8-)

by **Uri Blass** » 26 Aug 2006, 08:40

Dann Corbit wrote:
Uri Blass wrote:
Pradu wrote:Downoad the magic move bitboard generator here:

www.prism.gatech.edu/~gtg365v/Buzz

Because I don't have the necessary hardware (I only have a P4), I would appriciate it if someone could test how fast it is on 32/64 bit processors and comparisions in speed to other move bitboard generators. The code is made in such way that it can be integrated into any existing bitboard chess program very easily.

"You can use this to generate move bitboards for your own chess engine if you wish"

In other words people can start their own chess programs from a foreign move geneator.

I dislike that idea and I wonder if it is not going to be considered as a clone in tournaments.

If we let people to use that move generator in tournaments then tomorrow somebody is going to write some good evaluation function and write:
"You can use this to generate evaluation for your own chess engine if you wish"

After it continues in this way the total result is going to be that people are going to have non original chess engines that are considered as their engines.

Uri

But where did we get alpha-beta?

How about pvs? The history reductions idea from Fruit and Glaurung?

I don't see any harm in it. But I also see the other side where some people get really mad about it (e.g. Smirf author breathes fire about showing other people your code).

There seems to be a groundswell in computer chess that you have to write everything from scratch, but we always borrow ideas at least.

I also agree that there is greater value in reading someone's paper and/or code and understanding the idea then implementing that idea instead of using their code. But some bits of writing a chess engine (e.g the Winboard interface or UCI part) are just tiring crap, that a patient monkey could finish. Now, there are grey lines where if you borrow too much stuff, the engine is not your own.

But can you say that every idea in your engine is totally original? Anyone who says that is not truthful. Or, if they are truthful, they will have a crappy engine unless they are Knuth or Sedgewick or some other freak of nature.

IMO-YMMV.

sharing idea is not a problem and the problem is sharing chess playing code except code that is about nalimov tablebases.

In the case of history reductions
I had history reductions based on history counters before fruit and glaurung.

I later added pruning based on ideas in fruit code but I had bugs in my code and I am not sure if this made movei better.

later I decided to delete history counters because it made the code too complicated so today I have only late move reductions that are not based on history counters even if it did not make movei stronger.

I agree that not everything should be written from scratch
I think that there is no problem with using code that is not chess code(like winboard interface code or generating random numbers)

I also think that it may be possible to use some small functions
that Gerd posted in the past but full move generator seems to me something that should not be used as copy and paste in another engine and I initially did not understand that the code that was posted is not close to be full move generator and the title gave me wrong impression.

Uri

by **Pradu** » 26 Aug 2006, 08:49

Uri wrote:the title gave me wrong impression

Yes, I guess Magic _Move Bitboard_ Generator and Magic _Bitboard Move_ Generator could be a little confusing.

by **Gerd Isenberg** » 26 Aug 2006, 09:29

Pradu wrote:
Sorry Gerd, I do not understand the assembler that well; but I've taken your written suggestions and produced another version of magicmoves (currently uploaded). In this one you can define MINIMIZE_MAGIC in the header file and produce 841kb databases. Otherwise 2304kb databases will be used. By the way, did you use a 64 bit compile or a 32 bit compile for your tests? It would be nice to see a comparison between these and between MINIMIZE_MAGIC and no MINIMIZE_MAGIC and to see if the increase (mabe even decrease, I'm not sure) in speed is truly worth the extra size.

Ok, this is 64-bit assembly, generated by msvc2005 c++ compiler. I think 32-bit mode doesn't make much sense with 64-bit multiplication - a call with three 32-bit imuls. For 32-bit mode one has to find some folding tricks to get the occupied state by mul * magig32 >> shift32.

If you see some x86 assembly with r8..r15 or r8d...r15d and rxx instead of exx, it is 64-bit mode. All address or index registers are the 64-bit registers, while for arithmetical/logical operations the target register may still be the 32-bit registers, since 32-bit (int) is still the default openrand size.

In 64-bit mode, function parameter (up to three for msvc) are passed via registers, rcx, rdx, r8? or something (fastcall). Return is per convention via rax (64 bit) or eax (32-bit) (even byte/char return is possible via partial low byte register al).

Your rook getter is called with occupied in register rcx and square in edx (32-bit is fine here). I'll give some further remarks on the assembly of your rook getter:

Code: Select all: mov eax, edx ; copy square from edx into register eax ; a 32-bit move or operation on a 32-bit regsister implicitly clears the ; upper 32-bit of the "shadowed" 64-bit register rax, which is further used ; to index the arrays with various scaling (rax*4, rax*8) for int or U64 arrays. lea r8, OFFSET FLAT:__ImageBase ; hmm..., in 64-bit mode we need a base register to access globals! mov rdx, QWORD PTR ?magicmoves_r_mask@@3QB_KB[r8+rax*8] ; get the magicmoves_r_mask[BASE + 8*sq] into rdx ; the index scaling rax*8 is required because we need to index 8-byte items and rdx, rcx ; rdx := rdx & rcx -> mask := mask & occupied movzx ecx, BYTE PTR ?magicmoves_r_shift@@3QBHB[r8+rax*4] ; get the shift amount, magicmoves_r_shift[BASE + 4*sq] ; here the scaling is 4 due to 4-byte integer size ; the strange thing is that the compiler takes a byte-move with zero extension to ; 32-bit (movzx), rather than the shorter and faster mov ecx, DWORD PTR .. ; variable shift on x86 is only possible with the byte register cl, which ; is lower part of ecx or even rcx. imul rdx, QWORD PTR ?magicmoves_r_magics@@3QB_KB[r8+rax*8] ; the 64-bit multiplication rdx := rdx * memory-operand ; the masked occupied in rdx is multiplied with ; magicmoves_r_magics[BASE + 8*square]. The product is stored in ; source1/target rdx-register again, so that the initial mask, no longer needed, ; is overwritten. mov rax, QWORD PTR ?magicmoves_r_indecies@@3PAPEB_KA[r8+rax*8] ; get the pointer to the array base into by 8-byte scaled square (rax*8) into rax, ; so square formerly in rax is overwitten shr rdx, cl ; the 64-bit right shift of the product, to get the occupied index mov rax, QWORD PTR [rax+rdx*8] ; get the final attack-set *(ptr + occpied), again scaling of 8 ret 0 ; return to the caller

I will report on not MINIMIZE_MAGIC later...

Gerd

by **Gerd Isenberg** » 26 Aug 2006, 16:13

Congrats Pradu, seems your MINIMIZE_MAGIC is really a lot faster, not to say fastest, specially for the rooks in this snooby framework.
Puhh - i was dead wrong with my guess the homogenious arrays may faster, as the pure assembly below might suggest.
I did not expect that extreme slowdown with the huge arrays. They totally kill the rook/queen performance - at least with this unrealistic occupied pattern, snoobing 5-bit around and traversing squares from 0..63. Of course, exchanging square and occupied-state dimensions in the databases would leave a much better cache performance here, since sq is traversed in the inner loop. So i tried to relax the square variance by "anding" the square parameter of the attack-getter with 1,3,7,15,and 31 to see how memory behaves. For the sake of completeness i missused kogge-stone attacks (gp-registers) with additional 1<<sq:

All times in ns per call (+overhead), this is an AMD 64X2 4800+ 2.4GHz:

Code: Select all: Rook attack sq&1 sq&3 sq&7 sq&15 sq&31 sq KoggeStone 15.177 15.177 15.177 15.177 15.177 15.177 DenseFirstRank 6.660 7.269 7.046 7.078 7.046 6.660 !MINIMIZE_MAGIC 4.740 5.476 6.115 6.498 8.068 18.700 MINIMIZE_MAGIC 4.420 4.867 5.379 5.699 5.730 6.084 Bishop attack KoggeStone 16.460 16.460 16.460 16.460 16.460 16.460 DenseFirstRank 7.908 7.429 7.398 7.429 7.396 7.589 !MINIMIZE_MAGIC 4.740 4.547 4.482 4.484 4.996 5.252 MINIMIZE_MAGIC 4.420 4.484 4.451 4.388 4.388 4.162 Queen attack KoggeStone 30.133 30.133 30.133 30.133 30.133 30.133 DenseFirstRank 11.527 10.566 10.952 10.952 10.952 11.945 !MINIMIZE_MAGIC 6.437 8.390 7.972 8.837 12.745 32.148 MINIMIZE_MAGIC 6.084 6.660 7.365 7.173 8.101 8.486

Again, in a concrete chess program with real positions, things may behave completely different... But it somehow demonstrates, how volatile huge memory may behave, and what may happen if there are a lot of other memory traffic around and L1 starts to pollute...

Here the generated 64-bit assembly of Lasse's rook-getter with three instead of five memory accesses. The shl 12 is necessary to scale the square index since scale factors in the address operand are only possible with *1,2,4,8.

Code: Select all: occ$ = 8 sq$ = 16 ?LasseRookAttacks@@YA_K_KI@Z PROC ; LasseRookAttacks, COMDAT 00000 44 8b c2 mov r8d, edx 00003 48 8d 15 00 00 00 00 lea rdx, OFFSET FLAT:__ImageBase 0000a 4a 8b 84 c2 00 00 00 00 mov rax, QWORD PTR ?magicmoves_r_mask@@3QB_KB[rdx+r8*8] 00012 48 23 c1 and rax, rcx 00015 4a 0f af 84 c2 00 00 00 00 imul rax, QWORD PTR ?magicmoves_r_magics@@3QB_KB[rdx+r8*8] 0001e 49 c1 e0 0c shl r8, 12 00022 48 c1 e8 34 shr rax, 52 00026 49 03 c0 add rax, r8 00029 48 8b 84 c2 00 00 00 00 mov rax, QWORD PTR ?magicmovesrdb@@3PAY0BAAA@_KA[rdx+rax*8] 00031 c3 ret 0 ?LasseRookAttacks@@YA_K_KI@Z ENDP ; LasseRookAttacks

This is the current snooby, which indicates that 9-cycle vector path bsf is slightly superior than deBruijn mul with lookup - less register pressure and of course shorter code.

Code: Select all: #include <intrin.h> u64 snoob (u64 x) { unsigned long lidx; _BitScanForward64(&lidx, x); u64 ls1b = x & -(i64)x; u64 ripp = x + ls1b; u64 ones = x ^ ripp; u64 sone = ((ones >> 2) >> lidx); return ripp + sone; } x$ = 8 ?snoob@@YA_K_K@Z PROC ; snoob, COMDAT 00000 4c 8b c1 mov r8, rcx 00003 48 0f bc d1 bsf rdx, rcx 00007 49 f7 d8 neg r8 0000a 4c 23 c1 and r8, rcx 0000d 4c 03 c1 add r8, rcx 00010 49 8b c0 mov rax, r8 00013 48 33 c1 xor rax, rcx 00016 8b ca mov ecx, edx 00018 48 c1 e8 02 shr rax, 2 0001c 48 d3 e8 shr rax, cl 0001f 49 03 c0 add rax, r8 00022 c3 ret 0 ?snoob@@YA_K_K@Z ENDP ; snoob

Gerd

by **Pradu** » 26 Aug 2006, 18:40

Very interesting; mabe a perfect hash by indirection will be even faster! The perfect hash will take ~50kb for the moves and the indexing table will take ~100kb with chars, ~200kb with shorts, ~400 kb using ints for MINIMIZE_MAGIC or ~250kb with chars, ~500kb with shorts, ~1 MB using ints for no MINIMIZE_MAGIC. So a minimal move database (rookMoves[4900] bishopMoves[1428] can be made into as small as 150kb. I'll start implementing this and lets see if it turns out to be even faster. But I guess we cannot infer performance in a real engine by just using this test. Nevertheless, I'll leave all possibilities open in magicmoves.

by **Gerd Isenberg** » 26 Aug 2006, 20:57

Pradu wrote:Very interesting; mabe a perfect hash by indirection will be even faster! The perfect hash will take ~50kb for the moves and the indexing table will take ~100kb with chars, ~200kb with shorts, ~400 kb using ints for MINIMIZE_MAGIC or ~250kb with chars, ~500kb with shorts, ~1 MB using ints for no MINIMIZE_MAGIC. So a minimal move database (rookMoves[4900] bishopMoves[1428] can be made into as small as 150kb. I'll start implementing this and lets see if it turns out to be even faster. But I guess we cannot infer performance in a real engine by just using this test. Nevertheless, I'll leave all possibilities open in magicmoves.

Yes, that might be interesting and even required to eventually index precalculated move-lists. I mean not passing occupied boards, but rook/bishop attack (sub)-sets to safe bitscan loops. But probably that becomes easier with disjoint file, rank and diagonals directions. What about hashing pawn push and pawn capture targets of a pair of two ranks in one run? And the up to eight knight and king targets? To make a complete branch- and loopless move-generator? ;-)

by **Pradu** » 27 Aug 2006, 02:32

Gerd Isenberg wrote:Yes, that might be interesting and even required to eventually index precalculated move-lists. I mean not passing occupied boards, but rook/bishop attack (sub)-sets to safe bitscan loops. But probably that becomes easier with disjoint file, rank and diagonals directions. What about hashing pawn push and pawn capture targets of a pair of two ranks in one run? And the up to eight knight and king targets? To make a complete branch- and loopless move-generator?

It will take up a good deal of memory for a branch- and loopless move-generator

. Anyways char's can't be used as indecies, too small

. I'm using shorts. I've posted a perfect-hashing database generator but it dosen't work in conjunction with MINIMIZE_MAGICS yet so it is taking up ~550k. In this one you must have MINIMIZE_MAGICS commented out and the following defined in the header file:
#define PERFECT_MAGIC_HASH unsigned short

I'll work on compressing the indexing database to bring the total size of the databases down to ~250k.

by **Andreas Guettinger** » 27 Aug 2006, 12:56

Ok, I tried this out in my bitboard move generator. I didn't try the MINIMIZE_MAGIC part because I don't understand what it does.

Test computer was a Apple Power Mac G5 2.5 Ghz.
32-bit compile: -Wall -O2
64-bit compile: -Wall -O2 -mcpu=G5 -mtune=G5 -fast
Test engine was AnimaX.

Code: Select all: 32-bit: ------- A B C Normal: 34.63 s 7.63 s 10.91 s Rotated: 20.42 s 3.72 s 9.63 s Magic: 16.35 s 3.48 s 7.02 s 64-bit: ------- A B C Normal: 19.87 s 4.41 s 7.23 s Rotated: 14.03 s 2.47 s 6.65 s Magic: 11.25 s 2.46 s 5.25 s Explanations: A) Time to perft 6 of Anima B) Time to gererate 80000000 moves C) Time to Generate+make+undo 80000000 moves

Interestingly the speed of the magicmoves generator on my machine was about the same than my rotated bitboard move generator, but the makemove undomove was much faster do to not having to update the rotated occupied squares bitmaps. All in all for perft there was a considerable speed gain.

Note that the perft function of Anima doesn't use any tricks or hashtables and does all the make/undo moves even at the leave nodes. It also generates captures and noncaptures seperately for every node and concatenates the two move lists together, in short, it's not designed for speed.

by **Gerd Isenberg** » 27 Aug 2006, 14:25

Andreas Guettinger wrote:Ok, I tried this out in my bitboard move generator. I didn't try the MINIMIZE_MAGIC part because I don't understand what it does.

You don't have this in the first line of magicmoves.h?

Code: Select all: #define MINIMIZE_MAGIG

Then you didn't use the dense 841Kb lookup but the huge 2304kb lookup? MINIMIZE_MAGIC uses almost different factors and shift-amount per square.

Not defining MINIMIZE_MAGIC is like Lasse Hansen's approach, each square, to get a 12 bit address for all possible (12..14) rook-occupied states:

Code: Select all: pattern := occupied & rookmask [sq] address := (pattern * rookMagig[sq]) >> (64-12) attacks := rookAttacks[sq][address]

or with a one-dimensional array

Code: Select all: u64 rookAttacks[64*4096] sqBase := sq<<12 attacks := rookAttacks[sqBase + address]

Pradu took advantage of the fact, that a lot of squares require less than 12-bit adressess to get rook attacks for all relevant occupied bits and was able to dense the 2MB to 800KByte for rooks, and each square-lookup has a different shift amount and an own slot with almost different size in the one-dimensional array:

Code: Select all: pattern := occupied & rookmask [sq]; address := (pattern * rookMagig[sq]) >> rookShift[sq] // 9..12 bit address sqBase := rooksOffset[sq] // was sq<<12 on Lasse's approach attacks := rookAttacks[sqBase + address]

I was skeptical in the beginning that this two additional memory accesses, to dense 2MByte to 800KByte for rooks or 256KByte to 41KByte for bishops pays off - but it seems i was wrong as my dumb test-framework suggests. I guess also tlb pollution becomes an issue with a 2MB table - tlb maps virtual addresses of 4KByte pages to physical addresses. So if you simply #define MINIMIZE_MAGIC in the header file and test again...

Gerd

by **Andreas Guettinger** » 27 Aug 2006, 15:23

Gerd Isenberg wrote:So if you simply #define MINIMIZE_MAGIC in the header file and test again...

Gerd

Ahem, I didn't get the point that to uncomment the define is enough. Anyway, I uncommented #define MINIMIZE_MAGIC, make clean, and make again but the result didn't change much. Maybe in a full search there is a difference.

Code: Select all: 32-bit: ------- A B C Normal: 34.63 s 7.63 s 10.91 s Rotated: 20.42 s 3.72 s 9.63 s Magic: 16.35 s 3.48 s 7.02 s Minimize: 16.86 s 3.61 s 7.09 s 64-bit: ------- A B C Normal: 19.87 s 4.41 s 7.23 s Rotated: 14.03 s 2.47 s 6.65 s Magic: 11.25 s 2.46 s 5.25 s Minimize: 11.27 s 2.43 s 5.25 s Explanations: A) Time to perft 6 of Anima B) Time to gererate 80000000 moves C) Time to Generate+make+undo 80000000 moves

by **Pradu** » 27 Aug 2006, 19:23

Nice to hear it is doing as good or better than the rotated approach in both 32-bits and 64-bits! Magic bitboard generators are also a lot more versitile than rotated bitboad generators. Take for example the issue of X-ray attacks.

Code: Select all: attacks=Rmoves(square,occupancy); xrayattacks=Rmoves(square,(occupancy&~attacks)) & ~attacks

It isnt possible to do something like this in rotated.

I found a bug in the PEFRECT_HASH approach, fixing.

Winboard Forum

Fastest Magic Move Bitboard Generator ready to use

Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Re: Fastest Magic Move Bitboard Generator ready to use

Who is online