Moderator: Andres Valverde
Pradu wrote:Downoad the magic move bitboard generator here:
www.prism.gatech.edu/~gtg365v/Buzz
Because I don't have the necessary hardware (I only have a P4), I would appriciate it if someone could test how fast it is on 32/64 bit processors and comparisions in speed to other move bitboard generators. The code is made in such way that it can be integrated into any existing bitboard chess program very easily.
Rook:
time1 in sec = 3.266 runs 7624512*64 ~ 6.693 ns
time2 in sec = 2.922 runs 7624512*64 ~ 5.988 ns
Bishop:
time1 in sec = 3.703 runs 7624512*64 ~ 7.589 ns
time2 in sec = 2.031 runs 7624512*64 ~ 4.162 ns
Queen:
time1 in sec = 5.875 runs 7624512*64 ~12.040 ns
time2 in sec = 4.109 runs 7624512*64 ~ 8.421 ns
int main(int argc, char* argv[])
{
u64 atta=0, first = 0x1f, x, y, cnt;
clock_t start, stop;
init(); // my stuff
initmagicmoves();
start = clock();
for (x = first, cnt=1; (y = snoob(x)) > x; x = y, cnt++)
for (int sq = 0; sq < 64; sq++)
atta ^= rookAttacks(y, (u32)sq);
stop = clock();
printf("time1 in sec = %6.3f runs %lld*64 ~%6.3f ns\n", (float)(stop-start)/CLOCKS_PER_SEC, cnt, ((float)(stop-start)/64*1000000.0) / cnt);
start = clock();
for (x = first, cnt=1; (y = snoob(x)) > x; x = y, cnt++)
for (int sq = 0; sq < 64; sq++)
atta ^= PraduRookAttacks(y, sq);
stop = clock();
printf("time2 in sec = %6.3f runs %lld*64 ~%6.3f ns\n", (float)(stop-start)/CLOCKS_PER_SEC, cnt, ((float)(stop-start)/64*1000000.0) / cnt);
if ( atta != 0 ) {
printf("oups, something is broken\n");
}
return 0;
}
; Function compile flags: /Ogtpy
occ$ = 8
sq$ = 16
?PraduRookAttacks@@YA_K_KI@Z PROC ; PraduRookAttacks, COMDAT
00000 8b c2 mov eax, edx
00002 4c 8d 05 00 00
00 00 lea r8, OFFSET FLAT:__ImageBase
00009 49 8b 94 c0 00
00 00 00 mov rdx, QWORD PTR ?magicmoves_r_mask@@3QB_KB[r8+rax*8]
00011 48 23 d1 and rdx, rcx
00014 41 0f b6 8c 80
00 00 00 00 movzx ecx, BYTE PTR ?magicmoves_r_shift@@3QBHB[r8+rax*4]
0001d 49 0f af 94 c0
00 00 00 00 imul rdx, QWORD PTR ?magicmoves_r_magics@@3QB_KB[r8+rax*8]
00026 49 8b 84 c0 00
00 00 00 mov rax, QWORD PTR ?magicmoves_r_indecies@@3PAPEB_KA[r8+rax*8]
0002e 48 d3 ea shr rdx, cl
00031 48 8b 04 d0 mov rax, QWORD PTR [rax+rdx*8]
00035 c3 ret 0
?PraduRookAttacks@@YA_K_KI@Z ENDP ; PraduRookAttacks
; Function compile flags: /Ogtpy
occ$ = 16
sq$ = 24
?rookAttacks@@YA_K_KI@Z PROC ; rookAttacks, COMDAT
00000 40 53 push rbx
00002 44 8b c2 mov r8d, edx
00005 44 8b da mov r11d, edx
00008 4c 8b d1 mov r10, rcx
0000b 8b ca mov ecx, edx
0000d 49 8b d2 mov rdx, r10
00010 48 b8 01 01 01
01 01 01 01 01 mov rax, 0101010101010101H
0001a 83 e1 07 and ecx, 7
0001d 41 83 e3 f8 and r11d, -8
00021 48 8d 1d 00 00
00 00 lea rbx, OFFSET FLAT:?firstRankAttacks@@3PAY07EA
00028 48 d3 ea shr rdx, cl
0002b 80 f1 07 xor cl, 7
0002e 48 23 d0 and rdx, rax
00031 48 b8 00 04 08
10 20 40 80 00 mov rax, 0080402010080400H
0003b 48 0f af d0 imul rdx, rax
0003f 49 8b c0 mov rax, r8
00042 48 83 f0 38 xor rax, 56
00046 48 c1 e8 03 shr rax, 3
0004a 48 c1 ea 3a shr rdx, 58
0004e 48 8d 04 d0 lea rax, QWORD PTR [rax+rdx*8]
00052 41 8a d0 mov dl, r8b
00055 44 0f b6 0c 18 movzx r9d, BYTE PTR [rax+rbx]
0005a 48 b8 01 02 04
08 10 20 40 80 mov rax, 8040201008040201H
00064 83 e2 07 and edx, 7
00067 4c 0f af c8 imul r9, rax
0006b 48 b8 80 80 80
80 80 80 80 80 mov rax, 8080808080808080H
00075 4c 23 c8 and r9, rax
00078 49 d3 e9 shr r9, cl
0007b 41 8d 4b 01 lea ecx, DWORD PTR [r11+1]
0007f 49 d3 ea shr r10, cl
00082 41 8b cb mov ecx, r11d
00085 41 83 e2 3f and r10d, 63
00089 4e 8d 04 d2 lea r8, QWORD PTR [rdx+r10*8]
0008d 41 0f b6 04 18 movzx eax, BYTE PTR [r8+rbx]
00092 48 d3 e0 shl rax, cl
00095 49 0b c1 or rax, r9
00098 5b pop rbx
00099 c3 ret 0
?rookAttacks@@YA_K_KI@Z ENDP ; rookAttacks
Pradu wrote:Downoad the magic move bitboard generator here:
www.prism.gatech.edu/~gtg365v/Buzz
Because I don't have the necessary hardware (I only have a P4), I would appriciate it if someone could test how fast it is on 32/64 bit processors and comparisions in speed to other move bitboard generators. The code is made in such way that it can be integrated into any existing bitboard chess program very easily.
Gerd Isenberg wrote:Your approach reduces the size considerable - specially for the bishops. I don't like the additional indirection with the pointer array, plus variable shift array, so much. Whether the two additional accesses, even if L1, pay off, is the question. If you pick some part from your 64KByte L1 cache, from ~1MB or 4MB tables, probably doesn't matter so much. There are eventually L2/L3-cache and page issues. Guess Lasse's approach with larger dense, but homogenious arrays, and constant shift is faster.
Gerd Isenberg wrote:So far Pradu's code is only about to get rook- and bishop-attack-sets by square index of the slider and the occupied bitboard.
It's the main (and common) part of the move generator for bitboards. It isn't actually the move generator.Uri wrote:It does not seem to me a move generator and maybe it is only a part of a move generator.
Uri Blass wrote:Pradu wrote:Downoad the magic move bitboard generator here:
www.prism.gatech.edu/~gtg365v/Buzz
Because I don't have the necessary hardware (I only have a P4), I would appriciate it if someone could test how fast it is on 32/64 bit processors and comparisions in speed to other move bitboard generators. The code is made in such way that it can be integrated into any existing bitboard chess program very easily.
"You can use this to generate move bitboards for your own chess engine if you wish"
In other words people can start their own chess programs from a foreign move geneator.
I dislike that idea and I wonder if it is not going to be considered as a clone in tournaments.
If we let people to use that move generator in tournaments then tomorrow somebody is going to write some good evaluation function and write:
"You can use this to generate evaluation for your own chess engine if you wish"
After it continues in this way the total result is going to be that people are going to have non original chess engines that are considered as their engines.
Uri
Gerd Isenberg wrote:I agree that sharing ideas or pseudocode has another quality than sharing source code, conveniently to try. I also don't like the open source idea too much - and have problems with concrete frameworks, programs and even perft with unclear license conditions...
So far Pradu's code is only about to get rook- and bishop-attack-sets by square index of the slider and the occupied bitboard. There are the classical rotated approaches, Lasse's approach and zillion others doing the same. All well known. I think this is still an atomic function, somehow on the same level like bitscan and popcount or Kogge-Stone fill routines. This is also a kind of competition to get the fastest sldiding attack getter, and the most dense tables...
Gerd
Dann Corbit wrote:Uri Blass wrote:Pradu wrote:Downoad the magic move bitboard generator here:
www.prism.gatech.edu/~gtg365v/Buzz
Because I don't have the necessary hardware (I only have a P4), I would appriciate it if someone could test how fast it is on 32/64 bit processors and comparisions in speed to other move bitboard generators. The code is made in such way that it can be integrated into any existing bitboard chess program very easily.
"You can use this to generate move bitboards for your own chess engine if you wish"
In other words people can start their own chess programs from a foreign move geneator.
I dislike that idea and I wonder if it is not going to be considered as a clone in tournaments.
If we let people to use that move generator in tournaments then tomorrow somebody is going to write some good evaluation function and write:
"You can use this to generate evaluation for your own chess engine if you wish"
After it continues in this way the total result is going to be that people are going to have non original chess engines that are considered as their engines.
Uri
But where did we get alpha-beta?
How about pvs? The history reductions idea from Fruit and Glaurung?
I don't see any harm in it. But I also see the other side where some people get really mad about it (e.g. Smirf author breathes fire about showing other people your code).
There seems to be a groundswell in computer chess that you have to write everything from scratch, but we always borrow ideas at least.
I also agree that there is greater value in reading someone's paper and/or code and understanding the idea then implementing that idea instead of using their code. But some bits of writing a chess engine (e.g the Winboard interface or UCI part) are just tiring crap, that a patient monkey could finish. Now, there are grey lines where if you borrow too much stuff, the engine is not your own.
But can you say that every idea in your engine is totally original? Anyone who says that is not truthful. Or, if they are truthful, they will have a crappy engine unless they are Knuth or Sedgewick or some other freak of nature.
IMO-YMMV.
Uri wrote:the title gave me wrong impression
Pradu wrote:
Sorry Gerd, I do not understand the assembler that well; but I've taken your written suggestions and produced another version of magicmoves (currently uploaded). In this one you can define MINIMIZE_MAGIC in the header file and produce 841kb databases. Otherwise 2304kb databases will be used. By the way, did you use a 64 bit compile or a 32 bit compile for your tests? It would be nice to see a comparison between these and between MINIMIZE_MAGIC and no MINIMIZE_MAGIC and to see if the increase (mabe even decrease, I'm not sure) in speed is truly worth the extra size.
mov eax, edx
; copy square from edx into register eax
; a 32-bit move or operation on a 32-bit regsister implicitly clears the
; upper 32-bit of the "shadowed" 64-bit register rax, which is further used
; to index the arrays with various scaling (rax*4, rax*8) for int or U64 arrays.
lea r8, OFFSET FLAT:__ImageBase
; hmm..., in 64-bit mode we need a base register to access globals!
mov rdx, QWORD PTR ?magicmoves_r_mask@@3QB_KB[r8+rax*8]
; get the magicmoves_r_mask[BASE + 8*sq] into rdx
; the index scaling rax*8 is required because we need to index 8-byte items
and rdx, rcx
; rdx := rdx & rcx -> mask := mask & occupied
movzx ecx, BYTE PTR ?magicmoves_r_shift@@3QBHB[r8+rax*4]
; get the shift amount, magicmoves_r_shift[BASE + 4*sq]
; here the scaling is 4 due to 4-byte integer size
; the strange thing is that the compiler takes a byte-move with zero extension to
; 32-bit (movzx), rather than the shorter and faster mov ecx, DWORD PTR ..
; variable shift on x86 is only possible with the byte register cl, which
; is lower part of ecx or even rcx.
imul rdx, QWORD PTR ?magicmoves_r_magics@@3QB_KB[r8+rax*8]
; the 64-bit multiplication rdx := rdx * memory-operand
; the masked occupied in rdx is multiplied with
; magicmoves_r_magics[BASE + 8*square]. The product is stored in
; source1/target rdx-register again, so that the initial mask, no longer needed,
; is overwritten.
mov rax, QWORD PTR ?magicmoves_r_indecies@@3PAPEB_KA[r8+rax*8]
; get the pointer to the array base into by 8-byte scaled square (rax*8) into rax,
; so square formerly in rax is overwitten
shr rdx, cl
; the 64-bit right shift of the product, to get the occupied index
mov rax, QWORD PTR [rax+rdx*8]
; get the final attack-set *(ptr + occpied), again scaling of 8
ret 0
; return to the caller
Rook attack sq&1 sq&3 sq&7 sq&15 sq&31 sq
KoggeStone 15.177 15.177 15.177 15.177 15.177 15.177
DenseFirstRank 6.660 7.269 7.046 7.078 7.046 6.660
!MINIMIZE_MAGIC 4.740 5.476 6.115 6.498 8.068 18.700
MINIMIZE_MAGIC 4.420 4.867 5.379 5.699 5.730 6.084
Bishop attack
KoggeStone 16.460 16.460 16.460 16.460 16.460 16.460
DenseFirstRank 7.908 7.429 7.398 7.429 7.396 7.589
!MINIMIZE_MAGIC 4.740 4.547 4.482 4.484 4.996 5.252
MINIMIZE_MAGIC 4.420 4.484 4.451 4.388 4.388 4.162
Queen attack
KoggeStone 30.133 30.133 30.133 30.133 30.133 30.133
DenseFirstRank 11.527 10.566 10.952 10.952 10.952 11.945
!MINIMIZE_MAGIC 6.437 8.390 7.972 8.837 12.745 32.148
MINIMIZE_MAGIC 6.084 6.660 7.365 7.173 8.101 8.486
occ$ = 8
sq$ = 16
?LasseRookAttacks@@YA_K_KI@Z PROC ; LasseRookAttacks, COMDAT
00000 44 8b c2 mov r8d, edx
00003 48 8d 15 00 00 00 00 lea rdx, OFFSET FLAT:__ImageBase
0000a 4a 8b 84 c2 00 00 00 00 mov rax, QWORD PTR ?magicmoves_r_mask@@3QB_KB[rdx+r8*8]
00012 48 23 c1 and rax, rcx
00015 4a 0f af 84 c2 00 00 00 00 imul rax, QWORD PTR ?magicmoves_r_magics@@3QB_KB[rdx+r8*8]
0001e 49 c1 e0 0c shl r8, 12
00022 48 c1 e8 34 shr rax, 52
00026 49 03 c0 add rax, r8
00029 48 8b 84 c2 00 00 00 00 mov rax, QWORD PTR ?magicmovesrdb@@3PAY0BAAA@_KA[rdx+rax*8]
00031 c3 ret 0
?LasseRookAttacks@@YA_K_KI@Z ENDP ; LasseRookAttacks
#include <intrin.h>
u64 snoob (u64 x) {
unsigned long lidx;
_BitScanForward64(&lidx, x);
u64 ls1b = x & -(i64)x;
u64 ripp = x + ls1b;
u64 ones = x ^ ripp;
u64 sone = ((ones >> 2) >> lidx);
return ripp + sone;
}
x$ = 8
?snoob@@YA_K_K@Z PROC ; snoob, COMDAT
00000 4c 8b c1 mov r8, rcx
00003 48 0f bc d1 bsf rdx, rcx
00007 49 f7 d8 neg r8
0000a 4c 23 c1 and r8, rcx
0000d 4c 03 c1 add r8, rcx
00010 49 8b c0 mov rax, r8
00013 48 33 c1 xor rax, rcx
00016 8b ca mov ecx, edx
00018 48 c1 e8 02 shr rax, 2
0001c 48 d3 e8 shr rax, cl
0001f 49 03 c0 add rax, r8
00022 c3 ret 0
?snoob@@YA_K_K@Z ENDP ; snoob
Pradu wrote:Very interesting; mabe a perfect hash by indirection will be even faster! The perfect hash will take ~50kb for the moves and the indexing table will take ~100kb with chars, ~200kb with shorts, ~400 kb using ints for MINIMIZE_MAGIC or ~250kb with chars, ~500kb with shorts, ~1 MB using ints for no MINIMIZE_MAGIC. So a minimal move database (rookMoves[4900] bishopMoves[1428] can be made into as small as 150kb. I'll start implementing this and lets see if it turns out to be even faster. But I guess we cannot infer performance in a real engine by just using this test. Nevertheless, I'll leave all possibilities open in magicmoves.
Gerd Isenberg wrote:Yes, that might be interesting and even required to eventually index precalculated move-lists. I mean not passing occupied boards, but rook/bishop attack (sub)-sets to safe bitscan loops. But probably that becomes easier with disjoint file, rank and diagonals directions. What about hashing pawn push and pawn capture targets of a pair of two ranks in one run? And the up to eight knight and king targets? To make a complete branch- and loopless move-generator?
32-bit:
-------
A B C
Normal: 34.63 s 7.63 s 10.91 s
Rotated: 20.42 s 3.72 s 9.63 s
Magic: 16.35 s 3.48 s 7.02 s
64-bit:
-------
A B C
Normal: 19.87 s 4.41 s 7.23 s
Rotated: 14.03 s 2.47 s 6.65 s
Magic: 11.25 s 2.46 s 5.25 s
Explanations:
A) Time to perft 6 of Anima
B) Time to gererate 80000000 moves
C) Time to Generate+make+undo 80000000 moves
Andreas Guettinger wrote:Ok, I tried this out in my bitboard move generator. I didn't try the MINIMIZE_MAGIC part because I don't understand what it does.
#define MINIMIZE_MAGIG
pattern := occupied & rookmask [sq]
address := (pattern * rookMagig[sq]) >> (64-12)
attacks := rookAttacks[sq][address]
u64 rookAttacks[64*4096]
sqBase := sq<<12
attacks := rookAttacks[sqBase + address]
pattern := occupied & rookmask [sq];
address := (pattern * rookMagig[sq]) >> rookShift[sq] // 9..12 bit address
sqBase := rooksOffset[sq] // was sq<<12 on Lasse's approach
attacks := rookAttacks[sqBase + address]
Gerd Isenberg wrote:So if you simply #define MINIMIZE_MAGIC in the header file and test again...
Gerd
32-bit:
-------
A B C
Normal: 34.63 s 7.63 s 10.91 s
Rotated: 20.42 s 3.72 s 9.63 s
Magic: 16.35 s 3.48 s 7.02 s
Minimize: 16.86 s 3.61 s 7.09 s
64-bit:
-------
A B C
Normal: 19.87 s 4.41 s 7.23 s
Rotated: 14.03 s 2.47 s 6.65 s
Magic: 11.25 s 2.46 s 5.25 s
Minimize: 11.27 s 2.43 s 5.25 s
Explanations:
A) Time to perft 6 of Anima
B) Time to gererate 80000000 moves
C) Time to Generate+make+undo 80000000 moves
attacks=Rmoves(square,occupancy);
xrayattacks=Rmoves(square,(occupancy&~attacks)) & ~attacks
Return to Programming and Technical Discussions
Users browsing this forum: No registered users and 12 guests