Congrats Pradu - it seems your MINIMIZE_MAGIC is really a lot faster, not to say the fastest, especially for the rooks in this snooby framework.
Puhh - I was dead wrong with my guess that homogeneous arrays might be faster, as the pure assembly below might suggest.
I did not expect that extreme slowdown with the huge arrays. They totally kill the rook/queen performance - at least with this unrealistic occupied pattern, snoobing 5-bit subsets around and traversing squares from 0..63. Of course, exchanging the square and occupied-state dimensions of the databases would give much better cache behavior here, since sq is traversed in the inner loop. So I tried to relax the square variance by "anding" the square parameter of the attack-getter with 1, 3, 7, 15, and 31 to see how memory behaves; a sketch of the loop follows below. For the sake of completeness I misused Kogge-Stone attacks (pure gp-registers) with an additional 1<<sq.
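To make the measurement explicit, here is a minimal sketch of the timing loop as described - names and structure are hypothetical, only the snoob/square traversal and the square masking are taken from the description above:

- Code: Select all
typedef unsigned long long u64;

u64 snoob(u64 x); // next 5-bit subset; defined at the end of this post

// Hypothetical sketch of the exercised loop: all C(64,5) = 7624512
// five-bit occupied subsets, times all 64 squares, with the square
// variance relaxed by "sq & mask" for mask in {1,3,7,15,31,63}.
u64 exercise(u64 (*attacks)(u64 occ, unsigned sq), unsigned mask) {
   u64 dummy = 0;
   u64 occ = 0x1F; // lowest subset with five set bits
   for (unsigned i = 0; i < 7624512; ++i, occ = snoob(occ))
      for (unsigned sq = 0; sq < 64; ++sq)
         dummy ^= attacks(occ, sq & mask);
   return dummy; // keeps the calls alive under the optimizer
}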
All times in ns per call (+ overhead), measured on an AMD Athlon 64 X2 4800+ at 2.4 GHz:
- Code: Select all
Rook attack        sq&1    sq&3    sq&7   sq&15   sq&31      sq
KoggeStone       15.177  15.177  15.177  15.177  15.177  15.177
DenseFirstRank    6.660   7.269   7.046   7.078   7.046   6.660
!MINIMIZE_MAGIC   4.740   5.476   6.115   6.498   8.068  18.700
MINIMIZE_MAGIC    4.420   4.867   5.379   5.699   5.730   6.084

Bishop attack
KoggeStone       16.460  16.460  16.460  16.460  16.460  16.460
DenseFirstRank    7.908   7.429   7.398   7.429   7.396   7.589
!MINIMIZE_MAGIC   4.740   4.547   4.482   4.484   4.996   5.252
MINIMIZE_MAGIC    4.420   4.484   4.451   4.388   4.388   4.162

Queen attack
KoggeStone       30.133  30.133  30.133  30.133  30.133  30.133
DenseFirstRank   11.527  10.566  10.952  10.952  10.952  11.945
!MINIMIZE_MAGIC   6.437   8.390   7.972   8.837  12.745  32.148
MINIMIZE_MAGIC    6.084   6.660   7.365   7.173   8.101   8.486
Again, in a concrete chess program with real positions, things may behave completely differently... But it somehow demonstrates how volatile huge memory-based approaches may be, and what may happen if there is a lot of other memory traffic around and the L1 cache starts to get polluted...
Here is the generated 64-bit assembly of Lasse's rook-getter with three instead of five memory accesses. The shl 12 is necessary to scale the square index, since scale factors in the address operand are only possible with *1, *2, *4, *8.
- Code: Select all
occ$ = 8
sq$ = 16
?LasseRookAttacks@@YA_K_KI@Z PROC ; LasseRookAttacks, COMDAT
00000 44 8b c2 mov r8d, edx
00003 48 8d 15 00 00 00 00 lea rdx, OFFSET FLAT:__ImageBase
0000a 4a 8b 84 c2 00 00 00 00 mov rax, QWORD PTR ?magicmoves_r_mask@@3QB_KB[rdx+r8*8]
00012 48 23 c1 and rax, rcx
00015 4a 0f af 84 c2 00 00 00 00 imul rax, QWORD PTR ?magicmoves_r_magics@@3QB_KB[rdx+r8*8]
0001e 49 c1 e0 0c shl r8, 12
00022 48 c1 e8 34 shr rax, 52
00026 49 03 c0 add rax, r8
00029 48 8b 84 c2 00 00 00 00 mov rax, QWORD PTR ?magicmovesrdb@@3PAY0BAAA@_KA[rdx+rax*8]
00031 c3 ret 0
?LasseRookAttacks@@YA_K_KI@Z ENDP ; LasseRookAttacks
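Reconstructed from the mangled names and the address arithmetic above, the C source is presumably something like this - the array names are taken from the symbols, the [64][4096] shape from ?magicmovesrdb@@3PAY0BAAA@_KA (0xBAAA@ = 4096) together with the shr rax, 52:

- Code: Select all
typedef unsigned long long u64;

extern const u64 magicmoves_r_mask[64];   // relevant occupancy mask per square
extern const u64 magicmoves_r_magics[64]; // 64-bit magic factor per square
extern u64 magicmovesrdb[64][4096];       // homogeneous 64 x 4096 attack database

u64 LasseRookAttacks(u64 occ, unsigned sq) {
   // mask, multiply, take the upper 12 bits as index into the square's
   // 4096-entry row - the row offset sq*4096 is the shl r8, 12 above
   return magicmovesrdb[sq][((occ & magicmoves_r_mask[sq])
                             * magicmoves_r_magics[sq]) >> 52];
}

Exchanging the dimensions as mentioned above would mean indexing magicmovesrdb[index][sq] instead, so that the inner sq loop walks adjacent entries.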
This is the current snooby, which indicates that the 9-cycle vector path bsf is slightly superior to the deBruijn multiplication with lookup - less register pressure and of course shorter code (a sketch of the deBruijn variant follows at the end).
- Code: Select all
#include <intrin.h>

typedef unsigned long long u64;
typedef long long i64;

u64 snoob (u64 x) {                   // next greater number with same number of one bits
   unsigned long lidx;
   _BitScanForward64(&lidx, x);       // bit index of the least significant one bit
   u64 ls1b = x & -(i64)x;            // isolate least significant one bit
   u64 ripp = x + ls1b;               // ripple carry through the lowest run of ones
   u64 ones = x ^ ripp;               // all bits changed by the ripple
   u64 sone = ((ones >> 2) >> lidx);  // shift right by lidx replaces division by ls1b
   return ripp + sone;
}
x$ = 8
?snoob@@YA_K_K@Z PROC ; snoob, COMDAT
00000 4c 8b c1 mov r8, rcx
00003 48 0f bc d1 bsf rdx, rcx
00007 49 f7 d8 neg r8
0000a 4c 23 c1 and r8, rcx
0000d 4c 03 c1 add r8, rcx
00010 49 8b c0 mov rax, r8
00013 48 33 c1 xor rax, rcx
00016 8b ca mov ecx, edx
00018 48 c1 e8 02 shr rax, 2
0001c 48 d3 e8 shr rax, cl
0001f 49 03 c0 add rax, r8
00022 c3 ret 0
?snoob@@YA_K_K@Z ENDP ; snoob
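For comparison, here is a sketch of the deBruijn multiplication bitscan - this is the well-known 64-bit version by Martin Läuter, not necessarily the exact variant measured here. It needs the 64-byte lookup table and an extra imul, hence the register pressure and code size mentioned above:

- Code: Select all
typedef unsigned long long u64;

static const unsigned index64[64] = {
    0,  1, 48,  2, 57, 49, 28,  3,
   61, 58, 50, 42, 38, 29, 17,  4,
   62, 55, 59, 36, 53, 51, 43, 22,
   45, 39, 33, 30, 24, 18, 12,  5,
   63, 47, 56, 27, 60, 41, 37, 16,
   54, 35, 52, 21, 44, 32, 23, 11,
   46, 26, 40, 15, 34, 20, 31, 10,
   25, 14, 19,  9, 13,  8,  7,  6
};

// bit index of the least significant one bit, b != 0
unsigned bitScanForward(u64 b) {
   const u64 debruijn64 = 0x03f79d71b4cb0a89ULL;
   return index64[((b & (0 - b)) * debruijn64) >> 58];
}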
Gerd