... and it seems not very efficient in 64 bit :)
Yes, 64-KByte lookups may pollute your L1 cache somewhat...
Three conditional branches may produce a mispredicted jump now and then... You may also try Matt Taylor's folded bitscan with 32-bit multiplication, or some long-long-to-double conversion; you'll find them in the archives here...
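Matt Taylor's exact folded routine is in the archives; as a sketch of the same multiply-and-lookup idea, here is a plain (unfolded) 64-bit De Bruijn bitscan. The function names are mine, and the lookup table is built at startup from the De Bruijn property itself:

```cpp
#include <cassert>
#include <cstdint>

typedef uint64_t u64;

// 64-bit De Bruijn multiplication bitscan - a sketch of the
// multiply-and-lookup idea; not Matt Taylor's folded 32-bit version.
const u64 debruijn64 = 0x03f79d71b4cb0a89ULL;
int index64[64];

void initBitScan() {
   // (1ULL << i) * debruijn64 == debruijn64 << i, and the upper six
   // bits of each shift are unique - that is the De Bruijn property
   for (int i = 0; i < 64; i++)
      index64[(debruijn64 << i) >> 58] = i;
}

int bitScanForward(u64 bb) {
   assert(bb != 0);
   // isolate the least significant one bit, multiply, use top 6 bits
   return index64[((bb & (0 - bb)) * debruijn64) >> 58];
}
```

No branches, no 64-KByte table - one multiplication and a 64-entry lookup.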
OK for the bit[64]*byte[64] dot product, but... where is it useful in your chess program? Maybe in the evaluation function?
Yes, probably to do a weighted popcount of attack sets for several piece types. Usually the center is more important than the edges. You may also have an array of weight arrays, indexed by some king square, to consider attacks/defends near the king. For instance, with some bitwise instructions you may combine two bishop attack sets (usually on different square colors) with two knight attack sets, and "code" those attacks with two corresponding bits of two bitboards - value "one" for an odd number of attacks and "two" for at least a double attack - to call the dot product only two times, instead of four, for literally all "light" pieces per side...
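Such a bit[64]*byte[64] dot product is just a weighted popcount; a naive scalar reference (names are mine) looks like this:

```cpp
#include <cstdint>

typedef uint64_t u64;

// weighted popcount: sum of weight[sq] over all set bits of bb -
// a naive scalar reference for the bit[64]*byte[64] dot product
int dotProduct(u64 bb, const unsigned char weight[64]) {
   int sum = 0;
   for (int sq = 0; sq < 64; sq++)
      if (bb & (1ULL << sq))
         sum += weight[sq];  // e.g. center squares weighted higher than edges
   return sum;
}
```

Called with an attack set and a center- or king-zone-weight array, it yields the weighted mobility/pressure score discussed above.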
There is an even faster dot product with 90-degree rotation, which might be used with precalculated weight arrays. Also, if your average weight is below, let's say, 64, you may save some additional cycles by doing the horizontal add via _mm_sad_epu8 only once, after three vertical bytewise adds (probably with saturation). So you can get down to about 34 cycles or so - and some day, if we get 128-bit ALUs, 17 cycles or even less...
And what about the fill stuff? I'm having big trouble finding a way to compute king safety efficiently; maybe bitboard filling can help there?
Kogge-Stone fill stuff is quite expensive register computation, but has the advantage of getting sliding attacks setwise, saving a loop and bitscans - e.g. for all rooks and queens per side in the south direction (gen(erator) == rooks|queens; pro(pagator) == empty squares):
Code:
__forceinline
u64 southAttacks(u64 gen, u64 pro) {
   gen |= pro & (gen >>  8);   // fill one rank south through empty squares
   pro  = pro & (pro >>  8);   // propagator now holds runs of two empty squares
   gen |= pro & (gen >> 16);   // fill two ranks at once
   pro  = pro & (pro >> 16);   // runs of four empty squares
   gen |= pro & (gen >> 32);   // fill four ranks at once
   return gen >>  8;           // shift once more: occluded fill -> attack set
}
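The other directions follow the same scheme; horizontal and diagonal directions additionally need a file wrap mask on the propagator. East, as a sketch:

```cpp
#include <cstdint>

typedef uint64_t u64;

// Kogge-Stone east attacks: same scheme as southAttacks, but shifting
// left by 1/2/4 and masking off the a-file to avoid rank wraps
u64 eastAttacks(u64 gen, u64 pro) {
   const u64 notAFile = 0xfefefefefefefefeULL;
   pro &= notAFile;                // wrapped propagation is never legal
   gen |= pro & (gen << 1);
   pro &= pro << 1;
   gen |= pro & (gen << 2);
   pro &= pro << 2;
   gen |= pro & (gen << 4);
   return notAFile & (gen << 1);  // final shift, again excluding wraps
}
```

North is the mirror of south with left shifts and no wrap mask; the four diagonal directions combine a vertical and a horizontal shift and use the same a-/h-file masks.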
With quad-bitboards (aka a pair of 128-bit xmm registers) and vertical nibbles with up to 15 "random" piece codes, there are some creative possibilities, e.g. to consider the king and surrounding squares on different rays per direction to get a set of "virtual" sliding attacks from those squares - so you can easily look with some intersections whether own or opposite rooks/bishops/queens defend/attack those squares, and how often...
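For illustration, a quad-bitboard with vertical nibble codes and a code query might look like this (the struct layout and names are my assumption, not necessarily the original):

```cpp
#include <cstdint>

typedef uint64_t u64;

// quad-bitboard: bit sq of bb[i] holds bit i of a 4-bit piece code,
// so each square carries one of up to 15 nonzero "random" piece codes
struct QB { u64 bb[4]; };

// hypothetical helper: store a 4-bit code on a square
void setCode(QB& q, int sq, unsigned code) {
   for (int i = 0; i < 4; i++) {
      q.bb[i] &= ~(1ULL << sq);
      if ((code >> i) & 1) q.bb[i] |= 1ULL << sq;
   }
}

// set of all squares whose vertical nibble equals the given code -
// the kind of intersection used to ask "which squares hold piece X?"
u64 squaresWithCode(const QB& q, unsigned code) {
   u64 set = ~0ULL;
   for (int i = 0; i < 4; i++)
      set &= ((code >> i) & 1) ? q.bb[i] : ~q.bb[i];
   return set;
}
```

The point is that one setwise fill over the four boards propagates all piece codes at once, and plain AND/ANDN intersections recover any particular piece set afterwards.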
Code:
template <class T> __forceinline
void southAttacks(QB& t, const QB& s, const u64 &empty)
{
   T* pt = (T*)&t;
   T g0((T&)s.bb[0]);       // lower two bitboards of the quad-bitboard
   T g1((T&)s.bb[2]);       // upper two bitboards
   T p(empty);              // propagator, broadcast to all lanes
   g0 |= p & (g0 >>  8);
   g1 |= p & (g1 >>  8);
   p   = p & ( p >>  8);
   g0 |= p & (g0 >> 16);
   g1 |= p & (g1 >> 16);
   p   = p & ( p >> 16);
   g0 |= p & (g0 >> 32);
   g1 |= p & (g1 >> 32);
   pt[0] = (g0 >>  8);      // final shift yields the attack sets
   pt[1] = (g1 >>  8);
}
T is either XMM (the intrinsic wrapper) or GP (standard C++ with u64[4] - MMX somehow sucks with MSC); the generated assembly looks promising so far. The idea is to push IPC to the max by interleaved scheduling (hopefully done by the compiler with inlined functions) of fills in different directions. One may also try to mix different register files for best performance.
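A minimal GP-style wrapper, just enough operators for such a fill template to compile, might look like this (a sketch under my own naming - here as a u64 pair covering two bitboards per T, since g0 and g1 each handle half the quad-bitboard):

```cpp
#include <cstdint>

typedef uint64_t u64;

// hypothetical "GP" register wrapper: a 128-bit value as a pair of u64,
// with just the operators the southAttacks fill template needs
struct GP {
   u64 b[2];
   GP() {}
   GP(const u64& e) { b[0] = e; b[1] = e; }  // broadcast, as for "empty"
   GP operator>>(int n) const {
      GP r; r.b[0] = b[0] >> n; r.b[1] = b[1] >> n; return r;
   }
   GP operator&(const GP& o) const {
      GP r; r.b[0] = b[0] & o.b[0]; r.b[1] = b[1] & o.b[1]; return r;
   }
   GP& operator|=(const GP& o) {
      b[0] |= o.b[0]; b[1] |= o.b[1]; return *this;
   }
};

// the fill body from the template above, instantiated for one GP:
// two independent south fills run interleaved in one function
GP southAttacksGP(GP g, GP p) {
   g |= p & (g >> 8);
   p  = p & (p >> 8);
   g |= p & (g >> 16);
   p  = p & (p >> 16);
   g |= p & (g >> 32);
   return g >> 8;
}
```

With such a scalar fallback and an analogous XMM wrapper, the same template source compiles to either register file, which makes the mixed-scheduling experiments easy to try.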