Winboard Forum

Posted: **05 Feb 2008, 19:19**

Hello,

Apple is supposed to release an SDK for the iPhone and iPod Touch some time this month, and of course I want to see my engine run on this platform as soon as possible. I therefore plan to start working on a more simple, compact and light-weight version of Glaurung 2 some time soon.

I've been away from computer chess for a while, and I am not sure I am up to date about the latest developments in bitboard move generation. Which technique is currently considered to be the best with a small amount of available memory? Would it perhaps be better not to use bitboards at all?

Tord

Posted: **05 Feb 2008, 20:46**

How about Gerd's 1.5k approach? (Should be on the chess programming wiki)

Posted: **05 Feb 2008, 20:56**

The ARM1176 is a nice 32-bit processor, even with SIMD-instructions. JVM, C-compilers and devtools:

http://www.arm.com/products/DevTools/Re ... Suite.html
http://infocenter.arm.com/help/index.jsp

Since it has a 32-bit Byte-reverse word REV{cond} <Rd>, <Rm> instruction to convert little to big-endian or vice versa, likely available as C-intrinsic, I suggest hyperbola quintessence for file and diagonal-getters with 2 KByte lookups. Another 512 bytes are needed for the rank-attacks, that is 2.5K in total. No multiplication required.

http://chessprogramming.wikispaces.com/ ... intessence

It even has a leading zero count intrinsic, see:
http://www.arm.com/products/DevTools/RVCT.html
Targeting Advanced Maths and DSP-Style Solutions

Wow - it has a reverse subtraction! Oups, only other order of operands...

Posted: **05 Feb 2008, 23:13**

Gerd Isenberg wrote:The ARM1176 is a nice 32-bit processor, even with SIMD-instructions. JVM, C-compilers and devtools:

http://www.arm.com/products/DevTools/Re ... Suite.html
http://infocenter.arm.com/help/index.jsp

Thanks for the links! It looks nice, but I am a little bit afraid that Apple's development tools will be a bit more restricted.

Since it has a 32-bit Byte-reverse word REV{cond} <Rd>, <Rm> instruction to convert little to big-endian or vice versa, likely available as C-intrinsic, I suggest hyperbola quintessence for file and diagonal-getters with 2 KByte lookups. Another 512 bytes are needed for the rank-attacks, that is 2.5K in total. No multiplication required.

http://chessprogramming.wikispaces.com/ ... intessence

"Hyperbola quintessence"? What a name!

It even has a leading zero count intrinsic, see:
http://www.arm.com/products/DevTools/RVCT.html
Targeting Advanced Maths and DSP-Style Solutions

Wow - it has a reverse subtraction! Oups, only other order of operands...

This all seems very nice, if I'm allowed to use the instructions. So far, no details about Apple's SDK have surfaced, and there is a risk that they will put severe restrictions on what third-party programs are allowed to do. My worst fear is that it will only be possible to compile to some sort of byte-code running on a virtual machine...

Anyway, thanks for the information! I'll have a close look at the hyperbola quintessence stuff.

Tord

Posted: **09 Jan 2009, 12:25**

Tord Romstad wrote:
Gerd Isenberg wrote:The ARM1176 is a nice 32-bit processor, even with SIMD-instructions. JVM, C-compilers and devtools:

http://www.arm.com/products/DevTools/Re ... Suite.html
http://infocenter.arm.com/help/index.jsp

Thanks for the links! It looks nice, but I am a little bit afraid that Apple's development tools will be a bit more restricted.

Since it has a 32-bit Byte-reverse word REV{cond} <Rd>, <Rm> instruction to convert little to big-endian or vice versa, likely available as C-intrinsic, I suggest hyperbola quintessence for file and diagonal-getters with 2 KByte lookups. Another 512 bytes are needed for the rank-attacks, that is 2.5K in total. No multiplication required.

http://chessprogramming.wikispaces.com/ ... intessence

"Hyperbola quintessence"? What a name!

It even has a leading zero count intrinsic, see:
http://www.arm.com/products/DevTools/RVCT.html
Targeting Advanced Maths and DSP-Style Solutions

Wow - it has a reverse subtraction! Oups, only other order of operands...

This all seems very nice, if I'm allowed to use the instructions. So far, no details about Apple's SDK have surfaced, and there is a risk that they will put severe restrictions on what third-party programs are allowed to do. My worst fear is that it will only be possible to compile to some sort of byte-code running on a virtual machine...

Anyway, thanks for the information! I'll have a close look at the hyperbola quintessence stuff.

It's been a very busy year, and only know, almost one year after I started this thread, I have started experimenting with bitboards on the ARM. The results so far are disappointing: When I use the magic bitboard implementation found in the public version of Glaurung, the speed drops by a factor of 100 on the 412 MHz ARM compared to when running on a single core of a 2.8 GHz Core 2 Duo. Most non-bitboard programs I have tried are only about 30-50 times slower on the ARM. Switching from magic bitboards to hyperbolic quintessence makes it even worse: The speed drops by about 20%. In 64-bit mode on the Core 2 Duo, on the other hand, HQ runs at almost exactly the same speed as magic bitboards.

Using the ARM's clz instruction rather than a standard folded bitscan for bitscanning does not improve performance measurably.

I now think the best approach on the ARM would be to get rid of the bitboards entirely.

Here is the code I use for sliding attack generation. If someone spots any obvious mistakes, please point them out:

Code: Select all: inline uint32_t bswap32(uint32_t u) { asm("rev %0, %1" :"=r"(u) :"0"(u)); return u; } inline uint64_t bswap(uint64_t b) { uint32_t w1 = (uint32_t)b, w2 = (uint32_t)(b>>32); return (uint64_t)bswap32(w2) | ((uint64_t)(bswap32(w1)) << 32); } inline Bitboard rank_attacks_bb(Square s, Bitboard b) { b = (b >> ((s & 56) + 1)) & 63; return RankAttacks[square_file(s)][b] & rank_bb(s); } inline Bitboard file_attacks_bb(Square s, Bitboard b) { b &= SqData[s].fMask; return ((b - SqData[s].bMask) ^ bswap(bswap(b) - bswap(SqData[s].bMask))) & SqData[s].fMask; } inline Bitboard diag_attacks_bb(Square s, Bitboard b) { b &= SqData[s].dMask; return ((b - SqData[s].bMask) ^ bswap(bswap(b) - bswap(SqData[s].bMask))) & SqData[s].dMask; } inline Bitboard xdiag_attacks_bb(Square s, Bitboard b) { b &= SqData[s].xdMask; return ((b - SqData[s].bMask) ^ bswap(bswap(b) - bswap(SqData[s].bMask))) & SqData[s].xdMask; } inline Bitboard bishop_attacks_bb(Square s, Bitboard b) { return diag_attacks_bb(s, b) | xdiag_attacks_bb(s, b); } inline Bitboard rook_attacks_bb(Square s, Bitboard b) { return file_attacks_bb(s, b) | rank_attacks_bb(s, b); } inline Bitboard queen_attacks_bb(Square s, Bitboard b) { return bishop_attacks_bb(s, b) | rook_attacks_bb(s, b); }

Tord

Posted: **09 Jan 2009, 20:38**

Tord Romstad wrote:It's been a very busy year, and only know, almost one year after I started this thread, I have started experimenting with bitboards on the ARM. The results so far are disappointing: When I use the magic bitboard implementation found in the public version of Glaurung, the speed drops by a factor of 100 on the 412 MHz ARM compared to when running on a single core of a 2.8 GHz Core 2 Duo. Most non-bitboard programs I have tried are only about 30-50 times slower on the ARM. Switching from magic bitboards to hyperbolic quintessence makes it even worse: The speed drops by about 20%. In 64-bit mode on the Core 2 Duo, on the other hand, HQ runs at almost exactly the same speed as magic bitboards.

Using the ARM's clz instruction rather than a standard folded bitscan for bitscanning does not improve performance measurably.

I now think the best approach on the ARM would be to get rid of the bitboards entirely.

Here is the code I use for sliding attack generation. If someone spots any obvious mistakes, please point them out:

Tord

Hi Tord,
seems that ARM1176 is an in order processor without any parallel decoding/execution. So slow execution that memory access becomes relative fast, even with huge tables (therefor the better magic performance). Additionally the rev-instruction may be slow - since it is usually not performance critical like bitwise boolean or arithmetical instructions and might therefor implemented with some internal macro-program. Are there any papers with instruction latencies?

Also 32-bit doesn't allow to keep stuff in registers a lot, thus 64-bit aka 2*32 bit really becomes a bottleneck due to additional use of locals on the stack rather than inside registers. May be the compiler is also weak in optimizing 64-bit stuff. May be you can provide some generated assembly of bishop_attacks_bb?

You may try to save one bswap, by either indexing with xor 56 or using one additional structure element with precalculated flipped bits. But I fear that will not help that much

Code: Select all: inline Bitboard file_attacks_bb(Square s, Bitboard b) { b &= SqData[s].fMask; return ((b - SqData[s].bMask) ^ bswap(bswap(b) - SqData[s^56].bMask)) & SqData[s].fMask; } inline Bitboard file_attacks_bb(Square s, Bitboard b) { b &= SqData[s].fMask; return ((b - SqData[s].bMask) ^ bswap(bswap(b) - SqData[s].revbMask)) & SqData[s].fMask; }

Have you tried to not inline the piece attack getters?

Gerd

Posted: **10 Jan 2009, 16:57**

Hi Gerd,

Thanks for the suggestions! I'll give them a try when I find the time, and let you know how it works.

Tord

Winboard Forum

Bitboards with limited memory

Bitboards with limited memory

Re: Bitboards with limited memory

Re: Bitboards with limited memory

Re: Bitboards with limited memory

Re: Bitboards with limited memory

Re: Bitboards with limited memory

Re: Bitboards with limited memory