Winboard Forum

by **Sune Fischer** » 30 Jun 2005, 07:42

diepeveen wrote:Still you didn't touch the subject SPEED.

My code when NOT using inline assembly is even at 64 bits A64 about factor 1.5 faster (2.0-2.2 times faster at 32 bits K7).

I question your method of testing this.

It's very non-trivial to figure out which is faster overall in a chess program IMO. Using a primitive perft is deeply flawed.

Some things you didn't consider is that there is a need to generate captures much more often than non-captures.
Essentially the main question becomes if your move generator can generate 40 moves as fast as a bitboard generator can generate 5 moves.

This doesn't take into account that working with a short movelist (for e.g. sorting) is always faster than working with a long movelist.

Then there is the 64 bit gain..

I think the comparison is not easy, but honestly being 1.5x faster in perft doesn't sound impressive if you consider the many possible optimizations for the bitboarder in a real search.

diepeveen wrote:Also you didn't touch the subject GENERIC code.

Crafty is *not* generic code.

If he gets a bunch of hits from indirection, how much more will it slow down?

Like he's having
BITBOARD whitepawns;
BITBOARD blackpawns;

that's always faster than
BITBOARD pawns[2];

and working with arrays,
also the latter one is more generic and crafty is *not* doing that.

In the makemove, crafty has code for each piece.

How can you defend *that* as generic code?

Peter is correct that there are no portability issues with bitboards, the assembly is easily replaced by small simple C routines albeit a tad slower.

By generic code you mean keep everything in arrays and loop it - small complex code better than long easy-to-read code?

That's a question of personal preference I think, there is no right or wrong on that matter AFAI concerned.

-S

by **Alessandro Scotti** » 30 Jun 2005, 10:32

Pallav Nawani wrote:About how much is the advantage (say in %) you get when you use inline assembly?

IIRC, it was no more than 3% last time I run this test. The asm code itself is quite old and possibly non-optimal, but CPUs are changing rapidly these days and I don't enjoy much working on this stuff. If you want to run some tests yourself there is Crafty of course, or I can send you Kiwi sources to play with, they should be relatively easy to read (and here at least I can be of some help if needed!)

by **Pallav Nawani** » 30 Jun 2005, 17:22

Alessandro Scotti wrote:
Pallav Nawani wrote:About how much is the advantage (say in %) you get when you use inline assembly?

IIRC, it was no more than 3% last time I run this test. The asm code itself is quite old and possibly non-optimal, but CPUs are changing rapidly these days and I don't enjoy much working on this stuff.

Neither do I, and that's the reason I asked this question. If the improvement is minimal, then I can skip writing assembly stuff entirely. :mrgreen:

I am sooo lazy!

Thanks for offering to send me the Kiwi code. I will ask for it later if I run into some problems.

Bye,
Pallav

by **diepeveen** » 30 Jun 2005, 18:18

Writing entire chessprogram in assembly speeds up about factor 2, assuming your C code is real good. You can really find real low level tricks, save everywhere branches and find ideal savings to mix 8 bit instructions with 32 bits and so on.

I have seen real far going examples from probably the best bit-dicker (positive meant!) which is Eric van Rietpaap. He is just using 1 compare to check a whole lot of conditions in order to speed up check whether you can castle. He's doing a single 32 bits compare and uses each 8 bits of them for a condition. So he saves out 3 compares by doing just 1, and keeps those conditions throughout entire program within that single double word.

For those who have not programmed low level assembly this is nearly impossible to imagine how much speedup especially at x86 processors you can get by programming in assembly.

For bad programmed C code, the speedup is even more drastic, but of course better first learn program C then.

The whole question here is of course whether it is WORTH to program in assembly. There is a number of side effects to it:
a) you are nonstop busy debugging. Really 99.99% of
your time goes to being in debug environments.
Having 2 MB of C code means quickly
10MB of assembly code.
b) assembly just works for 1 compiler in 1 OS.
c) writing optimal assembly is a fulltime job in itself. Maintaining generic C code is SO SO easy. Today i've been busy with a position diep misevaluated bigtime. Fruit 2.1 said black up 0.3 to it and for some weird reason diep's kingsafety and other small factors contributed to it that diep said white +0.890 to the position. In C it's far easier to figure out where the problems are than in Assembly.

So the answer is YES, assembly speeds up huge amounts, but please also realize how much time it takes and how incompatible it is, as well as that a chessprogram in assembly is simply fulltime work. How many have fulltime time to do that?

IMHO assembly is only interesting now for the wristwatch/mobile cellphone processors.

A chessprogram in assembly at such a cellphone will kick butt of course.
6 ply versus 7. You bet 7 is better than 6.

But if GENERIC code can get you just 0.1% faster with only a few days work, you sure should do that effort!

0.1% for just some programming work is a LOT of speed. You win it once and never give it back.

Vincent

p.s. i feel i can speedup diep's speed about a 10% if i would mix everywhere 8 bits datastructures with 32 bits. But that adds a lot of debugging problems too and the sureness of bugs in 2.2 MB source code.

by **Reinhard Scharnagl** » 30 Jun 2005, 19:21

Vincent,

to what you have written on assembler I mostly agree. I have some experiences in assembler. Once upon a time I have e.g. written a complete floating point library including trigonometric functions, sqrt e.t.c, to a time, where no coprocessors existed (Z80), as a binding to a PASCAL compiler.

Actually it is very difficult to write real optimized code by hand because of the existing multiple pipes within the CPU. The difference of a compiled well written C/C++ program compared to an assembler written one is not that big as factor two in average. But indeed there could be parts of a program, where it would be a good tuning idea to translate them into sophisticated assembler language.

But hunting merely 0.1% by using assembler would be a bad idea, because chances are huge, that you would get 100% frustrated by that.

And using assembler should be done AFTER all concepts have been realized and successfully tested. Also even assembler is not able to compensate a bad data design. And it is (mostly) unnecessary to translate a whole project into assembler language.

Reinhard.

by **diepeveen** » 30 Jun 2005, 21:06

[quote="Reinhard Scharnagl"]Vincent,

to what you have written on assembler I mostly agree. I have some experiences in assembler. Once upon a time I have e.g. written a complete floating point library including trigonometric functions, sqrt e.t.c, to a time, where no coprocessors existed (Z80), as a binding to a PASCAL compiler.

Actually it is very difficult to write real optimized code by hand because of the existing multiple pipes within the CPU. The difference of a compiled well written C/C++ program compared to an assembler written one is not that big as factor two in average. But indeed there could be parts of a program, where it would be a good tuning idea to translate them into sophisticated assembler language.

But hunting merely 0.1% by using assembler would be a bad idea, because chances are huge, that you would get 100% frustrated by that.

And using assembler should be done AFTER all concepts have been realized and successfully tested. Also even assembler is not able to compensate a bad data design. And it is (mostly) unnecessary to translate a whole project into assembler language.

Reinhard.[/quote]

At modern processors which are so so complex it really is a factor 2.

To give a simple example.

Compilers will NEVER use a conditional branch (except GCC if you force it to, but gcc is just *so* inefficient that of course there is a 1000 other ways to improve *that* code there in assembly).

At opteron/k7 this is like 2 extra cycles, whereas savings is nearly 20 cycles.

For some prime type software i wrote that mattered really a lot.

Secondly there is SSE2 and SSE possibilities. For floating point software usually that speeds up up to a factor 4.

With respect to chess you can save out everywhere loads of compares which speeds up hands down a factor 2.

I would rather argue if you are prepared to write unreadable code you can get it factor 3 faster.

It's not only pipes and branches that's slow. Each sequence of instructions has its own advantages/disadvantages.

Perfect hand scheduling of instructions and so on. It really is a *huge* difference.

Obviously you will be busy rewriting parts of your code up to 100 times litterary. LITTERARY. It really takes painful amounts of time to optimize to that level.

<H2>But that it's lightning fast, there is *no* question about it.</H2>

Vincent

by **Gerd Isenberg** » 30 Jun 2005, 23:01

Pallav Nawani wrote:
Alessandro Scotti wrote:Also, a version without inline assembly ran only slightly slower and it's now how the program is compiled for all platforms but Windows, where assembly still grants a small advantage.

About how much is the advantage (say in %) you get when you use inline assembly?
Alessandro Scotti wrote:However, I'm considering writing a new engine and choosing whether to use bitboards or not is an important decision so I would like to understand more on this matter.

Funny you'd say that, because I have started to slowly switch Natwarlal from 0x88 based board representation to bitboards. Since you have experience in making a bitboard based program, perhaps you can tell me some tricks and gotchas about bitboards?

As far as I can see, these are the advantages for Array based representations:
(a) Faster move generation.
(b) Flexibility. Bitboards can only be used to make chess programs with 8x8 board. If you want hexagonal chess, or 10x12 (whatever it is) etc, bitboards are not an option.

-----
I agree with you - otoh, as bitboards with 2*32bit, why not other boards with 2*64bits or an n-bit bitset class. (Sliding) Attack generation may be done by kogge-stone fillroutines as well.
-----

(c) Lower memory footprint (?). I saw that beowulf had 2 tables of size 64K on stack for a bitcount function!

-----
Of course there are other simd wise algorithms for population counting one bits, calculation versus huge tables indexed by 16-bit words. One may try both versions because things interact a lot and often behave "chaotical" with modern percessors.
-----

And now the advantages for bitboards that I can see:
(a) Faster evaluation. Calculating whether a pawn is passed or not requires just one and.
(b) No need to mess around with piece lists and such, bitboards already do the job well.
(c) No need to scan the empty square when generating captures.
(d) No need to test for the end of the board.
(e) Future proof: Bitboards will have a great speed advantage when all the PCs become 64 bit!

-----
Depending on the design. For rotated bitboards not that much between x86-32 versus x86-64 - may be also due to compiler are stil too "new" and the compiler guys have a lot to improve. For fill based bitboards as much as 64-bit registers (or mmx/sse2) as possible - no chance with x86-32..

Imho important is the immanent parallelism by working with sets.
For instance you can get sets of passers by shifting up/down opposite pawns ored with their left/right attacks. All own pawns outside that set are passers. That implies a kind of loop hoisting. Instead of looking for properties of singular pawns/pieces inside an inner loop, one build sets of pawns/pieces with these properties.
-----

Also see this paper (In case you haven't already).
http://www.cis.uab.edu/info/faculty/hyatt/boardrep.html

Alessandro Scotti wrote:What problems do you see for mobility? At present, evaluating mobility for say a bishop requires generating the bishop moves bitboard (which isn't too slow) and counting the bits.

I don't think that there are any problems at all.. In an array based representation, evaluating mobility requires generating the bishop moves, and counting them. Same thing.

-----

fyi - if SSE2 available. Not that portable, for P4 and K8 in 32-bit and 64-bit mode. The SSE2- intrinsics are also available for msc, intel c++ and iirc GCC as well. It implements a kind of weighted bitcount in about 40 cycles.

/* for average weights < 64 */
int dotProduct64(BitBoard bb, BYTE weights[] /* XMM_ALIGN */)
{
static const BitBoard XMM_ALIGN sbitmask[2] = {
0x8040201008040201,
0x8040201008040201
};
__m128i x0, x1, x2, x3, bm;
bm = _mm_load_si128 ( (__m128i*) sbitmask);
x0 = _mm_loadl_epi64 ( (__m128i*) &bb);
// extend bits to bytes
x0 = _mm_unpacklo_epi8 (x0, x0);
x2 = _mm_unpackhi_epi16 (x0, x0);
x0 = _mm_unpacklo_epi16 (x0, x0);
x1 = _mm_unpackhi_epi32 (x0, x0);
x0 = _mm_unpacklo_epi32 (x0, x0);
x3 = _mm_unpackhi_epi32 (x2, x2);
x2 = _mm_unpacklo_epi32 (x2, x2);
x0 = _mm_cmpeq_epi8 (_mm_and_si128 (x0, bm), bm);
x1 = _mm_cmpeq_epi8 (_mm_and_si128 (x1, bm), bm);
x2 = _mm_cmpeq_epi8 (_mm_and_si128 (x2, bm), bm);
x3 = _mm_cmpeq_epi8 (_mm_and_si128 (x3, bm), bm);
// multiply by "and" with -1 or 0
__m128i* pw = (__m128i*) weights;
x0 = _mm_and_si128 (x0, pw[0]);
x1 = _mm_and_si128 (x1, pw[1]);
x2 = _mm_and_si128 (x2, pw[2]);
x3 = _mm_and_si128 (x3, pw[3]);
// add all bytes (with saturation)
x0 = _mm_adds_epu8 (x0, x1);
x0 = _mm_adds_epu8 (x0, x2);
x0 = _mm_adds_epu8 (x0, x3);
x0 = _mm_sad_epu8 (x0, _mm_setzero_si128 ());
return _mm_extract_epi16 (x0, 0)
+ _mm_extract_epi16 (x0, 4);
}

- Gerd

Pallav

by **diepeveen** » 01 Jul 2005, 00:45

Gerd,

Can you also show us how to avoid bitboards being factor 3 times slower or so when we want to use more chessknowledge than just a simple summation count nor when we want a simplistic vector multiply.

Just simple logics for each square also using attacktables and especially how many times a square gets attacked.

A crucial thing all humans always take into account.

Note also the real old gnuchess 4.0 (non bitboards) allows using this.

Also Ed's datastructure allows it more or less in his 8 bits attacktable datastructure.

Please show me bitboards and in a generic code that also works at other processors than opteron.

I'm not using SSE2 assembly code in DIEP because i hate to be dependant upon running only with a certain windows compiler at A64's.

Diep also runs under linux and at other processors than A64's.

Please show something generic that also works at other processors than A64.

And now the advantages for bitboards that I can see:
(a) Faster evaluation. Calculating whether a pawn is passed or not requires just one and.

Difficult to read evaluation in bitboards. Every bitmask you must check and recheck 10 times. Calculation of any complex loop is dead slow.
Slow vector instructions just kill your processors IPC.

Slower evaluation than is possible in non-bitboards because you only have 1 bit of information. To do any sort of 'count' is very hard.

Try to get the total count out of 8 different bitboards at position square_c3.

In my "straightforward' datastructure i just do count = x[i]&15;

And i have that number.

(b) No need to mess around with piece lists and such, bitboards already do the job well.

In bitboards you need to write out for black and white everything, see crafty code. If you argue that's not needed to do in bitboards, then please lift your arm high in the sky indicating us whether you wrote generic code that is both working for both sides and several pieces in a generic way.

What qualifies is if you use this in the searching engine :
BITBOARD piecelist[2][6];

(c) No need to scan the empty square when generating captures.

In short all you can do in bitboards is captures?

Excuse me, if there is 40 semi legal moves, i want to generate *all* of them. There might be some move inside that raises my score a lot.

This where the lightning fast speed to scan squares gets used for my mobility, activity, scans throughout the evaluation (which eats all system time). That code is 2.2 times faster than 32 bits bitboards and like 1.5 to 64 bits bitboards.

(d) No need to test for the end of the board.

With precomputed tables there is no need to do that either to generate.

(e) Future proof: Bitboards will have a great speed advantage when all the PCs become 64 bit!

As you can prove, 32 bits to 64 bits move speeds you up around 30% (Crafty). So instead of a datastructure being factor 2.2 times slower and completely incapable to handle what todays strong programs need; that's easy datastructure that is fast and allows for little bugs

(f) bitboard programs loaded with bugs because of mistakes in conversion between 32 and 64 bits.

(g) very difficult to add more knowledge in bitboards. only the KISS principle gets used for knowledge. Complex knowledge you won't find there.

(h) the Hyatt argument, bitboards are very useful when you don't have a fast L1 cache in your processor and then search for a solution to avoid array lookups as those go to straight memory.

Good news for you, your processor DOES have a L1 cache, and a FAST one !!

(i) As a bitboarder if you lose a game, you can complain it is not your fault but from the compiler builder

Depending on the design. For rotated bitboards not that much between x86-32 versus x86-64 - may be also due to compiler are stil too "new" and the compiler guys have a lot to improve. For fill based bitboards as

Of course, compilers are made by machines who are 100% bugfree coders. In reality it is some stupidity either from you or from the compiler guysf in 32/64 bits. It doesn't matter who made the mistake. What matters is that you have to write code that works bugfree. Generic code usually is doing that. If you lose because of a compiler bug or because of some dumb casting bug you overlooked, that doesn't matter for your opponent. He'll take the point.

(i) in bitboards in the year 2005 you are still busy with hexadecimal values. Now not 8 bits long like they were in the 80s, but a confusing 64 bits in bitboards.

If you write in a normal datastructure some chessknowledge, it requires relative easy debugging to get the thing working correct. In bitboards you really must recheck things 10 times. Masks look real funny you know:

0x000290000000001000

Do you know whether the above has the right mask selected for your bitboard pattern?

Easy checking in bitboards!

In Non-bitboards things are just so much easier to write generic and if you know the coordinate system in chess pretty well, like where a2 is and that a2 != a3, then the odds for mistakes there is far smaller. If there is a problem it's identified quicker!

(j) As a bitboarder in the year 2005 you are still busy with (in)line assembly

Way to go. How to *ever* expand your program when all you are busy with is finding a clever way to do some utmost tiny thing 1 cycle faster in SSE2?

(k) As a bitboarder you just know for sure that you'll face opponents running at more processors than you do. Majority of parallel engines are *not* bitboard engines. How many parallel bitboard engines are there actually?

In order of strength: Zappa, Crafty, ... ?

Imho important is the immanent parallelism by working with sets.

Faster than being busy with parallellism within the processor is parallellism between processors!

Brings you factor 3.x speedup at a dual opteron dual core.

(l) the Frans Morsch passer argument

At how many spots in your eval do you need to know whether something is a passer. If more than 2, why not keep a bitboards array containing all the passers you got?

BITBOARD passers[2];

Real easy.

Hey, that's even faster than calculating them!

Let me quote again why i definitely do NOT want to think about dicking with bitboards, as i like to be busy with a bit more highlevel things that contain more chessknowledge than a vector multiply gives:

int dotProduct64(BitBoard bb, BYTE weights[] /* XMM_ALIGN */)
{
static const BitBoard XMM_ALIGN sbitmask[2] = {
0x8040201008040201,
0x8040201008040201
};
__m128i x0, x1, x2, x3, bm;
bm = _mm_load_si128 ( (__m128i*) sbitmask);
x0 = _mm_loadl_epi64 ( (__m128i*) &bb);
// extend bits to bytes
x0 = _mm_unpacklo_epi8 (x0, x0);
x2 = _mm_unpackhi_epi16 (x0, x0);
x0 = _mm_unpacklo_epi16 (x0, x0);
x1 = _mm_unpackhi_epi32 (x0, x0);
x0 = _mm_unpacklo_epi32 (x0, x0);
x3 = _mm_unpackhi_epi32 (x2, x2);
x2 = _mm_unpacklo_epi32 (x2, x2);
x0 = _mm_cmpeq_epi8 (_mm_and_si128 (x0, bm), bm);
x1 = _mm_cmpeq_epi8 (_mm_and_si128 (x1, bm), bm);
x2 = _mm_cmpeq_epi8 (_mm_and_si128 (x2, bm), bm);
x3 = _mm_cmpeq_epi8 (_mm_and_si128 (x3, bm), bm);
// multiply by "and" with -1 or 0
__m128i* pw = (__m128i*) weights;
x0 = _mm_and_si128 (x0, pw[0]);
x1 = _mm_and_si128 (x1, pw[1]);
x2 = _mm_and_si128 (x2, pw[2]);
x3 = _mm_and_si128 (x3, pw[3]);
// add all bytes (with saturation)
x0 = _mm_adds_epu8 (x0, x1);
x0 = _mm_adds_epu8 (x0, x2);
x0 = _mm_adds_epu8 (x0, x3);
x0 = _mm_sad_epu8 (x0, _mm_setzero_si128 ());
return _mm_extract_epi16 (x0, 0)
+ _mm_extract_epi16 (x0, 4);
}

- Gerd

Very clever the above Gerd, that you manage to do a vector operation in SSE2 of an array times a bitboard. Now the most important question is of course, does being busy for a full year with a few tiny optimizations also bring 50 rating points for Isichess?

It's obvious that each year chesssoftware has to improve in order to keep its status in the world top. Whether that's 50 ratingpoints or less is not relevant question here. But it is *some* sort of improvement.

What becomes world champion in 2003 won't win in 2004. What won in 2004 won't win in 2005. Progress is needed because competitors also improve each year.

How is bitboard engines going to manage that?

(w) Being bitboards makes it a real challenge to win the world title; after the 80s never a bitboards engine again won the world title

-----
Vincent

by **Zach Wegner** » 01 Jul 2005, 06:57

Hi Vincent,

I'd like to comment some on this post, but not quite all of it, it's pretty long

My kind-of old program was bitboard-based, but is different from your characterization of bitboards in some ways. It has only a couple lines of assembly that can be turned off, so it works on other platforms (Well actually I haven't programmed in windows functions for input and time, but it wouldn't be too hard at all). It used attack tables, which IMO are more useful than Ed's. There's an old thread in this forum on attack tables where I explained it in more depth, but for now I will just say that it compresses data into 8 bits to count the attacks of (almost) every piece type, and the compression just involves 2 256-byte arrays, so it's not too slow. It allows me to do a pretty accurate SEE in practically no time. The way I update these tables works quite well with bitboards, but uses some extra memory. My engine uses fairly generic code, with one piece of code for both sides, but the pieces are still separated out. I agree with you about the captures point, I always generate all moves in the tree, although there still is a speedup in quiescence (at least relative to non-bitboard programs from main search to quiescence). I tried to get this program to be SMP, I wrote out an iterative search and all the implementation of DTS, but it was so hard for me to debug it never really got off the ground. I think what you did with Diep getting it to run on that super computer was pretty impressive. You obviously have a lot more experience in threaded programming; I learned pretty much everything I know about SMP from Crafty and the posix man pages. My old engine is pretty crappy though, it's so buggy that I've never really got it running well enough for public consumption. The evaluation isn't really tuned very well at all, and it ends up playing pretty badly. The code is old and hard to work with (I've been using the same code base as when I first learned bitboards, this was probably when I was 16). I've just started a new program from scratch though, with a lot of better design features. I will eventually release it open source, and I think it might be something you would be interested in. I'm thinking of having a large complicated evaluation, selective search, as well as SMP in an easy-to-read generic program.

Anyways, some very interesting discussion going on in here lately; glad you joined the forum.

Zach

by **Tord Romstad** » 01 Jul 2005, 11:00

Hi Zach!

Zach Wegner wrote:My kind-of old program was bitboard-based, but is different from your characterization of bitboards in some ways. It has only a couple lines of assembly that can be turned off, so it works on other platforms (Well actually I haven't programmed in windows functions for input and time, but it wouldn't be too hard at all). It used attack tables, which IMO are more useful than Ed's. There's an old thread in this forum on attack tables where I explained it in more depth, but for now I will just say that it compresses data into 8 bits to count the attacks of (almost) every piece type, and the compression just involves 2 256-byte arrays, so it's not too slow. It allows me to do a pretty accurate SEE in practically no time.

What I miss most after abandoning attack tables is the ability to compute safe mobility (as opposed to pseudo-legal mobility) almost for free. Perhaps I will add attack tables to Glaurung some time in the future.

I've just started a new program from scratch though, with a lot of better design features. I will eventually release it open source, and I think it might be something you would be interested in. I'm thinking of having a large complicated evaluation, selective search, as well as SMP in an easy-to-read generic program.

Sounds very interesting! I look forward to seeing it. One little piece of advice (which might seem obvious, but I learned it the hard way myself): A large, complicated evaluation function is fine, but make sure you write it very slowly and patiently. Otherwise you will just end up with something buggy and difficult to tune. Start with a simple and fast eval, and add one new feature at the time. Test carefully that each major new component of the eval really improves the program. I've found that even apparently obvious improvements often makes my program play worse.

Anyways, some very interesting discussion going on in here lately; glad you joined the forum.

AOL! And it's nice to see you post here again, too. I haven't seen you here recently.

Tord

by **Gerd Isenberg** » 01 Jul 2005, 11:32

diepeveen wrote:Gerd,

Can you also show us how to avoid bitboards being factor 3 times slower or so when we want to use more chessknowledge than just a simple summation count nor when we want a simplistic vector multiply.

Just simple logics for each square also using attacktables and especially how many times a square gets attacked.

A crucial thing all humans always take into account.

Note also the real old gnuchess 4.0 (non bitboards) allows using this.

Also Ed's datastructure allows it more or less in his 8 bits attacktable datastructure.

Please show me bitboards and in a generic code that also works at other processors than opteron.

I'm not using SSE2 assembly code in DIEP because i hate to be dependant upon running only with a certain windows compiler at A64's.

Diep also runs under linux and at other processors than A64's.

Please show something generic that also works at other processors than A64.

And now the advantages for bitboards that I can see:
(a) Faster evaluation. Calculating whether a pawn is passed or not requires just one and.

Difficult to read evaluation in bitboards. Every bitmask you must check and recheck 10 times. Calculation of any complex loop is dead slow.
Slow vector instructions just kill your processors IPC.

Slower evaluation than is possible in non-bitboards because you only have 1 bit of information. To do any sort of 'count' is very hard.

Try to get the total count out of 8 different bitboards at position square_c3.

In my "straightforward' datastructure i just do count = x[i]&15;

And i have that number.

(b) No need to mess around with piece lists and such, bitboards already do the job well.

In bitboards you need to write out for black and white everything, see crafty code. If you argue that's not needed to do in bitboards, then please lift your arm high in the sky indicating us whether you wrote generic code that is both working for both sides and several pieces in a generic way.

What qualifies is if you use this in the searching engine :
BITBOARD piecelist[2][6];

(c) No need to scan the empty square when generating captures.

In short all you can do in bitboards is captures?

Excuse me, if there is 40 semi legal moves, i want to generate *all* of them. There might be some move inside that raises my score a lot.

This where the lightning fast speed to scan squares gets used for my mobility, activity, scans throughout the evaluation (which eats all system time). That code is 2.2 times faster than 32 bits bitboards and like 1.5 to 64 bits bitboards.

(d) No need to test for the end of the board.

With precomputed tables there is no need to do that either to generate.

(e) Future proof: Bitboards will have a great speed advantage when all the PCs become 64 bit!

As you can prove, 32 bits to 64 bits move speeds you up around 30% (Crafty). So instead of a datastructure being factor 2.2 times slower and completely incapable to handle what todays strong programs need; that's easy datastructure that is fast and allows for little bugs

(f) bitboard programs loaded with bugs because of mistakes in conversion between 32 and 64 bits.

(g) very difficult to add more knowledge in bitboards. only the KISS principle gets used for knowledge. Complex knowledge you won't find there.

(h) the Hyatt argument, bitboards are very useful when you don't have a fast L1 cache in your processor and then search for a solution to avoid array lookups as those go to straight memory.

Good news for you, your processor DOES have a L1 cache, and a FAST one !!

(i) As a bitboarder if you lose a game, you can complain it is not your fault but from the compiler builder

Depending on the design. For rotated bitboards not that much between x86-32 versus x86-64 - may be also due to compiler are stil too "new" and the compiler guys have a lot to improve. For fill based bitboards as

Of course, compilers are made by machines who are 100% bugfree coders. In reality it is some stupidity either from you or from the compiler guysf in 32/64 bits. It doesn't matter who made the mistake. What matters is that you have to write code that works bugfree. Generic code usually is doing that. If you lose because of a compiler bug or because of some dumb casting bug you overlooked, that doesn't matter for your opponent. He'll take the point.

(i) in bitboards in the year 2005 you are still busy with hexadecimal values. Now not 8 bits long like they were in the 80s, but a confusing 64 bits in bitboards.

If you write in a normal datastructure some chessknowledge, it requires relative easy debugging to get the thing working correct. In bitboards you really must recheck things 10 times. Masks look real funny you know:

0x000290000000001000

Do you know whether the above has the right mask selected for your bitboard pattern?

Easy checking in bitboards!

In Non-bitboards things are just so much easier to write generic and if you know the coordinate system in chess pretty well, like where a2 is and that a2 != a3, then the odds for mistakes there is far smaller. If there is a problem it's identified quicker!

(j) As a bitboarder in the year 2005 you are still busy with (in)line assembly

Way to go. How to *ever* expand your program when all you are busy with is finding a clever way to do some utmost tiny thing 1 cycle faster in SSE2?

(k) As a bitboarder you just know for sure that you'll face opponents running at more processors than you do. Majority of parallel engines are *not* bitboard engines. How many parallel bitboard engines are there actually?

In order of strength: Zappa, Crafty, ... ?

Imho important is the immanent parallelism by working with sets.

Faster than being busy with parallellism within the processor is parallellism between processors!

Brings you factor 3.x speedup at a dual opteron dual core.

(l) the Frans Morsch passer argument

At how many spots in your eval do you need to know whether something is a passer. If more than 2, why not keep a bitboards array containing all the passers you got?

BITBOARD passers[2];

Real easy.

Hey, that's even faster than calculating them!

Let me quote again why i definitely do NOT want to think about dicking with bitboards, as i like to be busy with a bit more highlevel things that contain more chessknowledge than a vector multiply gives:

int dotProduct64(BitBoard bb, BYTE weights[] /* XMM_ALIGN */)
{
static const BitBoard XMM_ALIGN sbitmask[2] = {
0x8040201008040201,
0x8040201008040201
};
__m128i x0, x1, x2, x3, bm;
bm = _mm_load_si128 ( (__m128i*) sbitmask);
x0 = _mm_loadl_epi64 ( (__m128i*) &bb);
// extend bits to bytes
x0 = _mm_unpacklo_epi8 (x0, x0);
x2 = _mm_unpackhi_epi16 (x0, x0);
x0 = _mm_unpacklo_epi16 (x0, x0);
x1 = _mm_unpackhi_epi32 (x0, x0);
x0 = _mm_unpacklo_epi32 (x0, x0);
x3 = _mm_unpackhi_epi32 (x2, x2);
x2 = _mm_unpacklo_epi32 (x2, x2);
x0 = _mm_cmpeq_epi8 (_mm_and_si128 (x0, bm), bm);
x1 = _mm_cmpeq_epi8 (_mm_and_si128 (x1, bm), bm);
x2 = _mm_cmpeq_epi8 (_mm_and_si128 (x2, bm), bm);
x3 = _mm_cmpeq_epi8 (_mm_and_si128 (x3, bm), bm);
// multiply by "and" with -1 or 0
__m128i* pw = (__m128i*) weights;
x0 = _mm_and_si128 (x0, pw[0]);
x1 = _mm_and_si128 (x1, pw[1]);
x2 = _mm_and_si128 (x2, pw[2]);
x3 = _mm_and_si128 (x3, pw[3]);
// add all bytes (with saturation)
x0 = _mm_adds_epu8 (x0, x1);
x0 = _mm_adds_epu8 (x0, x2);
x0 = _mm_adds_epu8 (x0, x3);
x0 = _mm_sad_epu8 (x0, _mm_setzero_si128 ());
return _mm_extract_epi16 (x0, 0)
+ _mm_extract_epi16 (x0, 4);
}

- Gerd

Very clever the above Gerd, that you manage to do a vector operation in SSE2 of an array times a bitboard. Now the most important question is of course, does being busy for a full year with a few tiny optimizations also bring 50 rating points for Isichess?

It's obvious that each year chesssoftware has to improve in order to keep its status in the world top. Whether that's 50 ratingpoints or less is not relevant question here. But it is *some* sort of improvement.

What becomes world champion in 2003 won't win in 2004. What won in 2004 won't win in 2005. Progress is needed because competitors also improve each year.

How is bitboard engines going to manage that?

(w) Being bitboards makes it a real challenge to win the world title; after the 80s never a bitboards engine again won the world title

-----
Vincent

Vincent,

i am not a professional chess programmer and need some intellectual variety from my current 50++ hours job. Bitboards have more bitfrickling potential - and that is what count for me ;-)

Seriously, it is a matter of thinking paradigms - working setwise versus working elementwise. While you add some color and direction depending constants to get a target square of one pawn, i shift by a similar constant to get sets of target squares. All setwise operations like intersection, union, complement, remaining sets etc. are cheap bitwise instructions where an appropriate bitboard infrastructure takes advantage of.
If you have disjoint (direction and piece wise) attack sets, it is a matter of combining those sets with simple bitwise operations to get taboo or enprise sets of all kind of pieces. Of course if sets become low or single populated, things become less effiecient.

I admit that shift is not as general and generic as add, because processors usually have no generalized shift, where sign of the shift count determines the direction of the shift.

>Try to get the total count out of 8 different bitboards at position square_c3.
>In my "straightforward' datastructure i just do count = x[i]&15;

(Pre)calculating and caching is a big issue of course.
Guess your x array will take some (incremental) update as well per node.

I don't think board representation has such a big influence on engines strength. I love thinking bitboards and playing with SIMD-stuff and designing eval pattern with bitboards while i believe this is rather efficient, both with cute low level stuff and well designed oo approach.

- Gerd

by **Alessandro Scotti** » 01 Jul 2005, 14:48

Tord Romstad wrote:What I miss most after abandoning attack tables is the ability to compute safe mobility (as opposed to pseudo-legal mobility) almost for free. Perhaps I will add attack tables to Glaurung some time in the future.

Hi Tord,
how much is safe mobility important in your experience? I'm going to completely rewrite Kiwi's evaluation and this is something I'm considering to include. To avoid confusion (in my mind!!!) I usually think of "unsafe" mobility as "influence", and it seems it's useful to have it even together with safe mobility... did you have both or just the latter?

by **Tord Romstad** » 01 Jul 2005, 15:12

Hi Alessandro!

Alessandro Scotti wrote:Hi Tord,
how much is safe mobility important in your experience?

I am not sure how much it is worth in terms of playing strength (it would probably vary a lot depending on the engine anyway), but it results in an active and intelligent-looking style of play. It is probably not worth doing unless you have a very cheap way of doing it.

I'm going to completely rewrite Kiwi's evaluation and this is something I'm considering to include. To avoid confusion (in my mind!!!) I usually think of "unsafe" mobility as "influence", and it seems it's useful to have it even together with safe mobility... did you have both or just the latter?

A mix of both. With attack tables, it can be done quite simply by using a lookup table. Given the attack table entry for a square, you can find the types and numbers of all attackers of both colours. This can be used to count the number of pseudo-legal and safe moves for each piece type and colour to the given square. Compute a lookup table of mobility bonuses for all possible values of the attack table entries when the program starts up, using whatever formula you like. You can then evaluate mobility by looping through the squares of the board and use your lookup table for each square.

If you don't like big lookup tables for this kind of stuff, a nice alternative is to use a small hash table with hash keys generated from the attack table entries.

Tord

by **Daniel Shawul** » 02 Jul 2005, 05:14

Hi tord,alessandro
i tried safe mobility in scorpio but found it to be no good.
If i loop over squares twice,first for attack table generation and then for evaluating safe mobility, the slow down is unbearable even for my lazy evaling engine!
I also tried another approach.
If I generate attacktables ordered by least valuable piece IE pawns,minors..., I can do a gross safe mobility.
Exclude pawn attacked squares when calculating mobility for minors/rooks/queens, exclude minor attacked squares for rooks/queens etc.
This also doesn't work for me. May be i will try it again.
one point i missed at that time is that i should also calculate pseudo mobility and combine it with safe mobility. It is better to have all squares attacked than having none.
best
daniel

Winboard Forum

how much open source code did you read and understand?

Re: how much open source code did you read and understand?

Re: how much open source code did you read and understand?

Re: how much open source code did you read and understand?

Re: how much open source code did you read and understand?

Re: how much open source code did you read and understand?

Re: how much open source code did you read and understand?

Re: how much open source code did you read and understand?

Re: how much open source code did you read and understand?

Re: how much open source code did you read and understand?

Re: how much open source code did you read and understand?

Re: how much open source code did you read and understand?

Re: how much open source code did you read and understand?

Re: how much open source code did you read and understand?

Re: how much open source code did you read and understand?

Who is online