
32 bit optimization

PostPosted: 24 Feb 2009, 17:38
by Onno Garms
Hello everybody,

With the release of Onno 1.0 in the near future, I'm currently trying to do some 32-bit optimizations. I wrote Onno as a 64-bit engine, without caring about 32-bit performance. This results in a - think positive - speedup factor of about 1.8 for 64-bit over 32-bit. As several beta testers have asked for a 32-bit version, I am now trying to make it faster. The most effective thing to do might be to have two versions of the engine: mailbox for 32-bit and bitboard for 64-bit. But as a performant encapsulation of such different board representations seems to be impossible, this would mean doing most of the work twice. So I won't do that.

What else can be done?

What I am planning to do is:

1. For bitboards that live only on one side of the board, avoid using the other half. E.g., to test for a king on e1, use
Code:
lo(bb) & 5

instead of
Code:
bb & 5ui64


2. Replace the magic (64-bit) multiplication with two 32-bit multiplications, as described in other threads.

Unfortunately, I don't see much potential in these ideas. For 1, there are not many places to apply it. 2 might be a little more effective. But still, according to my profiler output, I would expect an overall speedup of only around 5%. Any more ideas?
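For the record, idea 1 might look like the following minimal sketch. The hi() accessor, the a1 = 0 square numbering (making e1 bit 4), and the mask constant are my assumptions for illustration; Onno's actual constants may differ.

```cpp
#include <cassert>
#include <cstdint>

typedef uint64_t Bitboard;

// Possible definitions of the lo()/hi() accessors: views of the two
// 32-bit halves of a bitboard.
inline uint32_t lo(Bitboard b) { return uint32_t(b); }
inline uint32_t hi(Bitboard b) { return uint32_t(b >> 32); }

// With a1 = 0 numbering, e1 is square 4, which lives entirely in the low
// half, so a 32-bit target only has to load and test one register.
const uint32_t E1_MASK = 1u << 4;

inline bool king_on_e1(Bitboard kings) {
  return (lo(kings) & E1_MASK) != 0;
}
```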

Onno

Re: 32 bit optimization

PostPosted: 24 Feb 2009, 20:02
by Tord Romstad
Hello Onno!

Onno Garms wrote:Hello everybody,

with the release of Onno 1.0 in the near future,


Cool! :)

Will it be GPLed?

I'm currently trying to do some 32-bit optimizations. I wrote Onno as a 64-bit engine, without caring about 32-bit performance. This results in a - think positive - speedup factor of about 1.8 for 64-bit over 32-bit. As several beta testers have asked for a 32-bit version, I am now trying to make it faster. The most effective thing to do might be to have two versions of the engine: mailbox for 32-bit and bitboard for 64-bit. But as a performant encapsulation of such different board representations seems to be impossible, this would mean doing most of the work twice. So I won't do that.


I'm in more or less the same situation as you: I've made the mistake of optimizing prematurely for 64-bit CPUs by making heavy use of bitboards and burning too many bridges, only to discover too late that performance on 32-bit CPUs is what really matters. I've pretty much concluded that rewriting with a mailbox representation is the way to go. It's technically easy, and not extremely time consuming, but so awfully boring... :(

What I am planning to do is:

1. For bitboards that live only on one side of the board, avoid using the other half. E.g., to test for a king on e1, use
Code:
lo(bb) & 5

instead of
Code:
bb & 5ui64


I haven't tried this, but it looks like it will get very messy in practice: If you want to test for a white king on e1, you will probably also want to test for a black king on e8. With the above approach, how would you do that with the same code for both colors?

2. Replace the magic (64-bit) multiplication with two 32-bit multiplications, as described in other threads.


I do this. It doesn't save a lot of time, but it's measurable, and there is almost no work involved.
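One common formulation of this trick can be sketched as below. Note two caveats: the XOR-based indexing requires magic constants generated specifically for this formula, and the function name and the magic shown in the test are placeholders of mine, not taken from any particular engine.

```cpp
#include <cassert>
#include <cstdint>

// 64-bit magic indexing, (occ & mask) * magic >> shift, replaced by two
// 32-bit multiplies and an XOR. The shift drops from "64 - bits" to
// "32 - bits", since the product is now only 32 bits wide. Requires magics
// found specifically for this formula.
inline unsigned magic_index_32(uint64_t occ, uint64_t mask,
                               uint64_t magic, int shift) {
  unsigned lo = unsigned(occ) & unsigned(mask);
  unsigned hi = unsigned(occ >> 32) & unsigned(mask >> 32);
  return (lo * unsigned(magic) ^ hi * unsigned(magic >> 32)) >> shift;
}
```

On a 32-bit target this avoids the three or four multiplies and adds that a full 64x64-bit product would otherwise compile to.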

Any more ideas?


I don't have anything that will help a lot, but here are two simple things which should give you at least a little speedup:

Use piece lists for each piece type and color, to avoid looping through too many bitboards in your move generators and evaluation functions. I've found this to improve the speed even in 64-bit mode, but the difference is bigger in 32-bit.
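A piece list in this sense might be sketched like so; the layout, type names, and the per-type limit are illustrative assumptions, not any engine's actual data structure.

```cpp
#include <cassert>

// For each color and piece type, keep the occupied squares in a small
// array, so "loop over all white knights" is a short array walk instead
// of scanning several bitboards. Updated incrementally in make/unmake.
enum { COLORS = 2, TYPES = 6, MAX_PER_TYPE = 10 }; // 10 allows for promotions

struct PieceLists {
  int squares[COLORS][TYPES][MAX_PER_TYPE];
  int count[COLORS][TYPES];

  PieceLists() {
    for (int c = 0; c < COLORS; c++)
      for (int t = 0; t < TYPES; t++)
        count[c][t] = 0;
  }

  void add(int c, int t, int sq) { squares[c][t][count[c][t]++] = sq; }

  void remove(int c, int t, int sq) {
    int& n = count[c][t];
    for (int i = 0; i < n; i++)
      if (squares[c][t][i] == sq) { // swap with last entry and shrink
        squares[c][t][i] = squares[c][t][n - 1];
        --n;
        return;
      }
  }
};
```

Since piece order within a list does not matter, removal is a swap-with-last rather than a shifting erase.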

Write 32-bit optimized count_1s() functions. I use this code:

Code:
#if defined(BITCOUNT_LOOP)

inline int count_1s(Bitboard b) {
  int r;
  for(r = 0; b; r++, b &= b - 1);
  return r;
}

inline int count_1s_max_15(Bitboard b) {
  return count_1s(b);
}

#elif defined(BITCOUNT_SWAR_32)

inline int count_1s(Bitboard b) {
  unsigned w = unsigned(b >> 32), v = unsigned(b);
  v = v - ((v >> 1) & 0x55555555);
  w = w - ((w >> 1) & 0x55555555);
  v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
  w = (w & 0x33333333) + ((w >> 2) & 0x33333333);
  v = (v + (v >> 4)) & 0x0F0F0F0F;
  w = (w + (w >> 4)) & 0x0F0F0F0F;
  v = ((v+w) * 0x01010101) >> 24; // mul is fast on amd procs
  return int(v);
}

inline int count_1s_max_15(Bitboard b) {
  unsigned w = unsigned(b >> 32), v = unsigned(b);
  v = v - ((v >> 1) & 0x55555555);
  w = w - ((w >> 1) & 0x55555555);
  v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
  w = (w & 0x33333333) + ((w >> 2) & 0x33333333);
  v = ((v+w) * 0x11111111) >> 28;
  return int(v);
}

#elif defined(BITCOUNT_SWAR_64)

inline int count_1s(Bitboard b) {
  b -= ((b>>1) & 0x5555555555555555ULL);
  b = ((b>>2) & 0x3333333333333333ULL) + (b & 0x3333333333333333ULL);
  b = ((b>>4) + b) & 0x0F0F0F0F0F0F0F0FULL;
  b *= 0x0101010101010101ULL;
  return int(b >> 56);
}

inline int count_1s_max_15(Bitboard b) {
  b -= (b>>1) & 0x5555555555555555ULL;
  b = ((b>>2) & 0x3333333333333333ULL) + (b & 0x3333333333333333ULL);
  b *= 0x1111111111111111ULL;
  return int(b >> 60);
}

#endif // BITCOUNT


Tord

Re: 32 bit optimization

PostPosted: 28 Feb 2009, 19:15
by Onno Garms
Tord Romstad wrote:
Onno Garms wrote:with the release of Onno 1.0 in the near future


Cool! :) Will it be GPLed?


No. I consider writing chess engines a competitive sport, like playing chess as a human. Making GPLed releases is like handing all your elaborate preparation to your opponent. It's OK to exchange ideas, just as it's OK to analyze openings with a fellow human player, but releasing the source is too much for me. Sorry.

I will try to go commercial. But I will give out many free copies by some criterion that I have not yet defined precisely. Authors of strong open source engines will certainly get one. So will people with whom I have had discussions that helped significantly. I don't expect to sell many licenses, but I find that way more tempting.

I'm in more or less the same situation as you: I've made the mistake of optimizing prematurely for 64-bit CPUs by making heavy use of bitboards and burning too many bridges, only to discover too late that performance on 32-bit CPUs is what really matters. I've pretty much concluded that rewriting with a mailbox representation is the way to go. It's technically easy, and not extremely time consuming, but so awfully boring... :(


But I think rewriting as mailbox will slow down the engine on 64-bit. So this approach would mean writing most parts of your engine twice: mailbox for 32-bit and bitboard for 64-bit computers. And I do think that this would be extremely time consuming, because it's not just rewriting once but maintaining two programs simultaneously.

I haven't tried this, but it looks like it will get very messy in practice: If you want to test for a white king on e1, you will probably also want to test for a black king on e8. With the above approach, how would you do that with the same code for both colors?


When you have the color as a variable, this is indeed a problem. For this reason there were many places where I could not simplify. But, for example, when evaluating trapped bishops (as in 8/B1pk4/1p6/8/8/8/8/7K w - -), I had unrolled the loop over colors anyway, so I could easily use only the relevant halves of the bitboards. The eval code is only marginally obfuscated by that, and all 32/64-bit differences are confined to the file bitboard.h.
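Sketched in code, the unrolled version might look like this. The a1 = 0 square numbering and the penalty value are assumptions for illustration, not Onno's actual eval.

```cpp
#include <cassert>
#include <cstdint>

inline uint32_t lo(uint64_t b) { return uint32_t(b); }
inline uint32_t hi(uint64_t b) { return uint32_t(b >> 32); }

// Trapped-bishop pattern with the color loop unrolled by hand: the white
// pattern (bishop a7 = bit 48, pawn b6 = bit 41) lives entirely in the
// high half, the mirrored black pattern (bishop a2 = bit 8, pawn b3 =
// bit 17) entirely in the low half, so each test touches only one 32-bit
// word. The penalty of 100 is made up for illustration.
inline int trapped_bishops(uint64_t wBishops, uint64_t bPawns,
                           uint64_t bBishops, uint64_t wPawns) {
  int score = 0;
  if ((hi(wBishops) & (1u << (48 - 32))) && (hi(bPawns) & (1u << (41 - 32))))
    score -= 100; // white bishop trapped on a7
  if ((lo(bBishops) & (1u << 8)) && (lo(wPawns) & (1u << 17)))
    score += 100; // black bishop trapped on a2
  return score;
}
```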

Use piece lists for each piece type and color, to avoid looping through too many bitboards in your move generators and evaluation functions. I've found this to improve the speed even in 64-bit mode, but the difference is bigger in 32-bit.


When starting my engine, I had such lists. But I found that they did not help speed in 64-bit mode, so I eventually removed them. I don't think I will reintroduce them.

Write 32-bit optimized count_1s() functions. I use this code:


Thanks. I have put this on my todo list.

One more idea I had is a branchless lsb function for 32 bit. Obviously it needs more instructions than an lsb function with a conditional jump, so I'm not sure if it is worth it. Are there any rules of thumb for how many additional instructions equate to a branch misprediction?
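For what it's worth, one branch-free formulation uses the de Bruijn multiplication from the well-known Bit Twiddling Hacks page; whether the extra ALU work actually beats a well-predicted conditional jump is exactly the open question, so this only shows that such a formulation exists.

```cpp
#include <cassert>
#include <cstdint>

// Branch-free lsb() for 32-bit targets. Selects the non-empty 32-bit half
// with a mask instead of a jump, then finds the bit with a 32-bit de Bruijn
// multiplication (table and constant from Bit Twiddling Hacks).
// Precondition: b != 0.
inline int lsb32_branchless(uint64_t b) {
  static const int table[32] = {
     0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
    31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
  };
  uint32_t lo = uint32_t(b), hi = uint32_t(b >> 32);
  uint32_t useHi = 0u - uint32_t(lo == 0);   // all-ones iff low half is empty
  uint32_t x = (lo & ~useHi) | (hi & useHi); // select a half without a jump
  return table[((x & (0u - x)) * 0x077CB531u) >> 27] + int(32 & useHi);
}
```

The (lo == 0) comparison typically compiles to a setcc rather than a jump, so the whole function is straight-line code.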

Re: 32 bit optimization

PostPosted: 28 Feb 2009, 21:18
by Miguel A. Ballicora
Tord Romstad wrote:Hello Onno!

Onno Garms wrote:Hello everybody,

with the release of Onno 1.0 in the near future,


Cool! :)

Will it be GPLed?

I'm currently trying to do some 32-bit optimizations. I wrote Onno as a 64-bit engine, without caring about 32-bit performance. This results in a - think positive - speedup factor of about 1.8 for 64-bit over 32-bit. As several beta testers have asked for a 32-bit version, I am now trying to make it faster. The most effective thing to do might be to have two versions of the engine: mailbox for 32-bit and bitboard for 64-bit. But as a performant encapsulation of such different board representations seems to be impossible, this would mean doing most of the work twice. So I won't do that.


I'm in more or less the same situation as you: I've made the mistake of optimizing prematurely for 64-bit CPUs by making heavy use of bitboards and burning too many bridges, only to discover too late that performance on 32-bit CPUs is what really matters. I've pretty much concluded that rewriting with a mailbox representation is the way to go. It's technically easy, and not extremely time consuming, but so awfully boring... :(

What I am planning to do is:

1. For bitboards that live only on one side of the board, avoid using the other half. E.g., to test for a king on e1, use
Code:
lo(bb) & 5

instead of
Code:
bb & 5ui64


I haven't tried this, but it looks like it will get very messy in practice: If you want to test for a white king on e1, you will probably also want to test for a black king on e8. With the above approach, how would you do that with the same code for both colors?

2. Replace the magic (64-bit) multiplication with two 32-bit multiplications, as described in other threads.


I do this. It doesn't save a lot of time, but it's measurable, and there is almost no work involved.

Any more ideas?


I don't have anything that will help a lot, but here are two simple things which should give you at least a little speedup:

Use piece lists for each piece type and color, to avoid looping through too many bitboards in your move generators and evaluation functions. I've found this to improve the speed even in 64-bit mode, but the difference is bigger in 32-bit.

Write 32-bit optimized count_1s() functions. I use this code:

[count_1s code snipped]


Tord


I am not sure it is a good idea to have two versions of an engine just to optimize for an obsolete platform (optimizing for mobile phones may be a different thing altogether, though). A couple of functions here and there, maybe, but drastically changing the internal data structures will automatically reduce the usefulness of testing. Testers may detect bugs related to one version and not the other. In other words, the 64-bit version will have a lot of code that is not executed by most testers. What is the point of giving the testers an engine that is not the "real" one?

Besides, is all this worth the trouble? I assume that this is a hobby for most of you, as it is for me. In other words, I am capable of doing the most stupid and inefficient things just because I find fun in them. But keeping and maintaining two versions just to give users an extra 10% speed sounds like heavy work and no fun for very little gain.

I started last decade with a bitboard scheme that used a structure with an array[2] of long ints or an array[4] of short ints. It was an exercise in portability. In addition, there were several functions that were naturally optimized for 32 bits. For instance, rather than popcounting the whole bitboard, you can popcount each 32-bit int separately. That was faster. I also had special AND, OR, NOT, etc. macros to do everything.
Three or four months ago, I dropped this scheme and converted everything to 64-bit ints.
Doing that, I sped up the 64-bit compile, but I slowed down the 32-bit compile by ~10%. However, the code became so much clearer!
The clearer code compensates for the 10%, because it is now easier to try different things in eval.

Miguel

Re: 32 bit optimization

PostPosted: 05 Mar 2009, 22:40
by Bo Persson
Miguel A. Ballicora wrote:
I started last decade with a bitboard scheme that used a structure with an array[2] of long ints or an array[4] of short ints. It was an exercise in portability. In addition, there were several functions that were naturally optimized for 32 bits. For instance, rather than popcounting the whole bitboard, you can popcount each 32-bit int separately. That was faster. I also had special AND, OR, NOT, etc. macros to do everything.
Three or four months ago, I dropped this scheme and converted everything to 64-bit ints.
Doing that, I sped up the 64-bit compile, but I slowed down the 32-bit compile by ~10%. However, the code became so much clearer!
The clearer code compensates for the 10%, because it is now easier to try different things in eval.

Miguel


Totally agree with this!

Having moved from 32-bit to 64-bit, I realized that some things that were just too expensive in 32-bit code are now actually reasonable to do. This affects things like which evaluation terms you can afford to use. Which in turn might affect what info you have to maintain.

So, having a common code base for both 32 and 64 bits is really difficult. At least one of the modes will suffer from it.

At worst, both!