Winboard Forum

by **Gerd Isenberg** » 06 Dec 2006, 21:23

Michael Sherwin wrote:Thanks Gerd,

It keeps getting better and better for the bishop! Still the rook is requireing 4 shifts. Would it be better to do one shift followed by a 32 bit multiply and then just using the indexs from the upper 32 bits with out shifting them back down?

Mike

Yes, 32-bit imul seems preferable on K8.

Your index splitting idea is really marvelous. I took me some time to get that!

I think to pre-shift-and the upper 16 bits of the folded 32-bit word is not necessary. The intended 16-bit sum might still > 0xffff, therefor a post-shift-and seems necessary. Also with your index splitting the Java array might be obsolete :

Code: Select all: u64 BishopAttacks(u64 occ, u32 sq) { u32 index; u64 bb1, bb2; occ &= bishopBits[sq]; index = (u32) occ; index += (u32)(occ >> 32) >> 1; index += index >> loBishopShifts[sq]; // assuming shifting >= 16 bb1 = loBishopAttacks[sq][index & 0xff]; bb2 = hiBishopAttacks[sq][(index >> 8) & 0xff]; return bb1 & bb2; }

Gerd

by **Gerd Isenberg** » 06 Dec 2006, 21:48

H.G.Muller wrote:
Michael Sherwin wrote:Still I have read in every paper on optimization that multiple shifts and adds to do multiplication is faster than using imul. What has changed?

I don't think anything has changed. Integer multiply is only a slow instruction on the Pentium 4 (where shifts are even slower...), which no one uses for chess. See, for instance, here.

K8 predecessor from AMD, like Athlon K7 has vector path imul/mul.

Anyway, 32-bit mode is passée, now that everyone has K8 or Core 2...

Agreed! But some programs like Diep suffer from 64-bit mode, despite bigger register files. Accessing globals becomes relativ more expensive and takes almost one additional "base"-register indirection. Signed indices, 64-bit pointer, and some 64-bit calling conventions may bother some programs as well.

by **Michael Sherwin** » 06 Dec 2006, 21:52

Gerd Isenberg wrote:
Michael Sherwin wrote:Thanks Gerd,

It keeps getting better and better for the bishop! Still the rook is requireing 4 shifts. Would it be better to do one shift followed by a 32 bit multiply and then just using the indexs from the upper 32 bits with out shifting them back down?

Mike

Yes, 32-bit imul seems preferable on K8.

Your index splitting idea is really marvelous. I took me some time to get that!

I think to pre-shift-and the upper 16 bits of the folded 32-bit word is not necessary. The intended 16-bit sum might still > 0xffff, therefor a post-shift-and seems necessary. Also with your index splitting the Java array might be obsolete :

Code: Select all
u64 BishopAttacks(u64 occ, u32 sq) { u32 index; u64 bb1, bb2; occ &= bishopBits[sq]; index = (u32) occ; index += (u32)(occ >> 32) >> 1; index += index >> loBishopShifts[sq]; // assuming shifting >= 16 bb1 = loBishopAttacks[sq][index & 0xff]; bb2 = hiBishopAttacks[sq][(index >> 8) & 0xff]; return bb1 & bb2; }

Gerd

Because, there is no edge_of_the_board bit for the bishop, shifting by >= 15 should be okay? There is one square that requires >> 15.

For rooks in 32 bit mode, how about doing a 32 imul of each half and collecting the partial indexs seperatly, rather than doing 4 lookup shifts?

by **Gerd Isenberg** » 06 Dec 2006, 22:02

Michael Sherwin wrote:For rooks in 32 bit mode, how about doing a 32 imul of each half and collecting the partial indexs seperatly, rather than doing 4 lookup shifts?

I thought you would like to outperform the 32-bit version of kindergarten-bitboards ;-)

by **Michael Sherwin** » 06 Dec 2006, 22:23

Gerd Isenberg wrote:
Michael Sherwin wrote:For rooks in 32 bit mode, how about doing a 32 imul of each half and collecting the partial indexs seperatly, rather than doing 4 lookup shifts?

I thought you would like to outperform the 32-bit version of kindergarten-bitboards

Well, yes! :mrgreen:

It is just that this idea of 32 bit imul not being very expensive anymore has got me all confused!

Besides, there can be 9 bits on one side of the board, so the indexs would have to be combined before they were split anyway.

You mentioned parallel prefix operations a couple of times. Does this refer to using both pipelines more efficiantly? :?:

Does the shifting method work in parallel better? :?:

What is the break even point between shr and imul, 1/1, 2/1, 3/1, 3/2? :?:

At this stage it would be more important to just get my folding idea up and running and play around with imul later.

by **Gerd Isenberg** » 06 Dec 2006, 22:51

It is a good idea to relax dependencies as long as we have enough regsiters. Parallel prefix operations, as H.G. already mentioned, imply a dependency chain, since each instruction is dependent on the result of the previous. On the other hand one can not ignore the win of parallel prefix operations to reduce from N expression to log2(n) - so it is always nice to mix independent instruction chains (some compiler do it quite well with inlined functions and respect to register pressure) to improve ipc.

The 32-bit kindergarten approach for rooks - also with two imuls to compare. To avoid the branch in 32-bit rankAttacks (or even 64-bit shifts), a rank-attack-getter similar to diagonals with the cost of one additional imul for (hopefully) best parallel scheduling:

Code: Select all: u64 fileAttacks(u64 occupied, u32 sq) { union {u64 b; struct {u32 l; u32 h;};} b; // little endian! u32 f = sq & 7; b.b = occupied; b.l = (b.l >> f) & 0x01010101; b.h = (b.h >> f) & 0x01010101; b.l = (b.l << 4) + b.h; u32 o = (b.l * 0x10080402) >> 26; b.b = aFileAttacks[o][sq>>3]; b.h <<= f; b.l <<= f; return b.b; } u64 rankAttacks(u64 occupied, u32 sq) { union {u64 b; struct {u32 l; u32 h;};} b; // little endian! u32 f = sq & 7; b.b = occupied & rankMask[sq]; u32 o = ((b.l + b.h) * 0x02020202) >> 26; return fillUpAttacks[o][F] & rankMask[sq]; } u64 rookAttacks(u64 occupied, u32 sq) { return rankAttacks(occupied, sq) | fileAttacks(occupied, sq); }

Gerd

by **Michael Sherwin** » 07 Dec 2006, 00:36

index += index >> loBishopShifts[sq]; // assuming shifting >= 16

Hi Gerd,

Under the constant shift of (>> 32) >> 1, for the high 32, three squares f3, a6, and c6 need a low shift of 15. The good news is that in all three cases there is not a bit set for h2!

I'm going to get a lotto ticket!!

Edit: Come to think of it, none of the squares have the h2 bit set. Okay, no lotto ticket this time.

Mike

by **Michael Sherwin** » 07 Dec 2006, 04:32

Code: Select all: u64 RookAttacks(u64 occ, u32 sq) { u32 index; occ &= rookBits[sq]; index = (u32)((occ & rankMask[sq]) >> rankShifts[sq]); index += (u32)((occ & hiFileMask[sq]) >> hiFileShifts[sq]); index += ((u32) ooc & loFileMask[sq]); index += ((index & 0xffff0000) >> loFileShifts[sq]); return (rookAttacks[rookOffsets[sq] + index]); }

The biggest problem that I am having with the rook code is having to 'dance' around the rank bits. I am having to lookup mask inorder to get the info I need and move it into place. I would rather just use general immediate mask and not worry about picking up rank bits that I have already moved into place. Deleting the rank from occ after it has been moved into place in the index seems very time consuming also. one solution that occured to me of rotating it into place on the first rank before moveing it to the index seems a good solution as it would be quick to move to the index and it would be out of the way there. The _rotr function of MSVC++ 6.0 is not ansi nor unix compatible. I would imagine that there are work arounds though. What should I do?

by **Michael Sherwin** » 07 Dec 2006, 06:17

H.G.Muller wrote:
Michael Sherwin wrote:Still I have read in every paper on optimization that multiple shifts and adds to do multiplication is faster than using imul. What has changed?

I don't think anything has changed. Integer multiply is only a slow instruction on the Pentium 4 (where shifts are even slower...), which no one uses for chess. See, for instance, here.

On other Pentiums (starting with the Pentium Pro) IMUL has a throughput of one per clock, and a latency of 4. (On the Core 2 the latency is reduced to 3.) The latency is usually of little concern, since the instruction is fully pipe-lined and other instructions execute in parallel. For shifts both latency and throughput are 1, but in the sequence of shift, add, shift, add every instruction is dependent on the result of the previous one, so that the latencies simply add. The total latency is thus also 4, but you can execute fewer instructions in parallel, because it was 4 instructions in stead of a single IMUL.

It is true that a full 64 x 64-bit multiply requires 3 multiplies in 32-bit mode, but if the purpose is not to do an exact multiply but merely to shuffle scattered occupancy bits to the high-order bits, you can actually do with two 32 x 32-bit multiplies:

Code: Select all
one = MagicLo[Sqr] * (int32) Occupied; two = MagicHi[Sqr] * (int32) (Occupied>>32); // no shift in 32-bit mode index = (one+two) >> (32-KEYLEN);

This is just 3 instructions with a total latency of 7 (since the two IMULs can't start in the same clock due to throughput restrictions). But note that 64-bit shifts and ADDs/ORs also have to be implemented by two instructions in 32-bit mode. So it is very questionable if you can win much.

Anyway, 32-bit mode is passée, now that everyone has K8 or Core 2...

I did not see this very nice explaination untill now!

Microsoft is still soaking the corporations for big bucks to have 64 bit operating systems. That is why IMO, that 64 bit computers have been out for years now and no one I know has a 64 bit OS on their 64 bit machine. Unless something has changed in the last couple of weeks I still can't buy a 64 bit computer with a 64 bit OS at a local store or even from dell. Microsoft could have come out with a 64 bit windows extender, so people could execute 64 bit programs while keeping their 32 bit drivers, but they were afraid that corporations would take the cheap way rather than spending the big bucks.

MICROSOFT DOES NOT GIVE A DAMM ABOUT THE LITTLE GUY AND THEY SHOW IT EVERY CHANCE THEY GET--OTHERWISE WE ALL WOULD HAVE 64 BIT OPERATING SYSTEMS THAT HAVE 64 BIT MACHINES. I SAY TO HELL WITH MICROSOFT! :evil:

Sorry, but you touched upon a sore point with me :!:

by **Michael Sherwin** » 07 Dec 2006, 06:46

Gerd Isenberg wrote:
Michael Sherwin wrote:Thanks Gerd,

It keeps getting better and better for the bishop! Still the rook is requireing 4 shifts. Would it be better to do one shift followed by a 32 bit multiply and then just using the indexs from the upper 32 bits with out shifting them back down?

Mike

Yes, 32-bit imul seems preferable on K8.

Your index splitting idea is really marvelous. I took me some time to get that!

I think to pre-shift-and the upper 16 bits of the folded 32-bit word is not necessary. The intended 16-bit sum might still > 0xffff, therefor a post-shift-and seems necessary. Also with your index splitting the Java array might be obsolete :

Code: Select all
u64 BishopAttacks(u64 occ, u32 sq) { u32 index; u64 bb1, bb2; occ &= bishopBits[sq]; index = (u32) occ; index += (u32)(occ >> 32) >> 1; index += index >> loBishopShifts[sq]; // assuming shifting >= 16 bb1 = loBishopAttacks[sq][index & 0xff]; bb2 = hiBishopAttacks[sq][(index >> 8) & 0xff]; return bb1 & bb2; }

Gerd

Hi Gerd,

I've a question concerning the changes that you made. For lookup tables I had a single dimentional array that had variable size subarrays in it, because so many indexs were only 2,3,4,5 bits long and very few are 8 bits. Does going to a two dimentional array with fixed subarray sizes speed up the program enough to out weigh the memory bloat? I have not computed the difference in memory size yet , however, if I had to guess it would be 16 to 1 or better.

If the answer to the above is yes then would this alternate approach be better?

Code: Select all: u08 loBishopShifts[] = { 17, 17, 19, 17, 17, 18, 18, 16, 17, 17, 17, 17, 18, 18, 16, 17, 17, 17, 17, 17, 27, 15, 17, 17, 16, 16, 16, 16, 16, 16, 18, 18, 17, 17, 16, 16, 16, 16, 19, 18, 15, 17, 15, 17, 17, 17, 19, 19, 18, 20, 20, 27, 16, 16, 16, 16, 16, 17, 16, 16, 17, 17, 17, 17}; u64 BishopAttacks(u64 occ, u32 sq) { u32 index; u64 bb1, bb2; occ &= bishopBits[sq]; index = (u32) occ; index += (u32)(occ >> 32) >> 1; index += index >> loBishopShifts[sq]; bb1 = loBishopAttacks[sq][index & 0xff]; bb2 = hiBishopAttacks[sq][(index >> 8) & 0xff]; return bb1 & bb2; } /* alternate method u64 BishopAttacks(u64 occ, u32 sq) { u32 index; u08 lbp, hbp; occ &= bishopBits[sq]; index = (u32) occ; index += (u32)(occ >> 32) >> 1; index += index >> loBishopShifts[sq]; lbp = loBishopPtr[sq][index & 0xff]; hbp = hiBishopPtr[sq][(index >> 8) & 0xff]; return bishopAttacks[lbp + hbp]; }*/

lbp (low bishop pointer) and hbp (high bishop pointer) when combined would point to a perfect minimal table of attack sets. For example, a bishop on d4 has 108 possible attack sets (IIAC) and that is the exact number stored in the bishop attack table for that square. The whole table is this way! And the loBishopPtr and hiBishopPtr tables are only u08, thus greatly reducing memory usage as well as being more 32 bit friendly. What is your opinion on this.

Thanks,
Mike

by **Gerd Isenberg** » 07 Dec 2006, 08:56

Michael Sherwin wrote:Hi Gerd,

I've a question concerning the changes that you made. For lookup tables I had a single dimentional array that had variable size subarrays in it, because so many indexs were only 2,3,4,5 bits long and very few are 8 bits. Does going to a two dimentional array with fixed subarray sizes speed up the program enough to out weigh the memory bloat? I have not computed the difference in memory size yet , however, if I had to guess it would be 16 to 1 or better.

Most likely i was wrong with the homogenious two-dimensional array - if you can safe so much memory with Java arrays but two additional indirections. As always - one has to try.

Gerd

by **H.G.Muller** » 07 Dec 2006, 12:39

Beyond some point trying to save more memory becomes useless. If every square uses 8 bits, 64 squares would require 16K entries. If each entry is an 8-byte bitboard, that makes 128KB. Since you need two such tables, the demand this places on L2-space is just a bit heavier than you would like. (Unless, perhaps, you have a 4MB L2 cache). So it probably helps to use the extra indirection.

by **Michael Sherwin** » 07 Dec 2006, 12:52

Gerd Isenberg wrote:
Michael Sherwin wrote:Hi Gerd,

I've a question concerning the changes that you made. For lookup tables I had a single dimentional array that had variable size subarrays in it, because so many indexs were only 2,3,4,5 bits long and very few are 8 bits. Does going to a two dimentional array with fixed subarray sizes speed up the program enough to out weigh the memory bloat? I have not computed the difference in memory size yet , however, if I had to guess it would be 16 to 1 or better.

Most likely i was wrong with the homogenious two-dimensional array - if you can safe so much memory with Java arrays but two additional indirections. As always - one has to try.

Gerd

Ah, that is what a java array is! I was wondering what I was being given credit for having relegated to the garbage heap!

Anyway an initial rotation of the rank of occ to it's position on the first rank to not only move it, but also to get it out of the way is saving some mask lookups. The rook is starting to get better now as well!

by **Grant Osborne** » 07 Dec 2006, 22:30

Hi

Code: Select all: one = MagicLo[Sqr] * (int32) Occupied; two = MagicHi[Sqr] * (int32) (Occupied>>32); // no shift in 32-bit mode index = (one+two) >> (32-KEYLEN);

This actually works! But the table size will be bigger - at least initially because finding the 32bit magics is taking a long time.

Actually

Code: Select all: index = (one^two) >> (32-KEYLEN);

works better.

Grant

by **H.G.Muller** » 07 Dec 2006, 22:52

Grant Osborne wrote:This actually works!

Of course! :wink:

Is it hard to find just any magic at all, or only for sub-minimal keys (e.g. 11-bit for a corner Rook)?

I would expect it is easier than with a normal 64-bit multiply, because you can align the bits in the low and high half of the bitboard independently, without dragging those in the other half along.

Besides, like you already found, you have more freedom how to combine the final result of the two multiplies (add, sub, xor, perhaps or).

by **Grant Osborne** » 07 Dec 2006, 23:04

When I was previously looking for keys, I developed a sort of 'homing in' technique to find them. This was how I managed to find 11-bit rook keys for some corners. This method does not work when looking for 2 magics to merge together, so I guess that a lot of the keys will be difficult to find.

I've started off by finding some 5/6 bit keys for the bishop e.g.

A8 hi 12882101 lo 4907CF43 6 bit
B8 hi D18D17FD lo 12C7A7F1 5 bit

Rooks are mainly 12 bit for the few I've found.

Grant

by **H.G.Muller** » 07 Dec 2006, 23:51

OK.

I only tried finding keys once (just for fun, because I don't use bitboards in my engine), and I did not have any special algorithm to do it. Just try random numbers until you find one that produces no collisions.

My experience was that keys of the theoretical length or longer were quickly found, while I never succeeded in finding a subminimal key.

I don't see an off-hand reason why it should be more difficult for this prescription.

by **Grant Osborne** » 08 Dec 2006, 00:34

H.G.

The bishop table I can create is 80Kb. The centre 4 squares are 10-bit and the squares surrounding them are 8-bit and the rest 6-bit. Strange. Like you, I expected a smaller table. Even so, could these 32-bit multiplies with a slightly larger table execute faster than Pradu's superfast code?

Grant

by **H.G.Muller** » 08 Dec 2006, 11:31

There must be something wrong with your key generator. I can make better keys by hand!

e.g. for a center square:

Code: Select all: occupancy bits: ........ high word .A...... ..B...E. ...C.D.. ........ low word ...3.1.. ..4...2. ........ ........ hi<<1 A....... .B...E.. ..C.D... B...E... hi<<18 .C.D.... ........ ........ .....C.D hi<<22 ........ ........ ........ B...EC.D hi*0x440002 AC.D.... .B...E.. ..C.D... .3.1.... lo<<10 4...2... ........ ........ ..4...2. lo<<16 ........ ........ ........ .341..2. lo*0x10400 4...2... ........ ........ B341EC2D total = hi*0x440002 ^ lo*0x10400 XC.D2... (X=A^4) .B...E.. ..C.D...

The only collission is between A and 4, and since there is an uncontaminated copy of 4 in b61, this can be unambiguously resolved. So the high-order 9 bits are a unique key.

by **Grant Osborne** » 08 Dec 2006, 12:33

Yes I agree. Generating 2 random 32-bit numbers is not proving to be very effective. I'll try a different method.

Grant

Winboard Forum

Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Who is online