Winboard Forum

by **Michael Sherwin** » 03 Dec 2006, 02:05

Code: Select all: u08 hiBishopShifts[] = { 35, 37, 38, 32, 32, 33, 33, 33, 35, 35, 37, 38, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 35, 37, 33, 33, 33, 33, 33, 33, 33, 37, 33, 33, 33, 33, 33, 35, 33, 37, 33, 33, 33, 33, 33, 33, 35, 35}; uo8 loBishopShifts[] = { 16, 17, 19, 19, 17, 17, 18, 16, 16, 16, 16, 17, 18, 18, 16, 17, 17, 17, 17, 17, 17, 17, 17, 17, 16, 16, 16, 16, 16, 16, 18, 18, 17, 17, 17, 17, 16, 16, 17, 21, 17, 17, 15, 17, 17, 17, 19, 17, 18, 20, 20, 20, 17, 16, 16, 17, 16, 17, 16, 16, 16, 17, 16, 16}; u64 BishopAttacks(u64 occ, u32 sq) { u32 loc; u32 hoc; occ &= bishopBits[sq]; hoc = (u32)(occ >> hiBishopShifts[sq]); loc = ((u32)occ | hoc); loc = (loc & 0xffff) | (loc >> loBishopShifts[sq]); return bishopAttackSet[bishopOffsets[64] + loc]; }

This is the fastest expression of the code, however, it requires an enormous lookup table--many megabytes. Memory is cheap and speed is well, speed! I believe that modern computers will run fastest with the code as is. It is a starting place. An obvious memory saving modification is to split the 11 - 15 bit index loc into two eight bit indexs. Then on 32 bit machines retrive two index fragments and by oring them together form an index into the bishop attack set table. On 64 bit machines just retrieve two possible attack sets and by anding them together the correct attack set is formed!

by **Gerd Isenberg** » 03 Dec 2006, 12:01

Michael Sherwin wrote:
Code: Select all
u08 hiBishopShifts[] = { 35, 37, 38, 32, 32, 33, 33, 33, 35, 35, 37, 38, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 35, 37, 33, 33, 33, 33, 33, 33, 33, 37, 33, 33, 33, 33, 33, 35, 33, 37, 33, 33, 33, 33, 33, 33, 35, 35}; uo8 loBishopShifts[] = { 16, 17, 19, 19, 17, 17, 18, 16, 16, 16, 16, 17, 18, 18, 16, 17, 17, 17, 17, 17, 17, 17, 17, 17, 16, 16, 16, 16, 16, 16, 18, 18, 17, 17, 17, 17, 16, 16, 17, 21, 17, 17, 15, 17, 17, 17, 19, 17, 18, 20, 20, 20, 17, 16, 16, 17, 16, 17, 16, 16, 16, 17, 16, 16}; u64 BishopAttacks(u64 occ, u32 sq) { u32 loc; u32 hoc; occ &= bishopBits[sq]; hoc = (u32)(occ >> hiBishopShifts[sq]); loc = ((u32)occ | hoc); loc = (loc & 0xffff) | (loc >> loBishopShifts[sq]); return bishopAttackSet[bishopOffsets[64] + loc]; }

This is the fastest expression of the code, however, it requires an enormous lookup table--many megabytes. Memory is cheap and speed is well, speed! I believe that modern computers will run fastest with the code as is. It is a starting place. An obvious memory saving modification is to split the 11 - 15 bit index loc into two eight bit indexs. Then on 32 bit machines retrive two index fragments and by oring them together form an index into the bishop attack set table. On 64 bit machines just retrieve two possible attack sets and by anding them together the correct attack set is formed!

This folding down seems an interesting parallel prefix alternative to the mul/shift right approach and reminds me on Walter Faxon's bitscan. May be it is better to use "add" or "sub" instead of "or" - and to eventually provide a final "and" for the index.

Code: Select all: occ &= bishopBits[sq]; hoc = (u32)(occ >> hiBishopShifts[sq]); loc = (u32)occ + hoc; loc = (loc + (loc >> loBishopShifts[sq])) & maskFinalBishopIndex[sq];

Gerd

by **Michael Sherwin** » 03 Dec 2006, 12:52

Hi Gerd,

Thanks for the suggestion. I will give it some study. I am confused over it right now. Maybe after I get some sleep I will be able to comprehend better.

Mike

by **Michael Sherwin** » 04 Dec 2006, 23:04

Gerd Isenberg wrote:
Michael Sherwin wrote:
Code: Select all
u08 hiBishopShifts[] = { 35, 37, 38, 32, 32, 33, 33, 33, 35, 35, 37, 38, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 35, 37, 33, 33, 33, 33, 33, 33, 33, 37, 33, 33, 33, 33, 33, 35, 33, 37, 33, 33, 33, 33, 33, 33, 35, 35}; uo8 loBishopShifts[] = { 16, 17, 19, 19, 17, 17, 18, 16, 16, 16, 16, 17, 18, 18, 16, 17, 17, 17, 17, 17, 17, 17, 17, 17, 16, 16, 16, 16, 16, 16, 18, 18, 17, 17, 17, 17, 16, 16, 17, 21, 17, 17, 15, 17, 17, 17, 19, 17, 18, 20, 20, 20, 17, 16, 16, 17, 16, 17, 16, 16, 16, 17, 16, 16}; u64 BishopAttacks(u64 occ, u32 sq) { u32 loc; u32 hoc; occ &= bishopBits[sq]; hoc = (u32)(occ >> hiBishopShifts[sq]); loc = ((u32)occ | hoc); loc = (loc & 0xffff) | (loc >> loBishopShifts[sq]); return bishopAttackSet[bishopOffsets[64] + loc]; }

This is the fastest expression of the code, however, it requires an enormous lookup table--many megabytes. Memory is cheap and speed is well, speed! I believe that modern computers will run fastest with the code as is. It is a starting place. An obvious memory saving modification is to split the 11 - 15 bit index loc into two eight bit indexs. Then on 32 bit machines retrive two index fragments and by oring them together form an index into the bishop attack set table. On 64 bit machines just retrieve two possible attack sets and by anding them together the correct attack set is formed!

This folding down seems an interesting parallel prefix alternative to the mul/shift right approach and reminds me on Walter Faxon's bitscan. May be it is better to use "add" or "sub" instead of "or" - and to eventually provide a final "and" for the index.
Code: Select all
occ &= bishopBits[sq]; hoc = (u32)(occ >> hiBishopShifts[sq]); loc = (u32)occ + hoc; loc = (loc + (loc >> loBishopShifts[sq])) & maskFinalBishopIndex[sq];

Gerd

Hi Gerd,

I think that I understand your suggestion now!

1. Use '+' when possible to not over burden the logical engine of the processor and adding

may just be faster anyway!?

2. Do a final masking to get rid of any spurious bits!?

I could be wrong, but, I did select the shift values carefully to avoid any spurious bits. I

will double check my logic to see if that is true.

Here anyway, is the rook attack set getter and a new bishop ... getter that is designed after

the rook ... getter.

I will add the data to drive them and an explaination when I get a chance.

Code: Select all: u64 RookAttacks(u64 occ, u32 sq) { u32 index; occ &= rookBits[sq]; index = (u32)((occ & rankMask[sq]) >> rankShifts[sq]); index += (u32)((occ & hiFileMask[sq]) >> hiFileShifts[sq]); index += ((u32) ooc & loFileMask[sq]); index += ((index & 0xffff0000) >> loFileShifts[sq]); return (rookAttacks[rookOffsets[sq] + index]); } u64 BishopAttacks(u64 occ, u32 sq) { u32 index; occ &= bishopBits[sq]; index = (u32)(occ) + (u32)(occ >> hiBishopShifts[sq]) + ((u32)occ & 0xffff0000) >> loBishopShifts[sq]); return (bishopAttacks[bishopOffsets[sq] + index]); }

Hi all,

These are baseline functions to be used as a starting point. All indexs generated are

contained in 16 bits or less, however, they still require huge amounts of ram for the attack

tables. Memory compression techniques will have to be added if you want to minimise memory

usage. I have thought of at least a dozen ways to greatly reduce memory usage with only a

very slight slowdown of the algorithm. I did not want to cloud the basic algorithm by including

any compression methods in the code. However, I have mentioned several already in this

thread--some of wich may actually work. I hope that programmers will like what I have already

done and want to contribute some good compression ideas. Of course, if I am deluded and my

code is crap, I am sure that someone will let me know--I welcome that!

Thanks,
Mike

by **Gerd Isenberg** » 05 Dec 2006, 00:55

If you fold several masked rays down by shift-or, you cannot distinguish whether a 1 in the result comes from 1|0, 0|1 or 1|1, which might be possible with add/sub. If you look to Pradu's constants you'll find almost factors with four or five bit set. Two parallel-prefix shifts

Code: Select all: x += x >> A(sq) x += x >> B(sq)

might work like a multiplication with a factor with four one bits and final shift right. Therefor the vague idea to get similar table sizes as Pradu and to finally mask with (1<<N(sq))-1 with N = 7..13 (but not 16) depending on the bits needed for each individual square.

Gerd

by **Michael Sherwin** » 05 Dec 2006, 01:39

Gerd Isenberg wrote:If you fold several masked rays down by shift-or, you cannot distinguish whether a 1 in the result comes from 1|0, 0|1 or 1|1, which might be possible with add/sub. If you look to Pradu's constants you'll find almost factors with four or five bit set. Two parallel-prefix shifts
Code: Select all
x += x >> A(sq) x += x >> B(sq)
might work like a multiplication with a factor with four one bits and final shift right. Therefor the vague idea to get similar table sizes as Pradu and to finally mask with (1<<N(sq))-1 with N = 7..13 (but not 16) depending on the bits needed for each individual square.

Gerd

Hi Gerd,

Something else to ponder over for several hours untill I can understand it!

It seems that you are talking about collisions in the folding down process.

I checked every square by hand for both the bishop and the rook and every bit is preserved and maps to an exact bit in the index. However, each square needs its own subtable as indexs for squares are not unique for the entire set of squares. Here is one compression technique to shrink the tables down to a fraction of what is needed. Split the index in half into an 8 bit high and an 8 bit low. Use each index to retrieve a possible bitboard based on which bits are set, i.e. they do not preclude certain bits from being part of the attack set, so those bits are all included in the possible attack set. Do the same for the other 8 bit partial index and then & the two possible attack sets together and only the bits that are possible in both sets remain. I have not tallied the size of the subtables needed for each square in each high and low table, however, worst case is:
all 8 bits needed for each index 256 * 2 (some partial indexs are only two bits!)
There are 64 squares 256 * 2 * 64
8 bytes per attack set 256 * 2 * 64 * 8
two types of pieces B and R 256 * 2 * 64 * 8 * 2 = 524.288 bytes for the attack tables--worst case. The true size will be only a small fraction as very few indexs are 8 bits long, many are 3,4,5.6 bits long, requireing very small subtables.

Okay using plus to combine shiftted patterns bothers me a little, but if there are no bit collisions then there are no carries and plus should work like | or even ^. I will switch back to | and only then try + after it is working!

Mike

by **Michael Sherwin** » 05 Dec 2006, 03:07

Well that was one of my denser moments. :mrgreen:

I thought that the bit collisions was a criticism. Now I see that it is the "vauge idea" at memory compression do to bit propagaton through the use of addition and that you were responding to my plea for compression ideas. Thanks! So after all the bits have been shifted into place a custom addition may be performed on them to propagate a few bits downward, if they exist. Therefore the shift patterns can be selected that will bunch most of the bits toward the bottom, leaving a few non adjacent bits straggleing higher up.
Then if those bits are actually present they can be propagated downward by adding 1 or more adjacent bits!

Or you could retrieve two indexs with the two partial high and low indexs that when combined point to a perfect table of minimum size.

And both methods could be combined to create extreamly small data requirements.

You know, they should have a foot in mouth smiley--I would be using that one a lot!

by **Pradu** » 05 Dec 2006, 13:09

I have a couple of doubts on this piece of code:

Michael Sherwin wrote:
Code: Select all
loc = ((u32)occ | hoc);

+ and | won't work when collisions in low/high-occupancy happen because one looses required information. For example take rook on B1 and we only take the cases of when all occupancy bits are zero and we only have bits B2 and B6 variable. Both these will collide on B2. So this means that you will get the same attack bitboard either when B2 or B6 are 1 which is certainly not the case.

It is necessary that every occupancy bit has a different map on the index. One way to achieve this is to have each occupancy bit on it's own square but it is possible for 1 bit be related to multiple bits in the index, thereby creating possibilities to have say 12 bits mapping to 9.

In any case the method will not work as it is using either adds or ors. A fix would be to use multiple variable shifts to the occupancy ((low<< or >>something) | (high<< or >>something)). This can even work for rooks but you might need to variable shift multiple times ( low + (high>>x) + (high>>y) ). The partial tables are also a nice idea and this is a [64][2^16] (8MB) and you can also extract the actual partial-attack bitboards from these tables instead of indexing bits.

by **Michael Sherwin** » 05 Dec 2006, 18:15

Hi Pradu,

Fantastic, that you are here!

Pradu wrote:I have a couple of doubts on this piece of code:

Michael Sherwin wrote:
Code: Select all
loc = ((u32)occ | hoc);

+ and | won't work when collisions in low/high-occupancy happen because one looses required information. For example take rook on B1 and we only take the cases of when all occupancy bits are zero and we only have bits B2 and B6 variable. Both these will collide on B2. So this means that you will get the same attack bitboard either when B2 or B6 are 1 which is certainly not the case.

loc = ((u32)occ | hoc);
This code was for bishops after hoc had been shifted to the opposite color square. Bishop code has been changed anyway.

For a rook on b1 the rook code does this

Get the rank from occ and shift it to a1
get hi bits - b5,b6,b7 from occ and shift to a2
get a3,b3,a4,b4 from index and shift to f1

See, no collisions!

It is necessary that every occupancy bit has a different map on the index. One way to achieve this is to have each occupancy bit on it's own square but it is possible for 1 bit be related to multiple bits in the index, thereby creating possibilities to have say 12 bits mapping to 9.

This never happens!

In any case the method will not work as it is using either adds or ors. A fix would be to use multiple variable shifts to the occupancy ((low<< or >>something) | (high<< or >>something)). This can even work for rooks but you might need to variable shift multiple times ( low + (high>>x) + (high>>y) ). The partial tables are also a nice idea and this is a [64][2^16] (8MB) and you can also extract the actual partial-attack bitboards from these tables instead of indexing bits.

The method works. Everything maps cleanly.

Your magic can also split the index, I think, and achive much smaller tables!

The split indexs are only 8 bits each maximum. [64][2 ^ 8] !

I hope that you hang around for a while!

Mike

by **Michael Sherwin** » 05 Dec 2006, 20:09

I did not realize that negative shifts could not be done.

Why not intel? :evil:

Motion is relative to the observer, so if the the observer is going right faster than the object being observed then the object appears to the observer to be shifting it's position negativly right. Seems like a valid concept to me.

Anyway the only two negative shifts needed were for a rook on g1 & h1 as the ranks for those squares with this code needed a negative shift.

This code change requires no negative shifts.

Code: Select all: u64 RookAttacks(u64 occ, u32 sq) { u32 index; occ &= rookBits[sq]; index = (u32)((occ & rankMask[sq]) >> ((sq & ~7) + 1); index += (u32)((occ & hiFileMask[sq]) >> hiFileShifts[sq]); index += ((u32) ooc & loFileMask[sq]) >> loFileShifts[sq]; index += ((index & 0xffff0000) >> loQuadShifts[sq]); return (rookAttacks[rookOffsets[sq] + index]); }

I am going to look for some way to reduce this code.

by **H.G.Muller** » 05 Dec 2006, 20:45

Michael Sherwin wrote:This is the fastest expression of the code, however, it requires an enormous lookup table--many megabytes. Memory is cheap and speed is well, speed!

I believe you don't have your priorities right here. Memory is cheap in dollars, but memory is outrageously expensive in terms of access time. So big tables, that do not fit in the cache (at least L2) at all times, are killing. Better to do a hundred adds, ors, shifts or multiplies than a single access that misses the cache. Better to do 200... My 1.3 GHz Pentium-M, which, by todays standards is a very slow machine, takes ~286 clocks for a DRAM access, and it can do 2 instructions per clock. A 2GHz Core 2 does 3 instructions per clock, and might require 440 clocks, since the memory hasn't got much faster. That table lookup wastes the opportunity to do 1320 instructons!

Big tables are an absolute nono in modern computing. The best way to speed up a program by a factor 10 is to reduce its memory usage and improve its cache hit rate.

by **Michael Sherwin** » 05 Dec 2006, 23:20

H.G.Muller wrote:
Michael Sherwin wrote:This is the fastest expression of the code, however, it requires an enormous lookup table--many megabytes. Memory is cheap and speed is well, speed!

I believe you don't have your priorities right here. Memory is cheap in dollars, but memory is outrageously expensive in terms of access time. So big tables, that do not fit in the cache (at least L2) at all times, are killing. Better to do a hundred adds, ors, shifts or multiplies than a single access that misses the cache. Better to do 200... My 1.3 GHz Pentium-M, which, by todays standards is a very slow machine, takes ~286 clocks for a DRAM access, and it can do 2 instructions per clock. A 2GHz Core 2 does 3 instructions per clock, and might require 440 clocks, since the memory hasn't got much faster. That table lookup wastes the opportunity to do 1320 instructons!

Big tables are an absolute nono in modern computing. The best way to speed up a program by a factor 10 is to reduce its memory usage and improve its cache hit rate.

Hi H.G.,

Good thing that the compression I am working on is really good!

The theory of table sizes and cash hits is paradoxial. It all depends on where and when a program samples huge tables called the 'sampleing pattern'. If the sampling is random and evenly distributed across the entire memory space then what you point out is the 'rule rather than the exception'. On the other hand if the sampleing is clumped and only rarely samples outside of those clumps then what you say is the 'exception rather than the rule'. If a program 'clumps' well then increasing the memory size makes very little difference. Lucky for us chess programmers the position on the board changes very little at each node, so chess programs tend to 'clump' very well. When pieces come off the board then even the 'clumps' are reduced in number. Huge tables to allow faster code will actually speed up the endgame over smaller tables and slower code. One very huge table that does not 'clump' very well in a chess program is the position hash as the accesses tend to be distributed all over the place.

Mike

by **H.G.Muller** » 05 Dec 2006, 23:33

Positon hash is indeed a disaster. I get better performance if I store the end leaves in a separate small table of 128 KB, then when I store them in the main hash. Despite the fact that this reduces the number of hash hits by a factor 2. Doubling the size of this table to 256KB only slows things down. I tried using a bigger tabel with 'hash-key focusing' to cause better clumping, but so far I could not make that work.

by **Michael Sherwin** » 06 Dec 2006, 00:47

H.G.Muller wrote:Positon hash is indeed a disaster. I get better performance if I store the end leaves in a separate small table of 128 KB, then when I store them in the main hash. Despite the fact that this reduces the number of hash hits by a factor 2. Doubling the size of this table to 256KB only slows things down. I tried using a bigger tabel with 'hash-key focusing' to cause better clumping, but so far I could not make that work.

What if for the last two ply there was no hash at all, but a pure vector table of all two move possibilities? That way there would be a lot of clumping for the last two most expensive ply. And the main hash can be made smaller!

Edit: I guess that this wont work, because it would take a different table for every depth 3 position. However, since the depth 3 move is known it would still catch endleaf transpositions from depth 3.

by **Michael Sherwin** » 06 Dec 2006, 12:45

Michael Sherwin wrote:
Code: Select all
u08 hiBishopShifts[] = { 35, 37, 38, 32, 32, 33, 33, 33, 35, 35, 37, 38, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 35, 37, 33, 33, 33, 33, 33, 33, 33, 37, 33, 33, 33, 33, 33, 35, 33, 37, 33, 33, 33, 33, 33, 33, 35, 35}; uo8 loBishopShifts[] = { 16, 17, 19, 19, 17, 17, 18, 16, 16, 16, 16, 17, 18, 18, 16, 17, 17, 17, 17, 17, 17, 17, 17, 17, 16, 16, 16, 16, 16, 16, 18, 18, 17, 17, 17, 17, 16, 16, 17, 21, 17, 17, 15, 17, 17, 17, 19, 17, 18, 20, 20, 20, 17, 16, 16, 17, 16, 17, 16, 16, 16, 17, 16, 16}; u64 BishopAttacks(u64 occ, u32 sq) { u32 loc; u32 hoc; occ &= bishopBits[sq]; hoc = (u32)(occ >> hiBishopShifts[sq]); loc = ((u32)occ | hoc); loc = (loc & 0xffff) | (loc >> loBishopShifts[sq]); return bishopAttackSet[bishopOffsets[64] + loc]; }

This is the fastest expression of the code, however, it requires an enormous lookup table--many megabytes. Memory is cheap and speed is well, speed! I believe that modern computers will run fastest with the code as is. It is a starting place. An obvious memory saving modification is to split the 11 - 15 bit index loc into two eight bit indexs. Then on 32 bit machines retrive two index fragments and by oring them together form an index into the bishop attack set table. On 64 bit machines just retrieve two possible attack sets and by anding them together the correct attack set is formed!

Code: Select all: u64 BishopAttacks(u64 occ, u32 sq) { u32 index; u64 bb1; u64 bb2; occ &= bishopBits[sq]; index = (u32)(occ) + (u32)(occ >> 33) index += (index & 0xffff0000) >> loBishopShifts[sq]); bb1 = (loBishopAttacks[loBishopOffsets[sq] + (index & 0xff)]; bb2 = (hiBishopAttacks[hiBishopOffsets[sq] + (index >> 8)]; return (bb1 & bb2); }

This will be the code for the first beta test.

It shifts the upper 4 ranks down to the lower 4 ranks and right 1 bit to avoid collisions. Then the 2nd and 3rd ranks down by a lookup shift. Thus two (8 - unused bits to the left) indexs are created that access two differant lookup tables of possible bits in the attack set. The two attack sets are then anded together to form the real attack set.

Note: Edge squares for the bishop can never be considered to be blockers.

This is why the anding works:

Lets say that a bishop on d4 is totally blocked by just 4 bits in the pattern of blockers. Now imagine 1 bit being reppresented in the one index and the other 3 bits in the other index.

Then bb1 may look like this:

Code: Select all: . . . . . . . x x . . . . . x . . x . . . x . . . . x . x . . . . . . . . . . . . . x . x . . . . . . . . x . . . . . . . . x .

And bb2 would look like this:

Code: Select all: . . . . . . . . . . . . . . . . . . . . . . . . . . x . x . . . . . . . . . . . . . x . x . . . . x . . . . . . x . . . . . . .

bb1 & bb2 is then:

Code: Select all: . . . . . . . . . . . . . . . . . . . . . . . . . . x . x . . . . . . . . . . . . . x . x . . . . . . . . . . . . . . . . . . . . .

Which is the correct attack set to return.

In the next few days I will do the initialization and testing of the bishop and report the results.

by **H.G.Muller** » 06 Dec 2006, 13:52

It seems to me you have to do

Code: Select all: index = (u32)(occ) + (u32)(occ >> 33); index += ((u32)index & 0xffff0000) >> loBishopShifts[sq]);

or you would lose the upper 16 bits.

Doing the lookup in two parts seems a very competitive way of doing it. I am still not convinced why you would prefer the repeated shifting and adding over a multiply. Most CPUs have only one ALU that can do shifts (and it is the same ALU that can do multiplý). So what do you hope to gain over

Code: Select all: index = (u32)(occ * BishopMagic[sq] >> loBishopShifts[sq]);

?

You could still apply the split lookup. Such a split lookup might even work for generating the complete attack board for a Queen, since the squares that require a large index for a Rook (the corners) require a small index for the Bishop, and vice versa. The worst are the center squares which have 9 Bishop blockers and 10 Rook blockers, so 19 in total, and you could split that into a 10 and a 9-bit index. With powerfull-enough magic you might be lucky (since only 4 squares need that large an index) and have 9+9 for any square. Even without such luck, minimizing the tables per square would only lead to 38.5k entries (for Q!, compare this to 100k entries for R with unsplit lookup).

by **Michael Sherwin** » 06 Dec 2006, 20:01

H.G.Muller wrote:It seems to me you have to do
Code: Select all
index = (u32)(occ) + (u32)(occ >> 33); index += ((u32)index & 0xffff0000) >> loBishopShifts[sq]);

or you would lose the upper 16 bits.

Yes, I botched the code by copying it from the wrong paper!
I will edit it to the correct code. Sorry! :mrgreen:

Here is how the code was intended to be:

Code: Select all: u64 BishopAttacks(u64 occ, u32 sq) { u32 index; u64 bb1; u64 bb2; occ &= bishopBits[sq]; index = (u32)(occ) + (u32)(occ >> 33) index += (index & 0xffff0000) >> loBishopShifts[sq]); bb1 = (loBishopAttacks[loBishopOffsets[sq] + (index & 0xff)]; bb2 = (hiBishopAttacks[hiBishopOffsets[sq] + (index >> 8)]; return (bb1 & bb2); }

Oh! It's just like you said. Except there is no need for the (u32) type cast as index is already u32.

Doing the lookup in two parts seems a very competitive way of doing it.

Thanks! It is a small price to pay for such small tables!

I am still not convinced why you would prefer the repeated shifting and adding over a multiply. Most CPUs have only one ALU that can do shifts (and it is the same ALU that can do multiplý). So what do you hope to gain over
Code: Select all
index = (u32)(occ * BishopMagic[sq] >> loBishopShifts[sq]);

?

Gerd has said over and over again about how slow 64 bit multiply is on a 32 bit machine. My whole reason to do it this way was to avoid that multiplication. What about shifting the upper half into the lower half first, followed by a 32 bit multiply second. This might reduce the work done in the rook code even more. Still I have read in every paper on optimization that multiple shifts and adds to do multiplication is faster than using imul. What has changed?

You could still apply the split lookup. Such a split lookup might even work for generating the complete attack board for a Queen, since the squares that require a large index for a Rook (the corners) require a small index for the Bishop, and vice versa. The worst are the center squares which have 9 Bishop blockers and 10 Rook blockers, so 19 in total, and you could split that into a 10 and a 9-bit index. With powerfull-enough magic you might be lucky (since only 4 squares need that large an index) and have 9+9 for any square. Even without such luck, minimizing the tables per square would only lead to 38.5k entries (for Q!, compare this to 100k entries for R with unsplit lookup).

Under the right circumstances doing just one getter for the queen and extracting the rook or bishop might be awsomely fast!

by **Gerd Isenberg** » 06 Dec 2006, 20:30

To avoid 64-bit shifts for 32-bit mode, i suggest:

u32 loc = (u32) occ;
u32 hoc = (u32)(occ >> 32); // not a shift in 32-bit mode
u32 index = loc + (hoc >> 1);
...

Yes, 64-bit imul is a rather expensive call in 32-bit mode, with conditions and three 32-bit imuls/mul. See:

25112 Rev. 3.06 September 2005 Software Optimization Guide for AMD64 Processors

General 64-Bit Optimizations page 61

Example 2
To perform the low-order half of the product of two 64-bit integers using 32-bit registers, a procedure such as the following is necessary:
; In: [ESP+8]:[ESP+4] = multiplicand
; [ESP+16]:[ESP+12] = multiplier
; Out: EDX:EAX = (multiplicand * multiplier) % 2^64
; Destroys: EAX, ECX, EDX, EFlags
llmul PROC
mov edx, [esp+8] ; multiplicand_hi
mov ecx, [esp+16] ; multiplier_hi
or edx, ecx ; One operand >= 2^32?
mov edx, [esp+12] ; multiplier_lo
mov eax, [esp+4] ; multiplicand_lo
jnz twomul ; Yes, need two multiplies.
mul edx ; multiplicand_lo * multiplier_lo
ret ; Done, return to caller.
twomul:
imul edx, [esp+8] ; p3_lo = multiplicand_hi * multiplier_lo
imul ecx, eax ; p2_lo = multiplier_hi * multiplicand_lo
add ecx, edx ; p2_lo + p3_lo
mul dword ptr [esp+12] ; p1 = multiplicand_lo * multiplier_lo
add edx, ecx ; p1 + p2_lo + p3_lo = result in EDX:EAX
ret ; Done, return to caller.
llmul ENDP

.............

Integer Optimizations Page 164

8.2 Alternative Code for Multiplying by a Constant
....
There are improvements in the AMD Athlon 64 and AMD Opteron processors’ multiplier over that of previous x86 processors. For this reason, when doing 32-bit multiplication, only use the alternative sequence if the alternative sequence has a latency that is less than or equal to 2 cycles. For 64-bit multiplication, only use the alternative sequence if the alternative sequence has a latency that is less than or equal to 3 cycles.
Examples
by 2: add reg1, reg1 ; 1 cycle
by 3: lea reg1, [reg1+reg1*2] ; 2 cycles
by 4: shl reg1, 2 ; 1 cycle
by 5: lea reg1, [reg1+reg1*4] ; 2 cycles

Gerd

by **Michael Sherwin** » 06 Dec 2006, 20:57

Thanks Gerd,

It keeps getting better and better for the bishop! Still the rook is requireing 4 shifts. Would it be better to do one shift followed by a 32 bit multiply and then just using the indexs from the upper 32 bits with out shifting them back down?

Mike

by **H.G.Muller** » 06 Dec 2006, 21:16

Michael Sherwin wrote:Still I have read in every paper on optimization that multiple shifts and adds to do multiplication is faster than using imul. What has changed?

I don't think anything has changed. Integer multiply is only a slow instruction on the Pentium 4 (where shifts are even slower...), which no one uses for chess. See, for instance, here.

On other Pentiums (starting with the Pentium Pro) IMUL has a throughput of one per clock, and a latency of 4. (On the Core 2 the latency is reduced to 3.) The latency is usually of little concern, since the instruction is fully pipe-lined and other instructions execute in parallel. For shifts both latency and throughput are 1, but in the sequence of shift, add, shift, add every instruction is dependent on the result of the previous one, so that the latencies simply add. The total latency is thus also 4, but you can execute fewer instructions in parallel, because it was 4 instructions in stead of a single IMUL.

It is true that a full 64 x 64-bit multiply requires 3 multiplies in 32-bit mode, but if the purpose is not to do an exact multiply but merely to shuffle scattered occupancy bits to the high-order bits, you can actually do with two 32 x 32-bit multiplies:

Code: Select all: one = MagicLo[Sqr] * (int32) Occupied; two = MagicHi[Sqr] * (int32) (Occupied>>32); // no shift in 32-bit mode index = (one+two) >> (32-KEYLEN);

This is just 3 instructions with a total latency of 7 (since the two IMULs can't start in the same clock due to throughput restrictions). But note that 64-bit shifts and ADDs/ORs also have to be implemented by two instructions in 32-bit mode. So it is very questionable if you can win much.

Anyway, 32-bit mode is passée, now that everyone has K8 or Core 2...

Winboard Forum

Sherwin Index II, The Sequal

Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Re: Sherwin Index II, The Sequal

Who is online