Winboard Forum

by **Michael Sherwin** » 05 Dec 2006, 23:51

If anyone from intel/amd reads here or somebody that knows somebody at intel/amd then they need to get the word out that there should be a new instruction included in future chips. This instruction would be very powerful and useful in any aplication that needs to compress scattered bits into a smaller space or just move bits around.

It would work like this:

Given a source mask and a destination mask, apply the source mask to a bit pattern in a register, so that the first bit common to the source mask and the bit pattern is set in the corresponding position (determined by the destination mask) of a register. So bit 7 of the source mask will be mapped to bit 7 of the destination mask if it exist in the bit pattern.

Please, please, please, please!

. Not that I'm begging! :mrgreen:

by **H.G.Muller** » 06 Dec 2006, 00:13

I would have much more use for instructions that rotate a bitboard by n*45 degrees.

by **Michael Sherwin** » 06 Dec 2006, 01:39

H.G.Muller wrote:I would have much more use for instructions that rotate a bitboard by n*45 degrees.

Are you sure? With what I suggested an index to a lookup table containg an attack set would take very few machine instructions and would produce a perfectly compressed index. I bet people could think up a hundred uses for such an instruction other than creating an index. Bits would be able to be shifted seperatly, some could be shifted up while others are shifted down at the same time as well as mapped in all kinds of crazy ways and they could be expanded as well as compressed. A sequence of such instructions could do massive amounts of parallel processing of smaller data types!

What are some of the things that can be done with a degree rotation instruction?

by **Michael Sherwin** » 06 Dec 2006, 08:38

Michael Sherwin wrote:If anyone from intel/amd reads here or somebody that knows somebody at intel/amd then they need to get the word out that there should be a new instruction included in future chips. This instruction would be very powerful and useful in any aplication that needs to compress scattered bits into a smaller space or just move bits around.

It would work like this:

Given a source mask and a destination mask, apply the source mask to a bit pattern in a register, so that the first bit common to the source mask and the bit pattern is set in the corresponding position (determined by the destination mask) of a register. So bit 7 of the source mask will be mapped to bit 7 of the destination mask if it exist in the bit pattern.

Please, please, please, please! . Not that I'm begging!

An excellant idea would be to be able to do simple math on the source data pattern controled by the source mask such that, if for example, you wanted to add 1 to the masked pattern the 1 would be added to the first bit of the source data under the source mask and the carry would go to the next masked bit. This would allow simple iteration through all the possible sets (values) of the masked data. Also an add mask could be created that would add one to every bit in the pattern under the add mask.

If anyone of any influence is reading this and can see the power of this computing paradigm then they need to get the ball rolling on this. The next generation single core chip that would result would be more powerful at doing many things than this generations Quad core!

Maybe a non contiguous bit coprosessor can be created first and integrated in the future. Graphics card manufacturers may also want this ability. Just make sure that us chess programmers have access to it!

So people of this forum, we need to enumerate all the useful things that this idea can do, create a monster thread and then send it all to intel and/or amd.

Or everyone could just ignor me.

by **H.G.Muller** » 06 Dec 2006, 10:20

Michael Sherwin wrote:What are some of the things that can be done with a degree rotation instruction?

What you could do is rotate the occupation board such that the attack ray you want to calculate becomes contiguous and ordered in the direction low bit to high bit, so that you can use hardware carry propagation to create the attacks without any table lookups.

An even more useful alternative would be an instruction that packed the bits corresponding to various half-rays in the bytes of a (64-bit) word. It would be similar in syntax to a shift instruction, one operand would be the 64-bit source/destination register containing the bitboard, the low-order 6 bits of the second source operand would contain the square number. If that number was (say) D3, E3-H3 would go to bits 0-3, C3-A3 to bits 8-10, D4-D8 to bits 16-20, D2-D1 to bits 24-25, E4-H7 to bits 32-35, C2-B1 to bits 40-41, C4-A6 to bits 48-50 and E2-F1 to bits 56-57. The remaining bits would stay zero.

Applying such an instruction to the occupied bitboard, you could then calculate the attacks of all 8 rays simultaneaously by normal carry propagation, through subtracting 0x0101010101010101LL (similar to the 'occupied^(occupied-2*rooks)' trick. To calculate the attacks for a Rook or Bishop you would subtract 0x01010101 or 0x0101010100000000LL, respectively. Of course you would have the inverse instruction, shuffling the 8 rays back into regular bitboard order (filling all bits not on the rays from the specified square with zeros) to directly give you the attack set.

For collecting key bits the instruction already exists. It is called multiply...

by **H.G.Muller** » 06 Dec 2006, 11:35

To not make the demands on chip real-estate too heavy, I might even settle for an instruction that does the shuffle of the bits in a square-independent way. Using a normal rotate instruction (ror) to position the occupied square into the most-significant bit without loss of bits, and then use the new instruction to collect the 8 x 7 bits that could be on the 8 rays (if they had not wrapped around the board edge) in the proper order into the 8 bytes of the word.

This would just require 2 new instructions with a fixed shuffled pattern in its single (64-bit) register operand: ray-to-board (r2b) and its inverse board-to-ray (b2r).

You could then generate e.g. QueenAttacks by rotating the Occupied bitboard right by the Square number, applying a board-to-ray instruction, masking with a square-dependent mask (from a 64-entry table) to kill any bits that went over the edge (using an OR to create edge stops in the 8th (most-significant) bit of each byte), subtracting 0x01010101...., XORing the result with the unsubtracted value, and use ray-to-board and the opposite rotate to create the Attacks bitboard. So 2 rotates, the 2 new instructions, an OR, a SUB and an XOR, 7 instructions in total.

The exact mapping of the new instructions would be:

Code: Select all: brd ray 1 0 Rank to the left 2 1 3 2 4 3 5 4 6 5 7 6 63 8 Rank to the right 62 9 61 10 60 11 59 12 58 13 57 14 8 16 File forward 16 17 24 18 32 19 40 20 48 21 56 22 56 24 File backward 48 25 40 26 32 27 24 28 16 29 8 30 9 32 Diagonal left forward 18 33 27 34 36 35 45 36 54 37 63 38 55 40 Diagonal right backward 46 41 37 42 28 43 19 44 10 45 1 46 7 48 Diagonal right forward 14 49 21 50 28 51 35 52 42 53 49 54 57 56 Diagonal left backward 50 57 43 58 36 59 29 60 22 61 15 62

Note that the mapping is not 1-to-1, and some bits might have two destinations (in board-to-ray) or two sources (in ray-to-board). With two sources the result bit would simply be the OR of the two source bits. This would not create any ambiguity, since the Attacks, once generated in ray form, would not contain any over-the-edge bits.

That a certain square (e.g. 7) seems to be on 2 rays to board-to-ray is no problem: one of the two paths reaching it will always cross the board edge, and will be masked off (to a one). Which one of the two that is depends on the location of the square, so the mask can know that.

by **Michael Sherwin** » 06 Dec 2006, 11:43

H.G.Muller wrote:
Michael Sherwin wrote:What are some of the things that can be done with a degree rotation instruction?

What you could do is rotate the occupation board such that the attack ray you want to calculate becomes contiguous and ordered in the direction low bit to high bit, so that you can use hardware carry propagation to create the attacks without any table lookups.

An even more useful alternative would be an instruction that packed the bits corresponding to various half-rays in the bytes of a (64-bit) word. It would be similar in syntax to a shift instruction, one operand would be the 64-bit source/destination register containing the bitboard, the low-order 6 bits of the second source operand would contain the square number. If that number was (say) D3, E3-H3 would go to bits 0-3, C3-A3 to bits 8-10, D4-D8 to bits 16-20, D2-D1 to bits 24-25, E4-H7 to bits 32-35, C2-B1 to bits 40-41, C4-A6 to bits 48-50 and E2-F1 to bits 56-57. The remaining bits would stay zero.

Applying such an instruction to the occupied bitboard, you could then calculate the attacks of all 8 rays simultaneaously by normal carry propagation, through subtracting 0x0101010101010101LL (similar to the 'occupied^(occupied-2*rooks)' trick. To calculate the attacks for a Rook or Bishop you would subtract 0x01010101 or 0x0101010100000000LL, respectively. Of course you would have the inverse instruction, shuffling the 8 rays back into regular bitboard order (filling all bits not on the rays from the specified square with zeros) to directly give you the attack set.

For collecting key bits the instruction already exists. It is called multiply...

Very nice for chess!

However, if it can not apply to other domains other than chess and chess like games then the chip manufacturers won't add it to their chips. What other types of applications can benifit from those instructions?

by **Michael Sherwin** » 06 Dec 2006, 11:54

H.G.Muller wrote:To not make the demands on chip real-estate too heavy, I might even settle for an instruction that does the shuffle of the bits in a square independent way. Using a normal rotate instruction (ror) to position the occupied square into the most-significant bit without loss of bits, and then use the new instruction to collect the 8 x 7 bits that could be on the 8 rays (if they had not wrapped around the board edge) in the proper order into the 8 bytes of the word.

This would just require 2 new instructions with a fixed shuffled pattern in its single (64-bit) register operand: ray-to-board (r2b) and its inverse board-to-ray (b2r).

You could then generate e.g. QueenAttacks by rotating the Occupied bitboard right by the Square number, applying a board-to-ray instruction, masking with a square-dependent mask (from a 64-entry table) to kill any bits that went over the edge (using an OR to create edge stops in the 8th (most-significant) bit of each byte), subtracting 0x01010101...., XORing the result with the unsubtracted value, and use ray-to-board and the opposite rotate to create the Attacks bitboard. So 2 rotates, the 2 new instructions, an OR, a SUB and an XOR, 7 instructions in total.

What we have both proposed would surely allow the next generation single core chip to run circles around todays quad core chips. I do not see a realestate problem in that respect. Even if a quad core had but one specialized bit operations coprossor built in, the extra realestate added would be minimal compared to that of the 4 cores!

by **H.G.Muller** » 06 Dec 2006, 12:20

I am not sure we couldn't convince CPU builders to add two simple instructions if it helps them to score better on the SpecInt benchmark. Crafty is part of that benchmark.

by **Michael Sherwin** » 06 Dec 2006, 12:54

H.G.Muller wrote:I am not sure we couldn't convince CPU builders to add two simple instructions if it helps them to score better on the SpecInt benchmark. Crafty is part of that benchmark.

Cool!

by **Gian-Carlo Pascutto** » 08 Dec 2006, 17:17

You would also need a new compiler that is smart enough to use the new instructions.

BTW. The latest SPEC contains Sjeng, but not Crafty.

by **Steffan Westcott** » 10 Dec 2006, 01:08

Michael Sherwin wrote:An excellant idea would be to be able to do simple math on the source data pattern controled by the source mask such that, if for example, you wanted to add 1 to the masked pattern the 1 would be added to the first bit of the source data under the source mask and the carry would go to the next masked bit. This would allow simple iteration through all the possible sets (values) of the masked data.

Code to do this with simple existing instructions has already been presented to you, perhaps you missed it:

Code: Select all: void enumsets(const BitBoard d) { BitBoard n = 0; do { do_something_with_set(n); n = (n-d) & d; } while (n); }

Cheers,
Steffan

by **Michael Sherwin** » 10 Dec 2006, 07:17

Steffan Westcott wrote:
Michael Sherwin wrote:An excellant idea would be to be able to do simple math on the source data pattern controled by the source mask such that, if for example, you wanted to add 1 to the masked pattern the 1 would be added to the first bit of the source data under the source mask and the carry would go to the next masked bit. This would allow simple iteration through all the possible sets (values) of the masked data.

Code to do this with simple existing instructions has already been presented to you, perhaps you missed it:
Code: Select all
void enumsets(const BitBoard d) { BitBoard n = 0; do { do_something_with_set(n); n = (n-d) & d; } while (n); }

Cheers,
Steffan

As anything can be done in software then the whole point of my idea was to do it faster in hardware. Or did you not grasp that?

Cheers,
Mike

by **Gerd Isenberg** » 10 Dec 2006, 10:35

Michael Sherwin wrote: As anything can be done in software then the whole point of my idea was to do it faster in hardware. Or did you not grasp that?

You really want a suband alu-instruction? Seems like a waste of silicon. We will get population-count and leading zero count in the near future - bsf/bsr has become fast again on new intels. I still vote for a reverse add 2+2=1 ;-)

Gerd

by **H.G.Muller** » 10 Dec 2006, 10:57

Would that really be more useful than the board-to-ray and ray-to-board?

How would you use it?

by **Gerd Isenberg** » 10 Dec 2006, 11:16

H.G.Muller wrote:Would that really be more useful than the board-to-ray and ray-to-board?

How would you use it?

Board-to-ray and ray-to-board would be nice, but seems quite restricted to the domain of chess or some other board games, while reverse add/sub might be more general for bit-twiddling tricks based to mask/reset msb instead of lsb.

by **H.G.Muller** » 10 Dec 2006, 12:47

Reverse add also seems not very general, and is rather expensive to add: you would basically need a separate adder for that.

You could achieve nearly the same by a reverse instruction, to reverse the bit-order before you add, with much less new hardware. If we are going for generality, we might best ask for that, as part of an entire series of bit-shuffle instructions. Or, preferably, a programmable bit-shuffle instruction, where a second operand specifies what kind of shuffle we want.

How about a 'barrel swapper', similar to a barrel shifter. In the latter, the n-th bit of the second operand (the shift count) controls if you shift by 2^n or not, and it is usually implemented as a sequence of two-input multiplexers that are wired to either transmit the unshifted input, or the input shifted (or rotated) by 2^n. (In practice they probably implement it as two 8-input multiplexers in tandem, because I can't imagine that you could propagate through the 6 multiplexers needed to do any shift from 0 to 63 in a single clock cycle.) In the barrel swapper you would use the similar strategy, but in stead of shifting by 2^n bit (or not) you would swap the two halfs of a subunit of 2^(n+1) bits (or not). So if the swap selector would be 63, you would completely reverse the word, if it was 7, you would only reverse the bits in each byte, if it was 56 you would reverse the byte order, etc.

Of course it would be a pity to only use 6 bits out of the available 64 in the swap-selector operand (as the shifts do). Without any additional hardware you could control the multiplexers that perform the swap over a certain distance not for the full width of the word, but in sub-sections. E.g. in stead of having a single bit that controls if bit 2*i and 2*i+1 are swapped for every i=0-31 you could have 8 bits that independently control it for i=0-3, i=4-7, ... i=28-31, so that you can independently shuffle all eight bytes in a word. Similarly, for swap distances of a byte or more, you could independently specify if that swap should occur for each bit in the byte.

This should provide very flexible bit manipulation.

by **Gerd Isenberg** » 10 Dec 2006, 13:20

While my idea was in the sense to don't discriminate carry propagation in different directions, your 'barrel swapper' is really a serious proposal.

by **H.G.Muller** » 10 Dec 2006, 13:28

If I am not mistaking, you can mirror a bitboard in the diagonal with two such instructions: (At least, for 4x4, I was too lazy to write it out for 8x8, but I believe it follows through induction. :wink:

)

Code: Select all: original 0 1 2 3 4 5 6 7 8 9 A B C D E F swap bits of bytes (well, nybbles in this example, actually) 0 1 2 3 1st 'byte': unchanged 5 4 7 6 2nd 'byte': bits of duos A B 8 9 3rd 'byte': duos of nybbles F E D C 4th 'byte': both swap bytes of bits (in same instruction!) _______ 1st bit: unchanged | _____ 2nd bit: swap neighboring 'bytes' | | ___ 3rd bit: swap neighboring 'words' | | | _ 4th bit: swap both | | | | v v v v 0 4 8 C 5 1 D 9 A E 2 6 F B 7 3 swap bits of bytes (second instruction) 0 4 8 C 1st 'byte': unchanged 1 5 9 D 2nd 'byte': bits of duos 2 6 A E 3rd 'byte': duos of nybble 3 7 B F 4th 'byte': both (shuffling bits between bytes is not necessary in the second instruction)

Horizontal and vertical reflections go with a single instruction, (actually, hallf such an instruction

), as does 180-degree rotation. 90-degree rotations are just variants of the diagonal mirror, and also take two instructions.

From a hardware point of view the instructions would not be any more complex than normal shifts. (Just wired differently. Perhaps a clever design could even combine the existing barrel shifter with the barrel swapper, because half of those are doing the same thing.) So this should be an instruction with latency 1, and a throughput of 1 per cycle.

by **Michael Sherwin** » 10 Dec 2006, 18:39

Gerd Isenberg wrote:
Michael Sherwin wrote: As anything can be done in software then the whole point of my idea was to do it faster in hardware. Or did you not grasp that?

You really want a suband alu-instruction? Seems like a waste of silicon. We will get population-count and leading zero count in the near future - bsf/bsr has become fast again on new intels. I still vote for a reverse add 2+2=1

Gerd

There was a whole lot more to my proposal than the one simple example that Westcott decided to pick on and was not even the main focus of my idea. The main focus of my idea was a noncintiguous bit sub processor under control of source mask and destination mask. A much bigger waste of silicon! :wink:

Winboard Forum

New instruction that intel/amd should add

New instruction that intel/amd should add

Re: New instruction that intel/amd should add

Re: New instruction that intel/amd should add

Re: New instruction that intel/amd should add

Re: New instruction that intel/amd should add

Re: New instruction that intel/amd should add

Re: New instruction that intel/amd should add

Re: New instruction that intel/amd should add

Re: New instruction that intel/amd should add

Re: New instruction that intel/amd should add

Re: New instruction that intel/amd should add

Re: New instruction that intel/amd should add

Re: New instruction that intel/amd should add

Re: New instruction that intel/amd should add

Re: New instruction that intel/amd should add

Re: New instruction that intel/amd should add

Re: New instruction that intel/amd should add

Re: New instruction that intel/amd should add

Re: New instruction that intel/amd should add

Re: New instruction that intel/amd should add

Who is online