Moderator: Andres Valverde
H.G.Muller wrote:I would have much more use for instructions that rotate a bitboard by n*45 degrees.
Michael Sherwin wrote:If anyone from intel/amd reads here or somebody that knows somebody at intel/amd then they need to get the word out that there should be a new instruction included in future chips. This instruction would be very powerful and useful in any aplication that needs to compress scattered bits into a smaller space or just move bits around.
It would work like this:
Given a source mask and a destination mask, apply the source mask to a bit pattern in a register, so that the first bit common to the source mask and the bit pattern is set in the corresponding position (determined by the destination mask) of a register. So bit 7 of the source mask will be mapped to bit 7 of the destination mask if it exist in the bit pattern.
Please, please, please, please! . Not that I'm begging!
Michael Sherwin wrote:What are some of the things that can be done with a degree rotation instruction?
brd ray
1 0 Rank to the left
2 1
3 2
4 3
5 4
6 5
7 6
63 8 Rank to the right
62 9
61 10
60 11
59 12
58 13
57 14
8 16 File forward
16 17
24 18
32 19
40 20
48 21
56 22
56 24 File backward
48 25
40 26
32 27
24 28
16 29
8 30
9 32 Diagonal left forward
18 33
27 34
36 35
45 36
54 37
63 38
55 40 Diagonal right backward
46 41
37 42
28 43
19 44
10 45
1 46
7 48 Diagonal right forward
14 49
21 50
28 51
35 52
42 53
49 54
57 56 Diagonal left backward
50 57
43 58
36 59
29 60
22 61
15 62
H.G.Muller wrote:Michael Sherwin wrote:What are some of the things that can be done with a degree rotation instruction?
What you could do is rotate the occupation board such that the attack ray you want to calculate becomes contiguous and ordered in the direction low bit to high bit, so that you can use hardware carry propagation to create the attacks without any table lookups.
An even more useful alternative would be an instruction that packed the bits corresponding to various half-rays in the bytes of a (64-bit) word. It would be similar in syntax to a shift instruction, one operand would be the 64-bit source/destination register containing the bitboard, the low-order 6 bits of the second source operand would contain the square number. If that number was (say) D3, E3-H3 would go to bits 0-3, C3-A3 to bits 8-10, D4-D8 to bits 16-20, D2-D1 to bits 24-25, E4-H7 to bits 32-35, C2-B1 to bits 40-41, C4-A6 to bits 48-50 and E2-F1 to bits 56-57. The remaining bits would stay zero.
Applying such an instruction to the occupied bitboard, you could then calculate the attacks of all 8 rays simultaneaously by normal carry propagation, through subtracting 0x0101010101010101LL (similar to the 'occupied^(occupied-2*rooks)' trick. To calculate the attacks for a Rook or Bishop you would subtract 0x01010101 or 0x0101010100000000LL, respectively. Of course you would have the inverse instruction, shuffling the 8 rays back into regular bitboard order (filling all bits not on the rays from the specified square with zeros) to directly give you the attack set.
For collecting key bits the instruction already exists. It is called multiply...
H.G.Muller wrote:To not make the demands on chip real-estate too heavy, I might even settle for an instruction that does the shuffle of the bits in a square independent way. Using a normal rotate instruction (ror) to position the occupied square into the most-significant bit without loss of bits, and then use the new instruction to collect the 8 x 7 bits that could be on the 8 rays (if they had not wrapped around the board edge) in the proper order into the 8 bytes of the word.
This would just require 2 new instructions with a fixed shuffled pattern in its single (64-bit) register operand: ray-to-board (r2b) and its inverse board-to-ray (b2r).
You could then generate e.g. QueenAttacks by rotating the Occupied bitboard right by the Square number, applying a board-to-ray instruction, masking with a square-dependent mask (from a 64-entry table) to kill any bits that went over the edge (using an OR to create edge stops in the 8th (most-significant) bit of each byte), subtracting 0x01010101...., XORing the result with the unsubtracted value, and use ray-to-board and the opposite rotate to create the Attacks bitboard. So 2 rotates, the 2 new instructions, an OR, a SUB and an XOR, 7 instructions in total.
H.G.Muller wrote:I am not sure we couldn't convince CPU builders to add two simple instructions if it helps them to score better on the SpecInt benchmark. Crafty is part of that benchmark.
Michael Sherwin wrote:An excellant idea would be to be able to do simple math on the source data pattern controled by the source mask such that, if for example, you wanted to add 1 to the masked pattern the 1 would be added to the first bit of the source data under the source mask and the carry would go to the next masked bit. This would allow simple iteration through all the possible sets (values) of the masked data.
void enumsets(const BitBoard d)
{
BitBoard n = 0;
do {
do_something_with_set(n);
n = (n-d) & d;
} while (n);
}
Steffan Westcott wrote:Michael Sherwin wrote:An excellant idea would be to be able to do simple math on the source data pattern controled by the source mask such that, if for example, you wanted to add 1 to the masked pattern the 1 would be added to the first bit of the source data under the source mask and the carry would go to the next masked bit. This would allow simple iteration through all the possible sets (values) of the masked data.
Code to do this with simple existing instructions has already been presented to you, perhaps you missed it:
- Code: Select all
void enumsets(const BitBoard d)
{
BitBoard n = 0;
do {
do_something_with_set(n);
n = (n-d) & d;
} while (n);
}
Cheers,
Steffan
Michael Sherwin wrote: As anything can be done in software then the whole point of my idea was to do it faster in hardware. Or did you not grasp that?
H.G.Muller wrote:Would that really be more useful than the board-to-ray and ray-to-board?
How would you use it?
original
0 1 2 3
4 5 6 7
8 9 A B
C D E F
swap bits of bytes (well, nybbles in this example, actually)
0 1 2 3 1st 'byte': unchanged
5 4 7 6 2nd 'byte': bits of duos
A B 8 9 3rd 'byte': duos of nybbles
F E D C 4th 'byte': both
swap bytes of bits (in same instruction!)
_______ 1st bit: unchanged
| _____ 2nd bit: swap neighboring 'bytes'
| | ___ 3rd bit: swap neighboring 'words'
| | | _ 4th bit: swap both
| | | |
v v v v
0 4 8 C
5 1 D 9
A E 2 6
F B 7 3
swap bits of bytes (second instruction)
0 4 8 C 1st 'byte': unchanged
1 5 9 D 2nd 'byte': bits of duos
2 6 A E 3rd 'byte': duos of nybble
3 7 B F 4th 'byte': both
(shuffling bits between bytes is not necessary in the second instruction)
Gerd Isenberg wrote:Michael Sherwin wrote: As anything can be done in software then the whole point of my idea was to do it faster in hardware. Or did you not grasp that?
You really want a suband alu-instruction? Seems like a waste of silicon. We will get population-count and leading zero count in the near future - bsf/bsr has become fast again on new intels. I still vote for a reverse add 2+2=1
Gerd
Return to Programming and Technical Discussions
Users browsing this forum: No registered users and 1 guest