
shift 64 bit

Posted: 23 Aug 2006, 16:24
by Giuseppe Cannella
On a 32-bit processor, the right shift operator (>>) on a 64-bit integer (u64) is very slow.
My code below seems to be faster.

I tried it with Visual C++ on a Pentium 4 processor.

Shift by 7 bits:
u64 shr7(const u64 bits){ // bits >> 7
    unsigned s2 = ((unsigned*)&bits)[1];          // high dword (little-endian layout only)
    u64 x = (((unsigned)bits) >> 7) | (s2 << 25); // new low dword; 25 = 32 - 7
    ((unsigned*)&x)[1] = s2 >> 7;                 // new high dword
    return x;
}


Shift by N bits:
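u64 shrN(const u64 bits, const int N){ // bits >> N, for 0 < N < 64 (needs <assert.h>)
    assert(N > 0 && N < 64); // N == 0 would shift s2 by 32 below, which is undefined
    unsigned s2 = ((unsigned*)&bits)[1]; // high dword (little-endian layout only)
    if (N < 32){
        u64 x = (((unsigned)bits) >> N) | (s2 << (32 - N)); // new low dword
        ((unsigned*)&x)[1] = s2 >> N;                       // new high dword
        return x;
    }
    // N >= 32: only the high dword contributes to the result
    return s2 >> (N - 32);
}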

bye giuseppe

Re: shift 64 bit

Posted: 23 Aug 2006, 19:55
by Gerd Isenberg
Hi Giuseppe,

Yes, your code is faster on the P4, and it could be inlined. _aullshr is a call with three cases. I would prefer a 64-bit shift that takes the shift amount modulo 64, has only two cases, and is available as an inlined intrinsic. That would avoid the call/ret overhead, and it would avoid "random" shift amounts arriving from different call sites, which can make the >= 32 branch harder to predict correctly. On the other hand, inlining a lot of conditional branches may pollute the branch target buffer, so everything has two sides...

Code:
_aullshr:
0040E390 80 F9 40             cmp         cl,40h
0040E393 73 15                jae         RETZERO
0040E395 80 F9 20             cmp         cl,20h
0040E398 73 06                jae         MORE32
0040E39A 0F AD D0             shrd        eax,edx,cl
0040E39D D3 EA                shr         edx,cl
0040E39F C3                   ret
MORE32:
0040E3A0 8B C2                mov         eax,edx
0040E3A2 33 D2                xor         edx,edx
0040E3A4 80 E1 1F             and         cl,1Fh
0040E3A7 D3 E8                shr         eax,cl
0040E3A9 C3                   ret
RETZERO:
0040E3AA 33 C0                xor         eax,eax
0040E3AC 33 D2                xor         edx,edx
0040E3AE C3                   ret
The shrd is very slow on the P4, so I guess an inlined shrd replacement like the one you suggest might be faster. But shifts on the P4 are "dead" slow anyway (the shift ALU is located in the MMX unit).
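
The two-case variant with the shift amount taken modulo 64 could look roughly like the sketch below. It assumes C99 <stdint.h> types; the names shr64_mod64 and shr64_mod64_selfcheck are hypothetical and do not come from any runtime library. The shrd is replaced by two 32-bit shifts and an OR, and there is no return-zero case. Whether a given compiler actually emits shorter code for this on a 32-bit target would have to be measured.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: logical right shift of a 64-bit value, built from
   32-bit operations only. The shift amount is taken modulo 64, so only
   two cases remain: n < 32 and n >= 32. */
static inline uint64_t shr64_mod64(uint64_t bits, unsigned n)
{
    uint32_t lo = (uint32_t)bits;          /* low dword  */
    uint32_t hi = (uint32_t)(bits >> 32);  /* high dword */

    n &= 63;
    if (n < 32) {
        /* (hi << 1) << (31 - n) equals hi << (32 - n) for n in 1..31
           and yields 0 for n == 0, so no 32-bit shift by 32 occurs. */
        lo = (lo >> n) | ((uint32_t)(hi << 1) << (31 - n));
        hi >>= n;
    } else {
        lo = hi >> (n - 32);
        hi = 0;
    }
    return ((uint64_t)hi << 32) | lo;
}

/* Compares the sketch against the built-in operator for every amount. */
static void shr64_mod64_selfcheck(void)
{
    const uint64_t v = 0x123456789ABCDEF0ULL;
    for (unsigned n = 0; n < 64; ++n)
        assert(shr64_mod64(v, n) == (v >> n));
}
```

Calling shr64_mod64_selfcheck() once at startup is a cheap way to confirm that the sketch agrees with the compiler's own >> on the target in question.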

For better readability I prefer anonymous 64/32[2]-bit unions to pointer casts. Neither method is portable with respect to endianness, so a simple >> 32 or << 32 is the preferred way to get and set the high dword, and it should not produce any overhead in 32-bit mode.
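
As a sketch of the shift-based access, here is the shr7 routine from the first post rewritten without pointer casts or unions. The typedef of u64 and the helper names lo32, hi32, make64, shr7_portable and shr7_portable_selfcheck are assumptions made for this example, not part of the original code. The result does not depend on byte order; the claim that the >> 32 and << 32 cost nothing in 32-bit mode depends on the compiler and is not verified here.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;  /* assumption: u64 in this thread is an unsigned 64-bit integer */

/* Hypothetical helpers: get and set the dwords with plain shifts. */
static inline uint32_t lo32(u64 x) { return (uint32_t)x; }
static inline uint32_t hi32(u64 x) { return (uint32_t)(x >> 32); }
static inline u64 make64(uint32_t hi, uint32_t lo) { return ((u64)hi << 32) | lo; }

/* bits >> 7, composed from the two dwords without pointer casts */
static inline u64 shr7_portable(u64 bits)
{
    uint32_t hi = hi32(bits);
    uint32_t lo = (lo32(bits) >> 7) | (hi << 25);  /* 25 = 32 - 7 */
    return make64(hi >> 7, lo);
}

/* Compares the sketch against the built-in operator on a few values. */
static void shr7_portable_selfcheck(void)
{
    const u64 samples[] = { 0, 1, 0x80ULL, 0x123456789ABCDEF0ULL,
                            0xFFFFFFFFFFFFFFFFULL };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        assert(shr7_portable(samples[i]) == (samples[i] >> 7));
}
```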

Why not look forward to a better 32/64-bit speedup? ;-)

Cheers,
Gerd