Hi Giuseppe,
yes, your code is faster on the P4, and it could be inlined. _aullshr is a call with three cases. I would prefer a 64-bit shift that takes the shift amount modulo 64, has only two cases, and is available as an inlined intrinsic, rather than paying call/ret overhead. A shared routine also sees possibly "random" shift amounts from different calling contexts, which can make the >= 32 branch harder to predict correctly. On the other hand, inlining a lot of conditional branches may pollute the branch target buffer, so everything has two sides...
_aullshr:
0040E390 80 F9 40 cmp cl,40h
0040E393 73 15 jae RETZERO
0040E395 80 F9 20 cmp cl,20h
0040E398 73 06 jae MORE32
0040E39A 0F AD D0 shrd eax,edx,cl
0040E39D D3 EA shr edx,cl
0040E39F C3 ret
MORE32:
0040E3A0 8B C2 mov eax,edx
0040E3A2 33 D2 xor edx,edx
0040E3A4 80 E1 1F and cl,1Fh
0040E3A7 D3 E8 shr eax,cl
0040E3A9 C3 ret
RETZERO:
0040E3AA 33 C0 xor eax,eax
0040E3AC 33 D2 xor edx,edx
0040E3AE C3 ret
shrd is very slow on the P4, so I guess an inlined shrd replacement like the one you suggest might be faster. But shifts on the P4 are "dead" slow anyway (the shift ALU is located in the MMX unit).
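To make the two-case, modulo-64 variant concrete, here is a rough C sketch that works on the two 32-bit halves explicitly. The names u64_halves and shr64_mod64 are made up for illustration, it is not what any compiler emits, and whether it actually beats _aullshr on a given CPU would have to be measured.

```c
#include <stdint.h>

/* Hypothetical helper type: a 64-bit value kept as two 32-bit halves. */
typedef struct { uint32_t lo, hi; } u64_halves;

/* Logical right shift with the count taken modulo 64.
   Only two cases remain: count >= 32 and count < 32. */
static inline u64_halves shr64_mod64(u64_halves v, unsigned count)
{
    u64_halves r;
    count &= 63;                       /* modulo 64, so no "return zero" case */
    if (count >= 32) {
        r.lo = v.hi >> (count - 32);   /* count - 32 is in 0..31 */
        r.hi = 0;
    } else {
        /* shrd replacement: the bits of hi that fall into lo.
           The two-step left shift keeps every single shift below 32,
           so count == 0 contributes no bits instead of being undefined. */
        r.lo = (v.lo >> count) | ((v.hi << 1) << (31 - count));
        r.hi = v.hi >> count;
    }
    return r;
}
```

With lo = 0x89ABCDEF and hi = 0x01234567, a shift by 4 gives lo = 0x789ABCDE and hi = 0x00123456, and a shift by 64 returns the value unchanged because of the modulo.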
For better readability I prefer anonymous 64/32[2]-bit unions over pointer casts. However, neither method is portable with respect to endianness, so a simple >>/<< 32 is the preferred way to get and set high dwords, and it should not produce any overhead in 32-bit mode.
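A small sketch of the two approaches in C, for comparison. The names qword_parts, high_dword, low_dword and set_high_dword are invented for this example, and the "no overhead" remark assumes a reasonably optimizing compiler.

```c
#include <stdint.h>

/* Union view: readable, but which of d[0]/d[1] is the high dword
   depends on the endianness of the target. */
typedef union {
    uint64_t q;
    uint32_t d[2];
} qword_parts;

/* Shift view: portable, independent of byte order. */
static inline uint32_t high_dword(uint64_t x)
{
    return (uint32_t)(x >> 32);
}

static inline uint32_t low_dword(uint64_t x)
{
    return (uint32_t)x;
}

/* Replace the high dword of x and keep the low dword. */
static inline uint64_t set_high_dword(uint64_t x, uint32_t hi)
{
    return (x & 0xFFFFFFFFu) | ((uint64_t)hi << 32);
}
```

On a 32-bit target the shift by exactly 32 should normally reduce to picking the register that holds the upper half, so no real shift instruction is expected.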
Why not look forward to a better 32/64-bit speedup?
Cheers,
Gerd