[fpc-devel] The "magic div" algorithm

Fri Sep 3 22:12:34 CEST 2021

Div by 3 has a very elegant implementation.  In 32-bit, it's

movl    $0xAAAAAAAB,%eax
mull    Input ; answer is in %edx:%eax
shrl    $1,%edx
movl    %edx, Result

As Marģers discovered, you can reduce the cycle count and minimise 
pipeline stalls by extending to 64-bit and applying the shift to the 
reciprocal, since the smaller input domain permits a larger rounding 
error before it starts affecting the result:

movq    $0x5555555555555556, %rax
mulq    Input ; answer is in %rdx:%rax.  Since input is a 32-bit 
register, the upper 32 bits of its 64-bit counterpart are zero.
movl    %edx, Result

Now, 7 (in fact, any value of the form 7 × 2^k, where k ≥ 0) is a bit 
different because there's no magic number and shift value that will work 
for the entire 32-bit range; you need an extra bit of precision.  To 
simulate this, the input is added to the high word of the result, as if 
there were a 1 in the 33rd bit of the reciprocal.

movl    $0x24924925,%eax
mull    input
addl    input,%edx
rcrl    $1,%edx
shrl    $2,%edx
movl    %edx, Result

If the add operation overflows, you need to use rcrl (rotate with carry) 
to cycle the carry bit back into the result as you shift right, then you 
can shift right normally to get the result you need.

In 64-bit, this problem vanishes because you can insert that extra bit 
of precision as you shift the magic number - this sequence is about 
twice as fast, not just due to the reduced number of operations, but the 
reduction in pipeline stalls from each operation waiting on the previous 
one to complete (i.e. addl input,%edx cannot execute until the 
multiplication is finished, rcrl $1,%edx has to wait on addl, shrl 
$2,%edx has to wait on rcrl, and movl %edx, Result has to wait on shrl):

movq    0x2492492492492493
mulq   input
movl    %edx, Result

Modern x86 processors do out-of-order execution and can delay "movl 
%edx, Result" until it's actually needed, executing any instructions 
after it in parallel while the multiplication completes.  Unfortunately 
I received some test failures in tmoddiv4 and tmoddiv6 so I'm not out of 
the woods yet, and I'll have to see what's going on with those before I 
can continue.

Gareth aka. Kit

On 03/09/2021 19:56, Ched via fpc-devel wrote:
> Very interesting, Gareth !
>
> Is the div-by-7 related to 2 to the 3rd ? If yes, is it possible to 
> design a div-by-3 with similar magics ?
>
> Cheers, Ched'
>
>
>
> Le 03.09.21 à 15:35, J. Gareth Moreton via fpc-devel a écrit :
>> Hey Marģers,
>>
>> So I've been experimenting with your suggestion, and it looks like a 
>> resounding success!  I added some new tests to the "bdiv" bench test 
>> to see how it performs.  16-bit multiplications don't get improved as 
>> well as they could be on x86_64 because the intermediate values are 
>> all extended to 32-bit, but you can still see a massive improvement 
>> on all the unsigned divisions by 7.
>>
>> I'm currently running the test suite for i386-win32 and x86_64-win64, 
>> then will implement the same system for AArch64. Thanks so much for 
>> spotting the improvement.
>>
>> Gareth aka. Kit
>>
>> P.S. I still need to work on the signed modulus at some point.
>
>
> _______________________________________________
> fpc-devel maillist  -  fpc-devel at lists.freepascal.org
> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
>

-- 
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus