[fpc-devel] The "magic div" algorithm
J. Gareth Moreton
gareth at moreton-family.com
Fri Sep 3 22:12:34 CEST 2021
Div by 3 has a very elegant implementation. In 32-bit, it's
movl $0xAAAAAAAB,%eax
mull Input ; answer is in %edx:%eax
shrl $1,%edx
movl %edx, Result
As Marģers discovered, you can reduce the cycle count and minimise
pipeline stalls by extending to 64-bit and applying the shift to the
reciprocal, since the smaller input domain permits a larger rounding
error before it starts affecting the result:
movq $0x5555555555555556, %rax
mulq Input ; answer is in %rdx:%rax. Since input is a 32-bit
register, the upper 32 bits of its 64-bit counterpart are zero.
movl %edx, Result
Now, 7 (in fact, any value of the form 7 × 2^k, where k ≥ 0) is a bit
different because there's no magic number and shift value that will work
for the entire 32-bit range; you need an extra bit of precision. To
simulate this, the input is added to the high word of the result, as if
there were a 1 in the 33rd bit of the reciprocal.
movl $0x24924925,%eax
mull input
addl input,%edx
rcrl $1,%edx
shrl $2,%edx
movl %edx, Result
If the add operation overflows, you need to use rcrl (rotate with carry)
to cycle the carry bit back into the result as you shift right, then you
can shift right normally to get the result you need.
In 64-bit, this problem vanishes because you can insert that extra bit
of precision as you shift the magic number - this sequence is about
twice as fast, not just due to the reduced number of operations, but the
reduction in pipeline stalls from each operation waiting on the previous
one to complete (i.e. addl input,%edx cannot execute until the
multiplication is finished, rcrl $1,%edx has to wait on addl, shrl
$2,%edx has to wait on rcrl, and movl %edx, Result has to wait on shrl):
movq 0x2492492492492493
mulq input
movl %edx, Result
Modern x86 processors do out-of-order execution and can delay "movl
%edx, Result" until it's actually needed, executing any instructions
after it in parallel while the multiplication completes. Unfortunately
I received some test failures in tmoddiv4 and tmoddiv6 so I'm not out of
the woods yet, and I'll have to see what's going on with those before I
can continue.
Gareth aka. Kit
On 03/09/2021 19:56, Ched via fpc-devel wrote:
> Very interesting, Gareth !
>
> Is the div-by-7 related to 2 to the 3rd ? If yes, is it possible to
> design a div-by-3 with similar magics ?
>
> Cheers, Ched'
>
>
>
> Le 03.09.21 à 15:35, J. Gareth Moreton via fpc-devel a écrit :
>> Hey Marģers,
>>
>> So I've been experimenting with your suggestion, and it looks like a
>> resounding success! I added some new tests to the "bdiv" bench test
>> to see how it performs. 16-bit multiplications don't get improved as
>> well as they could be on x86_64 because the intermediate values are
>> all extended to 32-bit, but you can still see a massive improvement
>> on all the unsigned divisions by 7.
>>
>> I'm currently running the test suite for i386-win32 and x86_64-win64,
>> then will implement the same system for AArch64. Thanks so much for
>> spotting the improvement.
>>
>> Gareth aka. Kit
>>
>> P.S. I still need to work on the signed modulus at some point.
>
>
> _______________________________________________
> fpc-devel maillist - fpc-devel at lists.freepascal.org
> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
>
--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus
More information about the fpc-devel
mailing list