[fpc-devel] Division optimisations
Florian Klämpfl
florian at freepascal.org
Fri Sep 10 21:53:22 CEST 2021
Am 10.09.21 um 21:17 schrieb J. Gareth Moreton via fpc-devel:
> Hi everyone,
>
> I'm looking at ways to optimise div and mod, starting with x86 and then
> probably AArch64. The obvious one is attempting to merge "Q := N div D;
> R := N mod D;", where D is a variable (but invariant between the two
> instructions), since DIV returns the quotient in R/EAX and the remainder
> in R/EDX in a single operation, or converting the latter equation to "R
> := N - (Q * D);" if D is a constant.
>
> However, inspired somewhat by "Hacker's Delight", I would like to first
> see if I can optimise the Boolean condition "(X mod C) = 0", where C is
> a constant. By calculating the multiplicative reciprocal of C (it may
> or may not be equal to the 'magic div' constant), you can perform it
> with just a multiplication and a comparison - for example, when dividing
> by 3 and returning the remainder:
>
> mov (numerator),%reg1
> mov $AAAAAAAB,%reg2 { 3 * $AAAAAAAB = 1 (mod 2^32) }
> imul %reg1,%reg2
> cmp $55555555,%reg2 { 2^32 div 3 = $55555555 }
>
> If %reg2 is less than or equal to $55555555, then the numerator is an
> exact multiple of 3, and if it's greater, then it is not an exact
> multiple. The proof for this is explained in Hacker's Delight, but
> relies on the fact that 3 and 2^32 are relatively prime and the exact
> multiples of 3 multiplied by 3's reciprocal modulo 2^32 map onto the
> values 0 to $55555555 (if the divisor is even, which means it's not
> relatively prime to 2^32, you have to do a bit of trickery with a bit
> rotation, but done properly, it's only 1 extra instruction).
>
> I'm trying to think of a way to make this clean and flexible, especially
> where future expansion is concerned. One idea I had was to create a new
> platform-specific node such as "tx86divisible", which takes an integer
> variable (x) and an integer constant (c) and returns True if x mod c =
> 0, and "(X mod C) = 0" code is converted to this node via
> tx86addnode.simplify (the node used for comparisons), so it can be
> quickly converted into the optimal code in pass_generate_code. The
> other option is to do this conversion in pass_generate_code, where a new
> node type isn't required but might be a little trickier to make
> cross-platform... if it's possible to make "tx86divisible" completely
> cross-platform - that is, have an implementation on every target - the
> node conversion code only has to exist in a single place, thus improving
> maintainability.
>
> What do you suggest?
Can't you generate a mul and cmp node in tx86addnode.simplify which
simulates this behavior?
More information about the fpc-devel
mailing list