[fpc-devel] Division optimisations
J. Gareth Moreton
gareth at moreton-family.com
Fri Sep 10 21:17:42 CEST 2021
Hi everyone,
I'm looking at ways to optimise div and mod, starting with x86 and then
probably AArch64. The obvious one is attempting to merge "Q := N div D;
R := N mod D;", where D is a variable (but invariant between the two
instructions), since DIV returns the quotient in R/EAX and the remainder
in R/EDX in a single operation, or converting the latter equation to "R
:= N - (Q * D);" if D is a constant.
However, inspired somewhat by "Hacker's Delight", I would like to first
see if I can optimise the Boolean condition "(X mod C) = 0", where C is
a constant. By calculating the multiplicative reciprocal of C (it may
or may not be equal to the 'magic div' constant), you can perform it
with just a multiplication and a comparison - for example, when dividing
by 3 and returning the remainder:
mov (numerator),%reg1
mov $AAAAAAAB,%reg2 { 3 * $AAAAAAAB = 1 (mod 2^32) }
imul %reg1,%reg2
cmp $55555555,%reg2 { 2^32 div 3 = $55555555 }
If %reg2 is less than or equal to $55555555, then the numerator is an
exact multiple of 3, and if it's greater, then it is not an exact
multiple. The proof for this is explained in Hacker's Delight, but
relies on the fact that 3 and 2^32 are relatively prime and the exact
multiples of 3 multiplied by 3's reciprocal modulo 2^32 map onto the
values 0 to $55555555 (if the divisor is even, which means it's not
relatively prime to 2^32, you have to do a bit of trickery with a bit
rotation, but done properly, it's only 1 extra instruction).
I'm trying to think of a way to make this clean and flexible, especially
where future expansion is concerned. One idea I had was to create a new
platform-specific node such as "tx86divisible", which takes an integer
variable (x) and an integer constant (c) and returns True if x mod c =
0, and "(X mod C) = 0" code is converted to this node via
tx86addnode.simplify (the node used for comparisons), so it can be
quickly converted into the optimal code in pass_generate_code. The
other option is to do this conversion in pass_generate_code, where a new
node type isn't required but might be a little trickier to make
cross-platform... if it's possible to make "tx86divisible" completely
cross-platform - that is, have an implementation on every target - the
node conversion code only has to exist in a single place, thus improving
maintainability.
What do you suggest?
Gareth aka. Kit
--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus
More information about the fpc-devel
mailing list