[fpc-devel] Division optimisations

J. Gareth Moreton gareth at moreton-family.com
Fri Sep 10 21:17:42 CEST 2021


Hi everyone,

I'm looking at ways to optimise div and mod, starting with x86 and then 
probably AArch64.  The obvious one is attempting to merge "Q := N div D; 
R := N mod D;", where D is a variable (but invariant between the two 
instructions), since DIV returns the quotient in R/EAX and the remainder 
in R/EDX in a single operation, or converting the latter equation to "R 
:= N - (Q * D);" if D is a constant.

However, inspired somewhat by "Hacker's Delight", I would like to first 
see if I can optimise the Boolean condition "(X mod C) = 0", where C is 
a constant.  By calculating the multiplicative reciprocal of C (it may 
or may not be equal to the 'magic div' constant), you can perform it 
with just a multiplication and a comparison - for example, when dividing 
by 3 and returning the remainder:

mov (numerator),%reg1
mov $AAAAAAAB,%reg2 { 3 * $AAAAAAAB = 1 (mod 2^32) }
imul %reg1,%reg2
cmp $55555555,%reg2 { 2^32 div 3 = $55555555 }

If %reg2 is less than or equal to $55555555, then the numerator is an 
exact multiple of 3, and if it's greater, then it is not an exact 
multiple.  The proof for this is explained in Hacker's Delight, but 
relies on the fact that 3 and 2^32 are relatively prime and the exact 
multiples of 3 multiplied by 3's reciprocal modulo 2^32 map onto the 
values 0 to $55555555 (if the divisor is even, which means it's not 
relatively prime to 2^32, you have to do a bit of trickery with a bit 
rotation, but done properly, it's only 1 extra instruction).

I'm trying to think of a way to make this clean and flexible, especially 
where future expansion is concerned.  One idea I had was to create a new 
platform-specific node such as "tx86divisible", which takes an integer 
variable (x) and an integer constant (c) and returns True if x mod c = 
0, and "(X mod C) = 0" code is converted to this node via 
tx86addnode.simplify (the node used for comparisons), so it can be 
quickly converted into the optimal code in pass_generate_code.  The 
other option is to do this conversion in pass_generate_code, where a new 
node type isn't required but might be a little trickier to make 
cross-platform... if it's possible to make "tx86divisible" completely 
cross-platform - that is, have an implementation on every target - the 
node conversion code only has to exist in a single place, thus improving 
maintainability.

What do you suggest?

Gareth aka. Kit


-- 
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus



More information about the fpc-devel mailing list