[fpc-devel] Division optimisations

Fri Sep 10 21:53:22 CEST 2021

Am 10.09.21 um 21:17 schrieb J. Gareth Moreton via fpc-devel:
> Hi everyone,
> 
> I'm looking at ways to optimise div and mod, starting with x86 and then 
> probably AArch64.  The obvious one is attempting to merge "Q := N div D; 
> R := N mod D;", where D is a variable (but invariant between the two 
> instructions), since DIV returns the quotient in R/EAX and the remainder 
> in R/EDX in a single operation, or converting the latter equation to "R 
> := N - (Q * D);" if D is a constant.
> 
> However, inspired somewhat by "Hacker's Delight", I would like to first 
> see if I can optimise the Boolean condition "(X mod C) = 0", where C is 
> a constant.  By calculating the multiplicative reciprocal of C (it may 
> or may not be equal to the 'magic div' constant), you can perform it 
> with just a multiplication and a comparison - for example, when dividing 
> by 3 and returning the remainder:
> 
> mov (numerator),%reg1
> mov $AAAAAAAB,%reg2 { 3 * $AAAAAAAB = 1 (mod 2^32) }
> imul %reg1,%reg2
> cmp $55555555,%reg2 { 2^32 div 3 = $55555555 }
> 
> If %reg2 is less than or equal to $55555555, then the numerator is an 
> exact multiple of 3, and if it's greater, then it is not an exact 
> multiple.  The proof for this is explained in Hacker's Delight, but 
> relies on the fact that 3 and 2^32 are relatively prime and the exact 
> multiples of 3 multiplied by 3's reciprocal modulo 2^32 map onto the 
> values 0 to $55555555 (if the divisor is even, which means it's not 
> relatively prime to 2^32, you have to do a bit of trickery with a bit 
> rotation, but done properly, it's only 1 extra instruction).
> 
> I'm trying to think of a way to make this clean and flexible, especially 
> where future expansion is concerned.  One idea I had was to create a new 
> platform-specific node such as "tx86divisible", which takes an integer 
> variable (x) and an integer constant (c) and returns True if x mod c = 
> 0, and "(X mod C) = 0" code is converted to this node via 
> tx86addnode.simplify (the node used for comparisons), so it can be 
> quickly converted into the optimal code in pass_generate_code.  The 
> other option is to do this conversion in pass_generate_code, where a new 
> node type isn't required but might be a little trickier to make 
> cross-platform... if it's possible to make "tx86divisible" completely 
> cross-platform - that is, have an implementation on every target - the 
> node conversion code only has to exist in a single place, thus improving 
> maintainability.
> 
> What do you suggest?

Can't you generate a mul and cmp node in tx86addnode.simplify  which 
simulates this behavior?