[fpc-devel] Division optimisations

Fri Sep 10 22:59:43 CEST 2021

I suppose in truth, I can, and that in itself is probably fairly 
cross-platform (although I'll stick with x86 for the moment and get that 
working).  Sometimes the simple solution eludes me!  Is there anything I 
need to take into account when it comes to range checking (that is, if a 
third party tries to compile a unit with range checking enabled), since 
"numerator * $AAAAAAAB" when constrained to 32 bits will almost always 
overflow?

Gareth aka. Kit

On 10/09/2021 20:53, Florian Klämpfl via fpc-devel wrote:
> Am 10.09.21 um 21:17 schrieb J. Gareth Moreton via fpc-devel:
>> Hi everyone,
>>
>> I'm looking at ways to optimise div and mod, starting with x86 and 
>> then probably AArch64.  The obvious one is attempting to merge "Q := 
>> N div D; R := N mod D;", where D is a variable (but invariant between 
>> the two instructions), since DIV returns the quotient in R/EAX and 
>> the remainder in R/EDX in a single operation, or converting the 
>> latter equation to "R := N - (Q * D);" if D is a constant.
>>
>> However, inspired somewhat by "Hacker's Delight", I would like to 
>> first see if I can optimise the Boolean condition "(X mod C) = 0", 
>> where C is a constant.  By calculating the multiplicative reciprocal 
>> of C (it may or may not be equal to the 'magic div' constant), you 
>> can perform it with just a multiplication and a comparison - for 
>> example, when dividing by 3 and returning the remainder:
>>
>> mov (numerator),%reg1
>> mov $AAAAAAAB,%reg2 { 3 * $AAAAAAAB = 1 (mod 2^32) }
>> imul %reg1,%reg2
>> cmp $55555555,%reg2 { 2^32 div 3 = $55555555 }
>>
>> If %reg2 is less than or equal to $55555555, then the numerator is an 
>> exact multiple of 3, and if it's greater, then it is not an exact 
>> multiple.  The proof for this is explained in Hacker's Delight, but 
>> relies on the fact that 3 and 2^32 are relatively prime and the exact 
>> multiples of 3 multiplied by 3's reciprocal modulo 2^32 map onto the 
>> values 0 to $55555555 (if the divisor is even, which means it's not 
>> relatively prime to 2^32, you have to do a bit of trickery with a bit 
>> rotation, but done properly, it's only 1 extra instruction).
>>
>> I'm trying to think of a way to make this clean and flexible, 
>> especially where future expansion is concerned.  One idea I had was 
>> to create a new platform-specific node such as "tx86divisible", which 
>> takes an integer variable (x) and an integer constant (c) and returns 
>> True if x mod c = 0, and "(X mod C) = 0" code is converted to this 
>> node via tx86addnode.simplify (the node used for comparisons), so it 
>> can be quickly converted into the optimal code in 
>> pass_generate_code.  The other option is to do this conversion in 
>> pass_generate_code, where a new node type isn't required but might be 
>> a little trickier to make cross-platform... if it's possible to make 
>> "tx86divisible" completely cross-platform - that is, have an 
>> implementation on every target - the node conversion code only has to 
>> exist in a single place, thus improving maintainability.
>>
>> What do you suggest?
>
> Can't you generate a mul and cmp node in tx86addnode.simplify which 
> simulates this behavior?
> _______________________________________________
> fpc-devel maillist  -  fpc-devel at lists.freepascal.org
> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
>

-- 
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus