[fpc-devel] Division optimisations
J. Gareth Moreton
gareth at moreton-family.com
Fri Sep 10 22:59:43 CEST 2021
I suppose in truth, I can, and that in itself is probably fairly
cross-platform (although I'll stick with x86 for the moment and get that
working). Sometimes the simple solution eludes me! Is there anything I
need to take into account when it comes to range checking (that is, if a
third party tries to compile a unit with range checking enabled), since
"numerator * $AAAAAAAB" when constrained to 32 bits will almost always
overflow?
Gareth aka. Kit
On 10/09/2021 20:53, Florian Klämpfl via fpc-devel wrote:
> Am 10.09.21 um 21:17 schrieb J. Gareth Moreton via fpc-devel:
>> Hi everyone,
>>
>> I'm looking at ways to optimise div and mod, starting with x86 and
>> then probably AArch64. The obvious one is attempting to merge "Q :=
>> N div D; R := N mod D;", where D is a variable (but invariant between
>> the two instructions), since DIV returns the quotient in R/EAX and
>> the remainder in R/EDX in a single operation, or converting the
>> latter equation to "R := N - (Q * D);" if D is a constant.
>>
>> However, inspired somewhat by "Hacker's Delight", I would like to
>> first see if I can optimise the Boolean condition "(X mod C) = 0",
>> where C is a constant. By calculating the multiplicative reciprocal
>> of C (it may or may not be equal to the 'magic div' constant), you
>> can perform it with just a multiplication and a comparison - for
>> example, when dividing by 3 and returning the remainder:
>>
>> mov (numerator),%reg1
>> mov $AAAAAAAB,%reg2 { 3 * $AAAAAAAB = 1 (mod 2^32) }
>> imul %reg1,%reg2
>> cmp $55555555,%reg2 { 2^32 div 3 = $55555555 }
>>
>> If %reg2 is less than or equal to $55555555, then the numerator is an
>> exact multiple of 3, and if it's greater, then it is not an exact
>> multiple. The proof for this is explained in Hacker's Delight, but
>> relies on the fact that 3 and 2^32 are relatively prime and the exact
>> multiples of 3 multiplied by 3's reciprocal modulo 2^32 map onto the
>> values 0 to $55555555 (if the divisor is even, which means it's not
>> relatively prime to 2^32, you have to do a bit of trickery with a bit
>> rotation, but done properly, it's only 1 extra instruction).
>>
>> I'm trying to think of a way to make this clean and flexible,
>> especially where future expansion is concerned. One idea I had was
>> to create a new platform-specific node such as "tx86divisible", which
>> takes an integer variable (x) and an integer constant (c) and returns
>> True if x mod c = 0, and "(X mod C) = 0" code is converted to this
>> node via tx86addnode.simplify (the node used for comparisons), so it
>> can be quickly converted into the optimal code in
>> pass_generate_code. The other option is to do this conversion in
>> pass_generate_code, where a new node type isn't required but might be
>> a little trickier to make cross-platform... if it's possible to make
>> "tx86divisible" completely cross-platform - that is, have an
>> implementation on every target - the node conversion code only has to
>> exist in a single place, thus improving maintainability.
>>
>> What do you suggest?
>
> Can't you generate a mul and cmp node in tx86addnode.simplify which
> simulates this behavior?
> _______________________________________________
> fpc-devel maillist - fpc-devel at lists.freepascal.org
> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
>
--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus
More information about the fpc-devel
mailing list