[fpc-devel] Division optimisations
J. Gareth Moreton
gareth at moreton-family.com
Sun Sep 12 07:22:29 CEST 2021
So I've got some pretty good headway so far!
Trunk:
Unsigned 32-bit (n mod 3) = 0 - Pass - average iteration
duration: 0.757 ns
Signed 32-bit (n mod 3) = 0 - Pass - average iteration
duration: 6.403 ns
Unsigned 32-bit (n mod 10) = 0 - Pass - average iteration
duration: 0.698 ns
Signed 32-bit (n mod 10) = 0 - Pass - average iteration
duration: 6.461 ns
Unsigned 32-bit (n mod 100) = 0 - Pass - average iteration
duration: 0.931 ns
Signed 32-bit (n mod 100) = 0 - Pass - average iteration
duration: 6.286 ns
Unsigned 32-bit (n mod 400) = 0 - Pass - average iteration
duration: 0.990 ns
Unsigned 32-bit (n mod 1,000) = 0 - Pass - average iteration
duration: 1.048 ns
Unsigned 64-bit (n mod 3) = 0 - Pass - average iteration
duration: 0.698 ns
Signed 64-bit (n mod 3) = 0 - Pass - average iteration
duration: 6.403 ns
Unsigned 64-bit (n mod 10) = 0 - Pass - average iteration
duration: 0.757 ns
Signed 64-bit (n mod 10,000) = 0 - Pass - average iteration
duration: 6.403 ns
Unsigned 64-bit (n mod 100) = 0 - Pass - average iteration
duration: 0.990 ns
Signed 64-bit (n mod 86,400,000) = 0 - Pass - average iteration
duration: 6.286 ns
Unsigned 64-bit (n mod 1,000,000,000) = 0 - Pass - average iteration
duration: 0.990 ns
New algorithm:
Unsigned 32-bit (n mod 3) = 0 - Pass - average iteration
duration: 0.524 ns
Signed 32-bit (n mod 3) = 0 - Pass - average iteration
duration: 0.640 ns
Unsigned 32-bit (n mod 10) = 0 - Pass - average iteration
duration: 0.698 ns
Signed 32-bit (n mod 10) = 0 - Pass - average iteration
duration: 0.815 ns
Unsigned 32-bit (n mod 100) = 0 - Pass - average iteration
duration: 0.640 ns
Signed 32-bit (n mod 100) = 0 - Pass - average iteration
duration: 0.640 ns
Unsigned 32-bit (n mod 400) = 0 - Pass - average iteration
duration: 0.582 ns
Unsigned 32-bit (n mod 1,000) = 0 - Pass - average iteration
duration: 0.582 ns
Unsigned 64-bit (n mod 3) = 0 - Pass - average iteration
duration: 0.640 ns
Signed 64-bit (n mod 3) = 0 - Pass - average iteration
duration: 0.815 ns
Unsigned 64-bit (n mod 10) = 0 - Pass - average iteration
duration: 0.815 ns
Signed 64-bit (n mod 10,000) = 0 - Pass - average iteration
duration: 0.873 ns
Unsigned 64-bit (n mod 100) = 0 - Pass - average iteration
duration: 0.815 ns
Signed 64-bit (n mod 86,400,000) = 0 - Pass - average iteration
duration: 0.873 ns
Unsigned 64-bit (n mod 1,000,000,000) = 0 - Pass - average iteration
duration: 0.757 ns
I tend to shave off a few fractions of a nanosecond for unsigned modulus
operations, while signed modulus operations, which still use IDIV
internally due to some awkwardness with how moduli are calculated (which
still needs to be resolved), the saving is absolutely massive.
64-bit is a little slower than 32-bit possibly because the code size is
quite large due to there being 3 different 64-bit constants that need to
be loaded (for 32-bit and under, these constants can be directly encoded
in the individual instructions as immediates). Since the 3rd constant
is just the 2nd one having been bit-shifted, a better approach would be
to temporarily store the second constant in a register and then shift it
at the same time as another mathematical operation (thus using two ALU
ports to execute them simultaneously). As specified in Hacker's
Delight, this is also the recommended approach for RISC processors such
as AArch64 where encoding 64-bit constants takes up to 4 instructions.
This, however, would require the use of a tempref, something I'm still
researching.
Under x86_64, the use of a shift instead of a load could be done using a
peephole optimisation. For example, the code generated for signed
64-bit "(n mod 3) = 0" is currently:
movq %rax,%r8
movq $-6148914691236517205,%r11
imulq %r11,%r8
movq $3074457345618258602,%r11
addq %r11,%r8
movq $6148914691236517204,%r11
cmpq %r11,%r8
With a small addition to DeepMOVOpt, the peephole optimizer could easily
spot that 6148914691236517204 is exactly double 3074457345618258602, and
change the second mov instruction to "shlq $1,%r11", which only requires
3 bytes to store (4 if the shift is something other than 1), compared to
"movq $6148914691236517204,%r11" which requires 10. As long as the
original value is used at some point (in this case, via "addq
%r11,%r8"), it takes the same number of cycles to execute.
I'm waiting until my last division patches are uploaded before opening
an issue because I'm adding quite a few new tests to tests/bench/bdiv
and I want to minimise merge conflicts. That and I still need to test
how range checks affect the compilation, since the internal
multiplications deliberately overflow.
Gareth aka. Kit
On 10/09/2021 21:59, J. Gareth Moreton via fpc-devel wrote:
> I suppose in truth, I can, and that in itself is probably fairly
> cross-platform (although I'll stick with x86 for the moment and get
> that working). Sometimes the simple solution eludes me! Is there
> anything I need to take into account when it comes to range checking
> (that is, if a third party tries to compile a unit with range checking
> enabled), since "numerator * $AAAAAAAB" when constrained to 32 bits
> will almost always overflow?
>
> Gareth aka. Kit
>
> On 10/09/2021 20:53, Florian Klämpfl via fpc-devel wrote:
>> Am 10.09.21 um 21:17 schrieb J. Gareth Moreton via fpc-devel:
>>> Hi everyone,
>>>
>>> I'm looking at ways to optimise div and mod, starting with x86 and
>>> then probably AArch64. The obvious one is attempting to merge "Q :=
>>> N div D; R := N mod D;", where D is a variable (but invariant
>>> between the two instructions), since DIV returns the quotient in
>>> R/EAX and the remainder in R/EDX in a single operation, or
>>> converting the latter equation to "R := N - (Q * D);" if D is a
>>> constant.
>>>
>>> However, inspired somewhat by "Hacker's Delight", I would like to
>>> first see if I can optimise the Boolean condition "(X mod C) = 0",
>>> where C is a constant. By calculating the multiplicative reciprocal
>>> of C (it may or may not be equal to the 'magic div' constant), you
>>> can perform it with just a multiplication and a comparison - for
>>> example, when dividing by 3 and returning the remainder:
>>>
>>> mov (numerator),%reg1
>>> mov $AAAAAAAB,%reg2 { 3 * $AAAAAAAB = 1 (mod 2^32) }
>>> imul %reg1,%reg2
>>> cmp $55555555,%reg2 { 2^32 div 3 = $55555555 }
>>>
>>> If %reg2 is less than or equal to $55555555, then the numerator is
>>> an exact multiple of 3, and if it's greater, then it is not an exact
>>> multiple. The proof for this is explained in Hacker's Delight, but
>>> relies on the fact that 3 and 2^32 are relatively prime and the
>>> exact multiples of 3 multiplied by 3's reciprocal modulo 2^32 map
>>> onto the values 0 to $55555555 (if the divisor is even, which means
>>> it's not relatively prime to 2^32, you have to do a bit of trickery
>>> with a bit rotation, but done properly, it's only 1 extra instruction).
>>>
>>> I'm trying to think of a way to make this clean and flexible,
>>> especially where future expansion is concerned. One idea I had was
>>> to create a new platform-specific node such as "tx86divisible",
>>> which takes an integer variable (x) and an integer constant (c) and
>>> returns True if x mod c = 0, and "(X mod C) = 0" code is converted
>>> to this node via tx86addnode.simplify (the node used for
>>> comparisons), so it can be quickly converted into the optimal code
>>> in pass_generate_code. The other option is to do this conversion in
>>> pass_generate_code, where a new node type isn't required but might
>>> be a little trickier to make cross-platform... if it's possible to
>>> make "tx86divisible" completely cross-platform - that is, have an
>>> implementation on every target - the node conversion code only has
>>> to exist in a single place, thus improving maintainability.
>>>
>>> What do you suggest?
>>
>> Can't you generate a mul and cmp node in tx86addnode.simplify which
>> simulates this behavior?
>> _______________________________________________
>> fpc-devel maillist - fpc-devel at lists.freepascal.org
>> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
>>
>
--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus
More information about the fpc-devel
mailing list