[fpc-devel] Division optimisations

J. Gareth Moreton gareth at moreton-family.com
Sun Sep 12 07:22:29 CEST 2021


So I've made some pretty good headway so far!

Trunk:

               Unsigned 32-bit (n mod 3) = 0 - Pass - average iteration duration: 0.757 ns
                 Signed 32-bit (n mod 3) = 0 - Pass - average iteration duration: 6.403 ns
              Unsigned 32-bit (n mod 10) = 0 - Pass - average iteration duration: 0.698 ns
                Signed 32-bit (n mod 10) = 0 - Pass - average iteration duration: 6.461 ns
             Unsigned 32-bit (n mod 100) = 0 - Pass - average iteration duration: 0.931 ns
               Signed 32-bit (n mod 100) = 0 - Pass - average iteration duration: 6.286 ns
             Unsigned 32-bit (n mod 400) = 0 - Pass - average iteration duration: 0.990 ns
           Unsigned 32-bit (n mod 1,000) = 0 - Pass - average iteration duration: 1.048 ns
               Unsigned 64-bit (n mod 3) = 0 - Pass - average iteration duration: 0.698 ns
                 Signed 64-bit (n mod 3) = 0 - Pass - average iteration duration: 6.403 ns
              Unsigned 64-bit (n mod 10) = 0 - Pass - average iteration duration: 0.757 ns
            Signed 64-bit (n mod 10,000) = 0 - Pass - average iteration duration: 6.403 ns
             Unsigned 64-bit (n mod 100) = 0 - Pass - average iteration duration: 0.990 ns
        Signed 64-bit (n mod 86,400,000) = 0 - Pass - average iteration duration: 6.286 ns
   Unsigned 64-bit (n mod 1,000,000,000) = 0 - Pass - average iteration duration: 0.990 ns

New algorithm:

               Unsigned 32-bit (n mod 3) = 0 - Pass - average iteration duration: 0.524 ns
                 Signed 32-bit (n mod 3) = 0 - Pass - average iteration duration: 0.640 ns
              Unsigned 32-bit (n mod 10) = 0 - Pass - average iteration duration: 0.698 ns
                Signed 32-bit (n mod 10) = 0 - Pass - average iteration duration: 0.815 ns
             Unsigned 32-bit (n mod 100) = 0 - Pass - average iteration duration: 0.640 ns
               Signed 32-bit (n mod 100) = 0 - Pass - average iteration duration: 0.640 ns
             Unsigned 32-bit (n mod 400) = 0 - Pass - average iteration duration: 0.582 ns
           Unsigned 32-bit (n mod 1,000) = 0 - Pass - average iteration duration: 0.582 ns
               Unsigned 64-bit (n mod 3) = 0 - Pass - average iteration duration: 0.640 ns
                 Signed 64-bit (n mod 3) = 0 - Pass - average iteration duration: 0.815 ns
              Unsigned 64-bit (n mod 10) = 0 - Pass - average iteration duration: 0.815 ns
            Signed 64-bit (n mod 10,000) = 0 - Pass - average iteration duration: 0.873 ns
             Unsigned 64-bit (n mod 100) = 0 - Pass - average iteration duration: 0.815 ns
        Signed 64-bit (n mod 86,400,000) = 0 - Pass - average iteration duration: 0.873 ns
   Unsigned 64-bit (n mod 1,000,000,000) = 0 - Pass - average iteration duration: 0.757 ns

For unsigned modulus operations I tend to shave off a few fractions of a 
nanosecond, while for signed modulus operations, which in trunk still use 
IDIV internally due to some awkwardness with how moduli are calculated 
(something that still needs to be resolved), the saving is absolutely massive.

64-bit is a little slower than 32-bit, possibly because the code size is 
quite large: there are three different 64-bit constants that need to be 
loaded (for 32-bit and under, these constants can be encoded directly in 
the individual instructions as immediates).  Since the third constant is 
just the second one shifted left, a better approach would be to keep the 
second constant in a register temporarily and then shift it at the same 
time as another arithmetic operation executes (thus using two ALU ports 
to run them simultaneously).  As noted in Hacker's Delight, this is also 
the recommended approach for RISC processors such as AArch64, where 
encoding a 64-bit constant takes up to 4 instructions.  This, however, 
would require the use of a tempref, something I'm still researching.

Under x86_64, using a shift instead of a load could be handled with a 
peephole optimisation.  For example, the code generated for signed 
64-bit "(n mod 3) = 0" is currently:

     movq    %rax,%r8
     movq    $-6148914691236517205,%r11    { $AAAAAAAAAAAAAAAB; 3 * $AAAAAAAAAAAAAAAB = 1 (mod 2^64) }
     imulq    %r11,%r8
     movq    $3074457345618258602,%r11     { $2AAAAAAAAAAAAAAA }
     addq    %r11,%r8
     movq    $6148914691236517204,%r11     { $5555555555555554 = 2 * $2AAAAAAAAAAAAAAA }
     cmpq    %r11,%r8

With a small addition to DeepMOVOpt, the peephole optimizer could easily 
spot that 6148914691236517204 is exactly double 3074457345618258602, and 
change the second of those two constant loads to "shlq $1,%r11", which only 
requires 3 bytes to encode (4 if the shift count is something other than 1), 
compared to the 10 bytes needed for "movq $6148914691236517204,%r11".  As 
long as the original value is actually used at some point (in this case, 
via "addq %r11,%r8"), the change takes the same number of cycles to execute.

I'm waiting until my last division patches are uploaded before opening 
an issue because I'm adding quite a few new tests to tests/bench/bdiv 
and I want to minimise merge conflicts.  That and I still need to test 
how range checks affect the compilation, since the internal 
multiplications deliberately overflow.

Gareth aka. Kit


On 10/09/2021 21:59, J. Gareth Moreton via fpc-devel wrote:
> I suppose in truth, I can, and that in itself is probably fairly 
> cross-platform (although I'll stick with x86 for the moment and get 
> that working).  Sometimes the simple solution eludes me!  Is there 
> anything I need to take into account when it comes to range checking 
> (that is, if a third party tries to compile a unit with range checking 
> enabled), since "numerator * $AAAAAAAB" when constrained to 32 bits 
> will almost always overflow?
>
> Gareth aka. Kit
>
> On 10/09/2021 20:53, Florian Klämpfl via fpc-devel wrote:
>> On 10.09.21 at 21:17, J. Gareth Moreton via fpc-devel wrote:
>>> Hi everyone,
>>>
>>> I'm looking at ways to optimise div and mod, starting with x86 and 
>>> then probably AArch64.  The obvious one is attempting to merge "Q := 
>>> N div D; R := N mod D;", where D is a variable (but invariant 
>>> between the two instructions), since DIV returns the quotient in 
>>> R/EAX and the remainder in R/EDX in a single operation, or 
>>> converting the latter equation to "R := N - (Q * D);" if D is a 
>>> constant.
>>>
>>> However, inspired somewhat by "Hacker's Delight", I would like to 
>>> first see if I can optimise the Boolean condition "(X mod C) = 0", 
>>> where C is a constant.  By calculating the multiplicative reciprocal 
>>> of C (it may or may not be equal to the 'magic div' constant), you 
>>> can perform it with just a multiplication and a comparison - for 
>>> example, when dividing by 3 and returning the remainder:
>>>
>>> mov (numerator),%reg1
>>> mov $AAAAAAAB,%reg2 { 3 * $AAAAAAAB = 1 (mod 2^32) }
>>> imul %reg1,%reg2
>>> cmp $55555555,%reg2 { 2^32 div 3 = $55555555 }
>>>
>>> If %reg2 is less than or equal to $55555555, then the numerator is 
>>> an exact multiple of 3, and if it's greater, then it is not an exact 
>>> multiple.  The proof for this is explained in Hacker's Delight, but 
>>> relies on the fact that 3 and 2^32 are relatively prime and the 
>>> exact multiples of 3 multiplied by 3's reciprocal modulo 2^32 map 
>>> onto the values 0 to $55555555 (if the divisor is even, which means 
>>> it's not relatively prime to 2^32, you have to do a bit of trickery 
>>> with a bit rotation, but done properly, it's only 1 extra instruction).
>>>
>>> I'm trying to think of a way to make this clean and flexible, 
>>> especially where future expansion is concerned.  One idea I had was 
>>> to create a new platform-specific node such as "tx86divisible", 
>>> which takes an integer variable (x) and an integer constant (c) and 
>>> returns True if x mod c = 0, and "(X mod C) = 0" code is converted 
>>> to this node via tx86addnode.simplify (the node used for 
>>> comparisons), so it can be quickly converted into the optimal code 
>>> in pass_generate_code.  The other option is to do this conversion in 
>>> pass_generate_code, where a new node type isn't required but might 
>>> be a little trickier to make cross-platform... if it's possible to 
>>> make "tx86divisible" completely cross-platform - that is, have an 
>>> implementation on every target - the node conversion code only has 
>>> to exist in a single place, thus improving maintainability.
>>>
>>> What do you suggest?
>>
>> Can't you generate a mul and cmp node in tx86addnode.simplify which 
>> simulates this behavior?
>>
>
