[fpc-devel] Policy on platform-specific compiler code

Sun Oct 18 06:07:03 CEST 2020

Apologies, Jonas; I hope I didn't come off as condescending or rude in 
that last e-mail - I was joking because while my algorithm is faster, a 
few nanoseconds probably isn't going to cut it in the grand scheme of 
things!

I got my code to work, treating "magic add" in the same way as the 
existing algorithm, and further testing under -O4 shows that, while 
still faster, it's not as pronounced for small divisors, but still much 
faster for large divisors.  Of course, being x86_64 assembly language, 
I'm not expecting it to be approved even if it's isolated to a utility 
function, but I have linked a patch and a test program to this e-mail 
for those who are curious and as a proof of concept (tests on 
x86_64-win64 and x86_64-linux show no regressions).

The speed can be fractionally increased with the removal of the two 
internal error checks and the code that checks for powers of two (it's 
there to prevent incorrect results, but the routine shouldn't be called 
at all for powers of two, since they can be translated directly into bit 
shifts).  For the internal errors, one checks to see if the divisor is 
zero (an obviously bad expression), and the other is triggered if the 
total shift is greater than 127, which is the base bit count (e.g. 64) 
plus the base 2 logarithm of the divisor, so it shouldn't ever exceed 
127 unless as new 128-bit interger type is introduced.

One final thing to note is that my algorithm produces different 
reciprocals (magic numbers) and shifts for some divisors, but the test 
program I've supplied shows that they're still correct, and careful 
analysis of the reciprocals will reveal that one is a bit-shift of the 
other (the existing algorithm seems to trim off trailing zero bits and 
reduces the shift to compensate).  Also, for divisors greater than N-1 
bits in size, my algorithm produces a 'magic add' solution while the 
existing one doesn't, but for divisors this size, a simple comparison is 
faster than doing a multiplication and shift, since the quotient will 
always be 0 or 1 in these instances.

Gareth aka. Kit

On 18/10/2020 01:13, J. Gareth Moreton via fpc-devel wrote:
> Well, I think you might be right on this one, Jonas!
>
> I've tested my algorithm against the one used in the compiler. It's 5 
> times faster when used with small divisors (so loop iterations are 
> minimal)... but that amounts to about 15 nanoseconds compared to 75 
> nanoseconds!  Additionally, it treats the "magic add" differently (the 
> reason why it randomly failed).
>
> Gareth aka. Kit
>
> On 16/10/2020 23:14, J. Gareth Moreton via fpc-devel wrote:
>> On 16/10/2020 10:47, Jonas Maebe via fpc-devel wrote:
>>> On 16/10/2020 10:14, J. Gareth Moreton via fpc-devel wrote:
>>>
>>>> Before I go optimising the wrong thing, I have a question to ask.
>>>> What's the policy on platform-specific assembly language in the
>>>> compiler, or any code designed to run on a specific (source) platform
>>>> (and using a more generic implementation otherwise via $ifdef)?  I ask
>>>> because I have a faster algorithm for 
>>>> "calc_divconst_magic_unsigned" in
>>>> 'compiler/cgutils.pas', but it's only able to work because it can take
>>>> advantage of the x86 DIV instruction using RDX:RAX (or EDX:EAX) as a
>>>> double-wide dividend. It is somewhat faster than what currently exists
>>>> because of the lack of a loop whose iteration count is proportional to
>>>> log2(d), where d is the desired divisor (in other words, it's 
>>>> slower the
>>>> bigger the divisor is, whereas my algorithm is constant speed).
>>> In general, there should be no assembly language in the compiler. Ialso
>>>   don't think that's worth it in this case. Unless (or maybe "even if")
>>> your code contains nothing but divisions by constants, I doubt this 
>>> code
>>> has a significant effect on the total compile time.
>>
>> Division by constants has a fairly frequent occurrance in code. For 
>> example, dividing by 10000 whenever Currency is used, and 1000 often 
>> appears in timing measurements (e.g. if t is in milliseconds, then t 
>> div 1000 is seconds and t mod 1000 is the leftover milliseconds).
>>
>> The existing code contains two divisions by a variable (so they can't 
>> be optimised) and a loop that has, at most, N iterations, where N is 
>> the bit size (often 32 or 64).  The loop contains only addition, 
>> subtraction and multiplication, and 3 branches to contend with (not 
>> including the repeat...until jump).  My code contains a single DIV, 
>> but also a BSR which is effectively used to get the base-2 logarithm 
>> of the divisor (also throws an internal error if the divisor is zero, 
>> since this should have been caught already).
>>
>> Granted, you may be right and the saving won't be worth it, not to 
>> mention the additional complexity (and my function currently fails on 
>> certain divisors unexpectedly, so I'll have to do some deeper testing 
>> if just for my own peace of mind!) - only a lot of timing tests will 
>> determine that.  Nevertheless, thanks for providing the article to 
>> calculating the reciprocal though - that's definitely helpful in 
>> understanding what's going on.
>>
>> Gareth aka. Kit
>>
>> _______________________________________________
>> fpc-devel maillist  -  fpc-devel at lists.freepascal.org
>> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
>>
>


-- 
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus
-------------- next part --------------
Index: compiler/cgutils.pas
===================================================================

--- compiler/cgutils.pas	(revision 47116)
+++ compiler/cgutils.pas	(working copy)
@@ -409,7 +409,9 @@
         ad,anc,delta,q1,r1,q2,r2,t: aWord;
         two_N_minus_1: aWord;
       begin
-        assert((d<-1) or (d>1));
+        if not ((d<-1) or (d>1)) then
+          InternalError(2020101601);
+
         two_N_minus_1:=aWord(1) shl (N-1);
 
         ad:=abs(d);
@@ -445,6 +447,207 @@
       end;
 
 
+{$if defined(CPUX86_64) and defined(cpu64bitalu)}
+    { We can take advantage of the RDX:RAX dividend of the x86_64 DIV operand for a faster algorithm }
+    procedure calc_divconst_magic_unsigned(N: byte; d: aWord; out magic_m: aWord; out magic_add: boolean; out magic_shift: byte); assembler; nostackframe;
+      asm
+      {$ifdef MSWINDOWS} { Microsoft x64 ABI }
+        { %cl        = N
+          %rdx       = d
+          %r8        = magic_m address
+          %r9        = magic_add address
+          0x28(%rsp) = magic_shift address
+        }
+
+        cmp  $64,  %cl
+        movzbl %cl,%eax { Preserve %cl }
+
+        { Compute bitmask }
+        setb %r11b { Account for situation where %cl is 64 by setting it to zero initially }
+
+        { Compute floor(log2(d)) }
+        bsr  %rdx, %r10
+
+        movzbl %r11b,%r11d
+        shl  %cl,  %r11
+        sub  $1,   %r11
+        mov  %r11, 8(%rsp) { Store bit mask for later in shadow space }
+
+        mov  $2020101610,%ecx { Prepare possible internal error call }
+        jz   InternalError { InternalError(2020101610) - d was zero }
+
+        { Do a quick power of two check (is a power of two iif d and (d - 1) = 0}
+        lea  -1(%rdx),%r11
+        movzbl %al,%ecx { Restore %ecx }
+        test %rdx, %r11
+
+        { Get address of magic_shift }
+        mov  0x28(%rsp),%r11
+        mov  $1,   %eax
+
+        jnz  .LNotPower2
+
+        sub  %r10b,%cl
+        movb $0,   (%r9)
+        shl  %cl,  %rax
+        mov  %rax, (%r8)
+        movb $0,   (%r11)
+        ret
+
+      .LNotPower2:
+        mov  %r10b,(%r11) { We already have the shift, so write it out now (allows us to reuse %r11) }
+
+        { Combine the bit shifts }
+        add  %r10b,%cl
+
+        { Preserve d since we need to use %rdx specifically }
+        mov  %rdx, %r10
+        { Preserve N since we need to use %rcx specifically }
+        movzbl %cl,%r11d
+
+        { Note, if we ever need to support a 128-bit ALU, this function will need
+          to be rewritten to accommodate it, otherwise InternalError(2020101611)
+          will be raised. i.e. 64 bits maximum for now }
+        cmp  $127, %r11b
+        mov  $2020101611,%ecx { Prepare possible internal error call }
+        ja   InternalError { InternalError(2020101611) - bit size was too large }
+
+        { Restore %ecx }
+        mov  %r11d,%ecx
+        xor  %edx, %edx
+        shl  %cl,  %rax
+        sub  $64,  %cl
+        jb   .LSkipHighShift
+
+        mov  $1,   %edx
+        { Shifts greater than 64 can cause %rax to take on an unpredictable
+          value (specifically 1 shl (%cl and $3F)), so zero it here }
+        xor  %eax, %eax
+        shl  %cl,  %rdx
+      .LSkipHighShift:
+        { Calculate the initial reciprocal (magic number) }
+        div  %r10
+
+        { Get address of magic_shift }
+        mov  0x28(%rsp),%r11
+
+        { Determine if most-significant fractional bit is 1 or 0 by determining
+          if the remainder >= d/2 }
+        shl  $1,   %rdx
+        jc   .LMagicAddOverflow { If the above shift overflowed, then it definitely is greater than d now }
+        add  $1,   %rdx { This ensures that the carry gets set if the remainder is exactly d/2 }
+        jc   .LMagicAddOverflow { If the above add overflowed, then it definitely is greater than d now }
+        cmp  %rdx, %r10 { Note the order of the operands - this forces the carry bit to be set if %rdx >= %r10 }
+      .LMagicAddOverflow:
+        setnc (%r9)
+        jc   .LMagicAddEnd
+
+        { Shift the reciprocal up by 1 bit (the msb gets lost, but this is retrieved via the addition after multiplication }
+        shl  $1,   %rax
+        { Adjust magic_shift }
+        addb $1,   (%r11)
+      .LMagicAddEnd:
+        { Set magic add if the carry is clear }
+        add  $1,   %rax { Increment the reciprocal }
+        and  8(%rsp),%rax { Apply the bitmask to the reciprocal }
+        mov  %rax, (%r8)
+
+      {$else MSWINDOWS} { System V ABI }
+
+        { %dil       = N
+          %rsi       = d
+          %rdx       = magic_m address
+          %rcx       = magic_add address
+          %r8        = magic_shift address
+        }
+
+        cmp  $64,  %dil
+
+        { Preserve %rcx since we need to use it specifically }
+        mov  %rcx, %r9
+
+        { Compute bitmask }
+        setb %r11b { Account for situation where %cl is 64 by setting it to zero initially }
+
+        { Compute floor(log2(d)) }
+        bsr  %rsi, %r10
+
+        movzbl %dil,%ecx { We have to use %cl for shl }
+        movzbl %r11b,%r11d
+        shl  %cl,  %r11
+        sub  $1,   %r11
+
+        mov  $2020101610,%edi { Prepare possible internal error call }
+        jz   InternalError { InternalError(2020101610) - d was zero }
+
+        { Do a quick power of two check (is a power of two iif d and (d - 1) = 0}
+        lea  -1(%rsi),%rdi
+        mov  $1,   %eax
+        test %rsi, %rdi
+
+        jnz  .LNotPower2
+
+        sub  %r10b,%cl
+        movb $0,   (%r8)
+        shl  %cl,  %rax
+        movb $0,   (%r9)
+        mov  %rax, (%rdx)
+        ret
+
+      .LNotPower2:
+        { Combine the bit shifts }
+        add  %r10b,%cl
+
+        { Note, if we ever need to support a 128-bit ALU, this function will need
+          to be rewritten to accommodate it, otherwise InternalError(2020101611)
+          will be raised. i.e. 64 bits maximum for now }
+        cmp  $127, %cl
+        mov  $2020101611,%edi { Prepare possible internal error call }
+        ja   InternalError { InternalError(2020101611) - bit size was too large }
+
+        { Preserve %rdx since we need to use that register specifically }
+        mov  %rdx, %rdi
+
+        xor  %edx, %edx
+        shl  %cl,  %rax
+        sub  $64,  %cl
+        jb   .LSkipHighShift
+
+        mov  $1,   %edx
+        { Shifts greater than 64 can cause %rax to take on an unpredictable
+          value (specifically 1 shl (%cl and $3F)), so zero it here }
+        xor  %eax, %eax
+        shl  %cl,  %rdx
+      .LSkipHighShift:
+        { Calculate the initial reciprocal (magic number) }
+        div  %rsi
+
+        { Determine if most-significant fractional bit is 1 or 0 by determining
+          if the remainder >= d/2 }
+        shl  $1,   %rdx
+        jc   .LMagicAddOverflow { If the above shift overflowed, then it definitely is greater than d now }
+        add  $1,   %rdx { This ensures that the carry gets set if the remainder is exactly d/2 }
+        jc   .LMagicAddOverflow { If the above add overflowed, then it definitely is greater than d now }
+        cmp  %rdx, %rsi { Note the order of the operands - this forces the carry bit to be set if %rdx >= %rsi }
+      .LMagicAddOverflow:
+        setnc (%r9)
+        jc   .LMagicAddEnd
+
+        { Shift the reciprocal up by 1 bit (the msb gets lost, but this is retrieved via the addition after multiplication }
+        shl  $1,   %rax
+        { Adjust magic_shift }
+        add  $1,   %r10b
+      .LMagicAddEnd:
+
+        add  $1,   %rax { Increment the reciprocal }
+        mov  %r10b, (%r8) { Write magic shift }
+        and  %r11, %rax { Apply the bitmask to the reciprocal }
+        mov  %rax, (%rdi)
+      {$endif MSWINDOWS}
+      end;
+
+{$else CPUX86_64 and cpu64bitalu}
+
     procedure calc_divconst_magic_unsigned(N: byte; d: aWord; out magic_m: aWord; out magic_add: boolean; out magic_shift: byte);
       var
         p: aInt;
@@ -494,6 +697,7 @@
         magic_m:=(q2+1) and mask;        { resulting magic number }
         magic_shift:=p-N;     { resulting shift }
       end;
+{$endif CPUX86_64 and cpu64bitalu}
 {$pop}
 
 end.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testprog.zip
Type: application/x-zip-compressed
Size: 3721 bytes
Desc: not available
URL: <http://lists.freepascal.org/pipermail/fpc-devel/attachments/20201018/0f498b66/attachment.bin>