[fpc-devel] Ordinal optimisation question

Thu Dec 2 06:36:08 CET 2021

Hi everyone,

I'm playing around with some node-level optimisations because I've 
noticed that some Int64 operations on 32-bit platforms can be sped up if 
the arithmetic is changed slightly.  For example, when performing x + x 
with Int64s, it's faster to do x shl 1, especially if x is on the stack.

On i386, the instructions boil down to, for example, "addl %eax,%eax" 
and "shll $1,%eax", or maybe "movl (ref),%edx; leal (%edx,%edx),%eax" 
versus "movl (ref),%eax; shll $1,%eax".  In terms of instruction size, 
both "addl %eax,%eax" and "shll $1,%eax" instructions take 2 bytes to 
encode.  In more complex situations, shift instructions seem to be 
friendlier for the peephole optimizer - for example, in sfpu128:

     movl    12(%ebp),%edx
     leal    (%edx,%edx),%eax
     testl   %eax,%eax
     seteb   %al

With a shift instead of an add:

     movl    12(%ebp),%eax
     shll    $1,%eax
     seteb   %al

Of course, with the first example, the peephole optimizer or the code 
generator could be more efficient since %edx is not reused and it could 
just be encoded as "movl 12(%ebp),%edx; addl %eax,%eax; seteb %al", 
since addl sets the flags and hence the testl instruction can be 
dropped.  Where a 64-bit ordinal is concerned though, take a look at 
this snippet when a 64-bit integer is added to itself:

     movl    8(%ebp),%edx
     movl    12(%ebp),%eax
     addl    8(%ebp),%edx
     adcl    12(%ebp),%eax

And with a shift:

     movl    8(%ebp),%edx
     movl    12(%ebp),%eax
     shldl   $1,%edx,%eax
     shll    $1,%edx

In this case, we don't have to read from the stack twice and we also 
drop dependency on the carry flag, which might permit some stronger 
optimisations if, say, it's known that the upper 32 bits are equal to 
zero, and even if the value of %edx is not known, the dependency chain 
is broken as long as register aliasing is used to prevent the 
write-after-read pipeline stall between "shldl $1,%edx,%eax" and "shll 
%1,%edx", something which cannot be done between "addl 8(%ebp),%edx" and 
"adcl 12(%ebp),%eax" because of the carry flag depending on the result 
of "addl".

However, in general, when dealing with ordinals that fit into a CPU 
word, is it better to perform x + x or x shl 1? (I'm assuming x * 2 is 
generally worse).  Not just for i386, but also for ARM-32, for example.  
(I'm also assuming that range and overflow checking is turned off - 
granted, shl does set the carry flag if the last bit that gets shifted 
out was a 1, so "shll $1,%eax" and "addl %eax,%eax" both set the carry 
flag if the most significant bit of %eax is set, but this is only for i386).

Gareth aka. Kit

-- 
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus