[fpc-devel] Ordinal optimisation question
J. Gareth Moreton
gareth at moreton-family.com
Thu Dec 2 06:36:08 CET 2021
Hi everyone,
I'm playing around with some node-level optimisations because I've
noticed that some Int64 operations on 32-bit platforms can be sped up if
the arithmetic is changed slightly. For example, when performing x + x
with Int64s, it's faster to do x shl 1, especially if x is on the stack.
On i386, the instructions boil down to, for example, "addl %eax,%eax"
and "shll $1,%eax", or maybe "movl (ref),%edx; leal (%edx,%edx),%eax"
versus "movl (ref),%eax; shll $1,%eax". In terms of instruction size,
both "addl %eax,%eax" and "shll $1,%eax" instructions take 2 bytes to
encode. In more complex situations, shift instructions seem to be
friendlier for the peephole optimizer - for example, in sfpu128:
movl 12(%ebp),%edx
leal (%edx,%edx),%eax
testl %eax,%eax
seteb %al
With a shift instead of an add:
movl 12(%ebp),%eax
shll $1,%eax
seteb %al
Of course, with the first example, the peephole optimizer or the code
generator could be more efficient since %edx is not reused and it could
just be encoded as "movl 12(%ebp),%edx; addl %eax,%eax; seteb %al",
since addl sets the flags and hence the testl instruction can be
dropped. Where a 64-bit ordinal is concerned though, take a look at
this snippet when a 64-bit integer is added to itself:
movl 8(%ebp),%edx
movl 12(%ebp),%eax
addl 8(%ebp),%edx
adcl 12(%ebp),%eax
And with a shift:
movl 8(%ebp),%edx
movl 12(%ebp),%eax
shldl $1,%edx,%eax
shll $1,%edx
In this case, we don't have to read from the stack twice and we also
drop dependency on the carry flag, which might permit some stronger
optimisations if, say, it's known that the upper 32 bits are equal to
zero, and even if the value of %edx is not known, the dependency chain
is broken as long as register aliasing is used to prevent the
write-after-read pipeline stall between "shldl $1,%edx,%eax" and "shll
%1,%edx", something which cannot be done between "addl 8(%ebp),%edx" and
"adcl 12(%ebp),%eax" because of the carry flag depending on the result
of "addl".
However, in general, when dealing with ordinals that fit into a CPU
word, is it better to perform x + x or x shl 1? (I'm assuming x * 2 is
generally worse). Not just for i386, but also for ARM-32, for example.
(I'm also assuming that range and overflow checking is turned off -
granted, shl does set the carry flag if the last bit that gets shifted
out was a 1, so "shll $1,%eax" and "addl %eax,%eax" both set the carry
flag if the most significant bit of %eax is set, but this is only for i386).
Gareth aka. Kit
--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus
More information about the fpc-devel
mailing list