[fpc-devel] More peephole optimisation questions
J. Gareth Moreton
gareth at moreton-family.com
Tue Apr 19 21:03:11 CEST 2022
Hi everyone,
So this is another question on peephole optimisation for x86_64.
Occasionally you get situations where you write a load of constants to
the stack - in this case it's part of an array parameter to a function call:
movl $23199763,32(%rsp)
movl $262149,36(%rsp)
movl $33816983,40(%rsp)
movl $36176315,44(%rsp)
movl $50660102,48(%rsp)
movl $65340390,52(%rsp)
x86_64 doesn't support writing an arbitrary 64-bit constant directly to
memory (the immediate of a 64-bit store is limited to a sign-extended
32-bit value), so you have to load it into a register first. With that
in mind, is the following code faster?
movq $1125921404878867,%rax
movq %rax,32(%rsp)
movq $155376089848611223,%rax
movq %rax,40(%rsp)
movq $280634838208545542,%rax
movq %rax,48(%rsp)
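As a sanity check on the merged immediates, here is a minimal Python
sketch of how each pair of 32-bit constants folds into one 64-bit value.
It assumes little-endian layout, so the constant at the lower stack
offset becomes the low half. The helper name combine_imm32_pair is made
up for illustration and is not part of the compiler.

```python
def combine_imm32_pair(low: int, high: int) -> int:
    """Fold two 32-bit immediates into one 64-bit value.

    'low' is the constant stored at offset N, 'high' the one at N+4.
    On little-endian x86_64 the lower address holds the low 32 bits.
    """
    return ((high & 0xFFFFFFFF) << 32) | (low & 0xFFFFFFFF)


# (value at offset N, value at offset N+4) for offsets 32, 40 and 48
pairs = [
    (23199763, 262149),
    (33816983, 36176315),
    (50660102, 65340390),
]

combined = [combine_imm32_pair(low, high) for low, high in pairs]

for value in combined:
    print(value)
```

The three printed values should match the 64-bit immediates used in the
listing above.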
I know there will be a pipeline stall between the first two
instructions, because the store depends on the load, but logic tells me
that parallelisation, out-of-order execution and register renaming will
ensure that loading %rax with the next immediate can happen at the same
time as its previous value is being written to memory. I know there are
a lot of variables, like how smart the processor is and how many ALUs
and AGUs are available, so that's why I'm after a second opinion before
I start proposing an optimisation that's speculative at best. If
necessary, I could even do this (if the registers are available):
movq $1125921404878867,%rax
movq $155376089848611223,%rcx
movq $280634838208545542,%rdx
movq %rax,32(%rsp)
movq %rcx,40(%rsp)
movq %rdx,48(%rsp)
At the very least I'm pretty sure it's not worth it to concatenate a
single pair of 32-bit immediates. For example, if it was just the first
two:
movl $23199763,32(%rsp)
movl $262149,36(%rsp)
... it would not be worth it to transmute them into:
movq $1125921404878867,%rax
movq %rax,32(%rsp)
Since in the former case, the two can be executed in parallel and the
only barrier is memory latency (almost all modern Intel CPUs have at
least 2 AGUs), while the latter case introduces a dependency.
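To make the trade-off concrete, here is a rough Python sketch of the
decision such a peephole pass might make. Everything in it is
hypothetical: the function names, the assumption that %rax is free as a
scratch register, and the threshold of two register-routed pairs, which
only encodes my guess that a single pair is not worth merging. It also
covers the one case where merging saves an instruction outright: when
the combined value fits in a sign-extended 32-bit immediate, a single
64-bit store needs no register at all.

```python
from typing import List, Tuple

Store = Tuple[int, int]  # (offset from %rsp, 32-bit immediate)


def combine(low: int, high: int) -> int:
    """Little-endian merge: 'low' sits at the lower address."""
    return ((high & 0xFFFFFFFF) << 32) | (low & 0xFFFFFFFF)


def as_signed64(value: int) -> int:
    """Interpret a 64-bit pattern as a signed integer."""
    value &= 0xFFFFFFFFFFFFFFFF
    return value - (1 << 64) if value >> 63 else value


def fits_imm32(value: int) -> bool:
    """True if the value can be encoded as a sign-extended imm32."""
    return -(1 << 31) <= as_signed64(value) < (1 << 31)


def plan_stores(stores: List[Store], min_register_pairs: int = 2) -> List[str]:
    """Return AT&T-style instructions for a run of 32-bit constant stores.

    Assumes 'stores' is sorted by offset and that %rax is free
    (both are assumptions of this sketch).
    """
    # Greedily group stores whose offsets are exactly 4 bytes apart.
    units = []
    i = 0
    while i < len(stores):
        if i + 1 < len(stores) and stores[i + 1][0] == stores[i][0] + 4:
            units.append((stores[i], stores[i + 1]))
            i += 2
        else:
            units.append((stores[i],))
            i += 1

    # Count the pairs that would have to go through a register.
    register_pairs = sum(
        1 for u in units
        if len(u) == 2 and not fits_imm32(combine(u[0][1], u[1][1]))
    )
    use_register = register_pairs >= min_register_pairs

    out = []
    for u in units:
        if len(u) == 2:
            value = combine(u[0][1], u[1][1])
            offset = u[0][0]
            if fits_imm32(value):
                # One store replaces two and needs no scratch register.
                out.append(f"movq ${as_signed64(value)},{offset}(%rsp)")
                continue
            if use_register:
                out.append(f"movq ${value},%rax")
                out.append(f"movq %rax,{offset}(%rsp)")
                continue
        # Otherwise keep the original independent 32-bit stores.
        for offset, imm in u:
            out.append(f"movl ${imm},{offset}(%rsp)")
    return out
```

With the six stores from the start of this mail, plan_stores produces
the three load/store pairs through %rax; with only the first two stores
it leaves the pair of movl instructions alone.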
Gareth aka. Kit
P.S. In this case, the assembly language is generated by this parameter
in aoptx86: "[A_CMP, A_TEST, A_BSR, A_BSF, A_COMISS, A_COMISD,
A_UCOMISS, A_UCOMISD, A_VCOMISS, A_VCOMISD, A_VUCOMISS, A_VUCOMISD]"...
This is part of the CMOV optimisations and is a load of instructions
that are used for comparisons. If the opcode matches one of the above,
the peephole optimizer will see if it's possible to position MOV
instructions before the comparison instead of between the comparison and
the conditional jump. This works better for macro-fusion, and it also
makes it possible to turn "mov $0,%reg" into "xor %reg,%reg", which
cannot be done while the FLAGS register is in use (XOR scrambles them).
Moving the MOV before the comparison eliminates that problem.