[fpc-devel] More peephole optimisation questions

Tue Apr 19 21:03:11 CEST 2022

Hi everyone,

So this is another question on peephole optimisation for x86_64. 
Occasionally you get situations where you write a load of constants to 
the stack - in this case it's part of an array parameter to a function call:

     movl    $23199763,32(%rsp)
     movl    $262149,36(%rsp)
     movl    $33816983,40(%rsp)
     movl    $36176315,44(%rsp)
     movl    $50660102,48(%rsp)
     movl    $65340390,52(%rsp)

x86_64 doesn't support writing a 64-bit immediate directly to memory (a 
movq to memory only takes a sign-extended 32-bit immediate), so you have 
to load it into a register first. With that in mind, is the following 
code faster?

     movq    $1125921404878867,%rax
     movq    %rax,32(%rsp)
     movq    $155376089848611223,%rax
     movq    %rax,40(%rsp)
     movq    $280634838208545542,%rax
     movq    %rax,48(%rsp)
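
For reference, each 64-bit immediate above is just a pair of neighbouring 
32-bit constants concatenated in little-endian order (the constant at the 
lower stack offset becomes the low half). A quick Python sketch to check 
the arithmetic; the helper name is made up for illustration and is not 
anything in the compiler:

```python
# Hypothetical helper, for illustration only: pack the 32-bit immediate
# stored at offset N (low) and the one stored at offset N+4 (high) into
# the single 64-bit immediate that would be stored at offset N.
def combine_imm32(low, high):
    # x86 is little-endian, so the dword at the lower address is the
    # low half of the resulting qword.
    return ((high & 0xFFFFFFFF) << 32) | (low & 0xFFFFFFFF)

# The six movl immediates from the listing, paired by adjacent offsets
# 32/36, 40/44 and 48/52.
pairs = [(23199763, 262149), (33816983, 36176315), (50660102, 65340390)]
combined = [combine_imm32(low, high) for low, high in pairs]

# These should be the three movq immediates from the listing.
assert combined == [1125921404878867,
                    155376089848611223,
                    280634838208545542]
print(combined)
```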

I know each store has to wait for the load of its immediate just before 
it, but logic tells me that parallelisation, out-of-order execution and 
register renaming will ensure that loading %rax with the next immediate 
can happen at the same time as its previous value is being written to 
memory.  I know there are a lot of variables, like how 
smart the processor is and how many ALUs and AGUs are available, so 
that's why I'm after a second opinion before I start proposing an 
optimisation that's speculative at best.  If necessary, I could even do 
this (if the registers are available):

     movq    $1125921404878867,%rax
     movq    $155376089848611223,%rcx
     movq    $280634838208545542,%rdx
     movq    %rax,32(%rsp)
     movq    %rcx,40(%rsp)
     movq    %rdx,48(%rsp)

At the very least I'm pretty sure it's not worth it to concatenate a 
single pair of 32-bit immediates.  For example, if it was just the first 
two:

     movl    $23199763,32(%rsp)
     movl    $262149,36(%rsp)

... it would not be worth it to transmute them into:

     movq    $1125921404878867,%rax
     movq    %rax,32(%rsp)

Since in the former case, the two can be executed in parallel and the 
only barrier is memory latency (almost all modern Intel CPUs have at 
least 2 AGUs), while the latter case introduces a dependency.

Gareth aka. Kit

P.S. In this case, the assembly above is generated for this array 
parameter in aoptx86: "[A_CMP, A_TEST, A_BSR, A_BSF, A_COMISS, A_COMISD, 
A_UCOMISS, A_UCOMISD, A_VCOMISS, A_VCOMISD, A_VUCOMISS, A_VUCOMISD]". 
It is part of the CMOV optimisations and lists the instructions that are 
used for comparisons. If the opcode matches one of the above, the 
peephole optimiser will see if it's possible to position MOV 
instructions before the comparison instead of between the comparison and 
the conditional jump. This works better for macro-fusion, and it also 
makes it possible to turn "mov $0,%reg" into "xor %reg,%reg", which 
cannot be done while the FLAGS register is in use because XOR overwrites 
the flags. Moving the MOV before the comparison eliminates that problem.
