[fpc-devel] Producing assembly with less branches?

J. Gareth Moreton gareth at moreton-family.com
Mon Jul 20 02:37:37 CEST 2020


On 19/07/2020 22:37, Stefan Glienke wrote:
> clang and gcc emit this - I would guess they detect quite some common 
> patterns like this.
>
>  ...
>   cmp     eax, edx
>   mov     edx, -1
>   setg    al
>   movzx   eax, al
>   cmovl   eax, edx
>   ret

I think I can make improvements to that already! (Note that the 
sequences above and below are in Intel notation.)

CMP   EAX, EDX
MOV   EAX, 0   ; Note: don't use XOR EAX, EAX because this scrambles the
               ; FLAGS register
MOV   EDX, -1
SETG  AL
CMOVL EAX, EDX
RET

I believe that executes one cycle faster (about 20% faster for the 
entire sequence) on modern processors because it removes the MOVZX from 
the dependency chain "SETG AL; MOVZX EAX, AL; CMOVL EAX, EDX". It would 
need some testing to be sure, though.
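As a sanity check on the semantics (not the timing), the shortened 
sequence can be modelled in C one instruction at a time. This is only a 
sketch: the name compare_model and the local variables are made up for 
illustration, and the two booleans stand in for the FLAGS conditions 
that SETG and CMOVL test after the CMP.

```c
#include <stdint.h>

/* Hypothetical model of the six-instruction sequence above.
   "greater" and "less" stand in for the signed conditions that
   SETG and CMOVL read from FLAGS after the CMP. */
int32_t compare_model(int32_t eax, int32_t edx)
{
    /* CMP EAX, EDX : record the two conditions used later */
    int greater = eax > edx;
    int less    = eax < edx;

    /* MOV EAX, 0 : MOV leaves FLAGS untouched, so the conditions survive */
    eax = 0;

    /* MOV EDX, -1 : constant for the "less" case */
    edx = -1;

    /* SETG AL : only the low byte is written; the upper bits are already 0 */
    eax = (int32_t)(((uint32_t)eax & 0xFFFFFF00u) | (uint32_t)greater);

    /* CMOVL EAX, EDX : take -1 only when the first operand was smaller
       (written as an if here; the real instruction does not branch) */
    if (less)
        eax = edx;

    /* RET : result is in EAX */
    return eax;
}
```

Calling compare_model(1, 2), compare_model(2, 1) and compare_model(5, 5) 
gives -1, 1 and 0 respectively, which matches what the longer 
SETG/MOVZX/CMOVL sequence computes.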

The difficulty with CMOV is that its destination must be a register (and 
not an 8-bit one).  Its source can be a register or a memory address, 
but not an immediate value, and it cannot write to memory.  If there are 
registers free at that point in the code though, one could potentially 
write the constants to temporary registers beforehand, and then assign 
them to the registers that matter via CMOV (e.g. as shown above with the 
-1 value).
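To illustrate the "can't write to memory" restriction from the source 
side, here is a minimal C sketch of how a conditional store has to be 
reshaped before CMOV is usable: load the old value, select in a 
register, then store unconditionally. The function store_if_less and 
its parameters are hypothetical, and whether a particular compiler 
actually emits CMOV for it is up to its optimiser.

```c
/* Hypothetical helper: *dst becomes value when a < b, otherwise it
   keeps its old contents. */
void store_if_less(int *dst, int a, int b, int value)
{
    int current = *dst;                       /* load: CMOV may read memory   */
    int chosen  = (a < b) ? value : current;  /* select between two registers */
    *dst = chosen;                            /* store happens on both paths  */
}
```

Note that this is not always a safe substitute for the branching 
version: it writes to *dst even when the condition is false, which 
matters if the memory is read-only or shared with another thread.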

I'm all for improving the generated assembly language where I can.  
There are some traps that one has to be careful of though, usually 
involving false dependencies.  For example, when setting registers to 
-1, some compilers would use "OR EAX, -1" instead of "MOV EAX, -1" on 
account of it taking fewer bytes to encode.  Both Visual C++ and GCC did 
this at one point, but it creates a false dependency on the previous 
value of EAX, so it can incur a performance penalty.
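A small C sketch of why the OR form is tempting and where the false 
dependency comes from: OR with -1 gives -1 whatever the register held, 
so the result is identical to the MOV, yet the instruction still takes 
the old value as an input. The function names are made up for 
illustration, the byte counts in the comments are for the standard 
32-bit encodings, and a C compiler will of course fold both functions 
to the same constant; this only models the data flow.

```c
#include <stdint.h>

/* Models "OR EAX, -1" (83 C8 FF, 3 bytes): the old value is an input,
   even though it cannot change the result. */
uint32_t set_all_ones_or(uint32_t old_eax)
{
    return old_eax | 0xFFFFFFFFu;
}

/* Models "MOV EAX, -1" (B8 FF FF FF FF, 5 bytes): no input at all,
   so nothing earlier in the instruction stream has to finish first. */
uint32_t set_all_ones_mov(void)
{
    return 0xFFFFFFFFu;
}
```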

The final thing to remember is that, by default, the i386 compiler 
produces code that will run on the oldest 80386 processors.  CMOV was 
only introduced with the Intel Pentium Pro in 1995.  If compiling for 
x86_64, or if you specify compiler parameters that raise the minimum 
supported processor, then CMOV will be used.

(It also just made me realise that Pass 2 of the peephole optimiser 
would not work with virtual registers, because of CMOV's restriction 
that it can't write to memory addresses, including the stack.)

Gareth aka. Kit


