[fpc-devel] Optimisation question

Mon Oct 30 22:10:45 CET 2023

Hi everyone,

I'm still exploring optimisations in generated x86 code, something which 
has become my speciality, and I found one new potential optimisation 
sequence that aims to reduce unnecessary calls to CMP and TEST when the 
result is already known.  However there are some situations where I'm 
not sure if the end result is better or not.  For example (taken from 
the cgiprotocol unit):

.Lj5:
     subl    $1,%esi
.Lj6:
     testl    %esi,%esi
     jng    .Lj9
     ...
     testl    %eax,%eax
     jne    .Lj5
.Lj9:
     movl    $-1,%eax
     testl    %esi,%esi
     cmovlel    %eax,%esi
     movl    %esi,%eax

In this instance, if the first TEST instruction results in "jng .Lj9" 
branching, it soon calls another TEST instruction with the same 
register, which will not have changed value, so CMOVLE will definitely 
set %esi to %eax (equal to -1) because "le" is equal to "ng" ("less than 
or equal" versus "not greater"), and then %esi is written back to %eax.  
All in all, this is a dependency chain of length 3.  When my new 
optimisation takes place, the following is generated instead:

.Lj5:
     subl    $1,%esi
.Lj6:
     testl    %esi,%esi
     jng    .Lj9
     ...
     testl    %eax,%eax
     jne    .Lj5
     testl    %esi,%esi
     jnle    .Lj12
.Lj9:
     movl    $-1,%esi
.Lj12:
     movl    %esi,%eax

Here, a number of other optimisations don't take place (hence why the 
CMOV instruction no longer exists), but now if "jng .Lj9" branches, it 
sets %esi to -1 directly before writing the result to %eax... a 
dependency chain of just 2.  However, there is also an extra conditional 
jump, which can no longer be optimised into CMOV because of the position 
of the .Lj9 label.

The question is... is the new code better, about the same or worse?  On 
modern processors most, TEST/Jcc can be macro-fused, but jumps always 
cause some penalty.

There are cases where the optimisation gives clear benefits - for 
example, in the cclasses unit - before:

     movq    %rcx,%rdx
     testq    %rcx,%rcx
     je    .Lj382
     movq    -8(%rcx),%rdx
.Lj382:
     testq    %rcx,%rcx
     jne    .Lj383
     leaq    FPC_EMPTYCHAR(%rip),%rcx
.Lj383:
     call    CCLASSES_$$_FPHASH$PCHAR$LONGINT$$LONGWORD

After:

     movq    %rcx,%rdx
     testq    %rcx,%rcx
     je    .Lj382
     movq    -8(%rcx),%rdx
     jmp    .Lj383
.Lj382:
     leaq    FPC_EMPTYCHAR(%rip),%rcx
.Lj383:
     call    CCLASSES_$$_FPHASH$PCHAR$LONGINT$$LONGWORD

In this case, if "je .Lj382" branches, then %rcx is definitely zero and 
so control flow can safely skip straight to the LEA instruction, since 
the "jne .Lj383" instruction will definitely not branch.  On the other 
hand, if "je .Lj382" does not branch, then %rcx is definitely non-zero 
and so the second "testq %rcx,%rcx" is once again deterministic, meaning 
"jne .Lj383" will definitely branch, therefore the second TEST 
instruction can be removed completely and the conditional jump turned 
into an unconditional jump.  The end result is that the number of jumps 
hasn't changed (exactly one jump will be taken regardless of the value 
of %rcx), but the instruction count has been reduced (note that %rdx 
can't be optimised out because its value is used as an actual parameter 
for the CALL).

Kit