[fpc-devel] Optimisation question
J. Gareth Moreton
gareth at moreton-family.com
Mon Oct 30 22:10:45 CET 2023
Hi everyone,
I'm still exploring optimisations in generated x86 code, something which
has become my speciality, and I found one new potential optimisation
sequence that aims to reduce unnecessary calls to CMP and TEST when the
result is already known. However there are some situations where I'm
not sure if the end result is better or not. For example (taken from
the cgiprotocol unit):
.Lj5:
subl $1,%esi
.Lj6:
testl %esi,%esi
jng .Lj9
...
testl %eax,%eax
jne .Lj5
.Lj9:
movl $-1,%eax
testl %esi,%esi
cmovlel %eax,%esi
movl %esi,%eax
In this instance, if the first TEST instruction results in "jng .Lj9"
branching, it soon calls another TEST instruction with the same
register, which will not have changed value, so CMOVLE will definitely
set %esi to %eax (equal to -1) because "le" is equal to "ng" ("less than
or equal" versus "not greater"), and then %esi is written back to %eax.
All in all, this is a dependency chain of length 3. When my new
optimisation takes place, the following is generated instead:
.Lj5:
subl $1,%esi
.Lj6:
testl %esi,%esi
jng .Lj9
...
testl %eax,%eax
jne .Lj5
testl %esi,%esi
jnle .Lj12
.Lj9:
movl $-1,%esi
.Lj12:
movl %esi,%eax
Here, a number of other optimisations don't take place (hence why the
CMOV instruction no longer exists), but now if "jng .Lj9" branches, it
sets %esi to -1 directly before writing the result to %eax... a
dependency chain of just 2. However, there is also an extra conditional
jump, which can no longer be optimised into CMOV because of the position
of the .Lj9 label.
The question is... is the new code better, about the same or worse? On
modern processors most, TEST/Jcc can be macro-fused, but jumps always
cause some penalty.
There are cases where the optimisation gives clear benefits - for
example, in the cclasses unit - before:
movq %rcx,%rdx
testq %rcx,%rcx
je .Lj382
movq -8(%rcx),%rdx
.Lj382:
testq %rcx,%rcx
jne .Lj383
leaq FPC_EMPTYCHAR(%rip),%rcx
.Lj383:
call CCLASSES_$$_FPHASH$PCHAR$LONGINT$$LONGWORD
After:
movq %rcx,%rdx
testq %rcx,%rcx
je .Lj382
movq -8(%rcx),%rdx
jmp .Lj383
.Lj382:
leaq FPC_EMPTYCHAR(%rip),%rcx
.Lj383:
call CCLASSES_$$_FPHASH$PCHAR$LONGINT$$LONGWORD
In this case, if "je .Lj382" branches, then %rcx is definitely zero and
so control flow can safely skip straight to the LEA instruction, since
the "jne .Lj383" instruction will definitely not branch. On the other
hand, if "je .Lj382" does not branch, then %rcx is definitely non-zero
and so the second "testq %rcx,%rcx" is once again deterministic, meaning
"jne .Lj383" will definitely branch, therefore the second TEST
instruction can be removed completely and the conditional jump turned
into an unconditional jump. The end result is that the number of jumps
hasn't changed (exactly one jump will be taken regardless of the value
of %rcx), but the instruction count has been reduced (note that %rdx
can't be optimised out because its value is used as an actual parameter
for the CALL).
Kit
More information about the fpc-devel
mailing list