[fpc-devel] Optimisation challenge!
J. Gareth Moreton
gareth at moreton-family.com
Thu Dec 9 01:43:40 CET 2021
Hi everyone,
I've noticed a bit of a deep potential optimisation for faster code:
jne .Lj1806
movb $13,-40(%rbp)
cmpq $0,-32(%rbp)
je .Lj1798
...
.Lj1806:
movb $1,-40(%rbp)
.Lj1798:
cmpb $1,-40(%rbp)
jne .Lj1812
If you analyse the jumps and stack values carefully (which could be
considered registers for sake of simplicity), you can make an assumption
and shortcut a jump in the program flow:
- Prior to "je .Lj1798", the value of -40(%rbp) is set to 13.
- If the program flow jumps here, "cmp $1,-40(%rbp)" sets flags such
that "jne .Lj1812" will definitely branch, because -40(%rbp) is equal to
13 and not 1.
- Therefore, the original "je .Lj1798" branch can be redirected to "je
.Lj1812"
In other words:
jne .Lj1806
movb $13,-40(%rbp)
cmpq $0,-32(%rbp)
je .Lj1812
...
.Lj1806:
movb $1,-40(%rbp)
.Lj1798:
cmpb $1,-40(%rbp)
jne .Lj1812
This block behaves exactly the same way, but we've removed a comparison
and a conditional jump from the program flow, which will almost
certainly result in a speed increase due to fewer instructions being
executed and less pressure on the branch predictor.
The question and challenge here though... how could one implement this
kind of assembly-level data-flow analysis? Stopping at "je .Lj1812"
jumping to the label and then looking ahead to see "cmpb $1,-40(%rbp);
jne .Lj1812" is expensive enough, but then also looking backwards to
find a matching MOV instruction is even more expensive. Similarly,
looking forward from "movb $13,-40(%rbp)" and analysing every
conditional jump you come across is rather prohibitive too. If that is
hard enough, look at the block of code from aasmcpu.s with the ...
filled in:
movb $13,-40(%rbp)
cmpq $0,-32(%rbp)
je .Lj1798
movq -16(%rbp),%rcx
call AASMCPU$_$TAICPU_$__$$_INSEND$$LONGINT
movslq %eax,%rax
movq -8(%rbp),%rdx
movq 24(%rdx),%rdx
subq 56(%rdx),%rax
movzbl -63(%rbp),%edx
subq %rdx,%rax
subq %rax,-24(%rbp)
jmp .Lj1798
.p2align 4,,10
.p2align 3
.Lj1806:
movb $1,-40(%rbp)
.Lj1798:
cmpb $1,-40(%rbp)
jne .Lj1812
(Note that the "cmpb $1,-40(%rbp)" instruction is caused by an
optimisation that I'm currently developing and won't appear as such yet
- it currently appears as "movzbl -40(%rbp),%eax; cmpl $1,%eax")
The same logic applies to "jmp .Lj1798", in that -40(%rbp) is still
equal to 13 and can be redirected to "jmp .Lj1812" (assuming
AASMCPU$_$TAICPU_$__$$_INSEND$$LONGINT isn't malformed). In fact, for
this particular procedure, every jump to .Lj1798 sets -40(%rbp) to
something other than 1 almost immediately prior; it only gets set to 1
immediately after label .Lj1806, so logically, .Lj1798 could become a
dead label, and then "movb $1,-40(%rbp); cmpb $1,-40(%rbp)" are now
adjacent and deterministic, meaning the "jne .Lj1812" will never branch
and can be removed. But putting that aside for the moment, how can one
track the value of -40(%rbp) through all those instructions to the jump,
especially through a call?
Of course, it could probably be brute-forced, but that could easily be
on the order of O(n²) if not worse. Some conservative assumptions can
be made to terminate a search early, like if %rbp or %rsp (whichever is
applicable) changes value or a write to general memory occurs (unless
it's the stack), since the register that's dereferenced could easily be
made to equal the stack pointer.
Let theorycrafting begin!
Gareth aka. Kit
--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus
More information about the fpc-devel
mailing list