[fpc-devel] Optimisation challenge!

Thu Dec 9 01:43:40 CET 2021

Hi everyone,

I've noticed a bit of a deep potential optimisation for faster code:

     jne     .Lj1806
     movb    $13,-40(%rbp)
     cmpq    $0,-32(%rbp)
     je      .Lj1798
     ...
.Lj1806:
     movb    $1,-40(%rbp)
.Lj1798:
     cmpb    $1,-40(%rbp)
     jne     .Lj1812

If you analyse the jumps and stack values carefully (which could be 
considered registers for sake of simplicity), you can make an assumption 
and shortcut a jump in the program flow:

- Prior to "je .Lj1798", the value of -40(%rbp) is set to 13.
- If the program flow jumps here, "cmp $1,-40(%rbp)" sets flags such 
that "jne .Lj1812" will definitely branch, because -40(%rbp) is equal to 
13 and not 1.
- Therefore, the original "je .Lj1798" branch can be redirected to "je 
.Lj1812"

In other words:

     jne     .Lj1806
     movb    $13,-40(%rbp)
     cmpq    $0,-32(%rbp)
     je      .Lj1812
     ...
.Lj1806:
     movb    $1,-40(%rbp)
.Lj1798:
     cmpb    $1,-40(%rbp)
     jne     .Lj1812

This block behaves exactly the same way, but we've removed a comparison 
and a conditional jump from the program flow, which will almost 
certainly result in a speed increase due to fewer instructions being 
executed and less pressure on the branch predictor.

The question and challenge here though... how could one implement this 
kind of assembly-level data-flow analysis? Stopping at "je .Lj1812" 
jumping to the label and then looking ahead to see "cmpb $1,-40(%rbp); 
jne .Lj1812" is expensive enough, but then also looking backwards to 
find a matching MOV instruction is even more expensive.  Similarly, 
looking forward from "movb $13,-40(%rbp)" and analysing every 
conditional jump you come across is rather prohibitive too.  If that is 
hard enough, look at the block of code from aasmcpu.s with the ... 
filled in:

     movb    $13,-40(%rbp)
     cmpq    $0,-32(%rbp)
     je      .Lj1798
     movq    -16(%rbp),%rcx
     call    AASMCPU$_$TAICPU_$__$$_INSEND$$LONGINT
     movslq  %eax,%rax
     movq    -8(%rbp),%rdx
     movq    24(%rdx),%rdx
     subq    56(%rdx),%rax
     movzbl  -63(%rbp),%edx
     subq    %rdx,%rax
     subq    %rax,-24(%rbp)
     jmp     .Lj1798
     .p2align 4,,10
     .p2align 3
.Lj1806:
     movb    $1,-40(%rbp)
.Lj1798:
     cmpb    $1,-40(%rbp)
     jne     .Lj1812

(Note that the "cmpb $1,-40(%rbp)" instruction is caused by an 
optimisation that I'm currently developing and won't appear as such yet 
- it currently appears as "movzbl -40(%rbp),%eax; cmpl $1,%eax")

The same logic applies to "jmp .Lj1798", in that -40(%rbp) is still 
equal to 13 and can be redirected to "jmp .Lj1812" (assuming 
AASMCPU$_$TAICPU_$__$$_INSEND$$LONGINT isn't malformed).  In fact, for 
this particular procedure, every jump to .Lj1798 sets -40(%rbp) to 
something other than 1 almost immediately prior; it only gets set to 1 
immediately after label .Lj1806, so logically, .Lj1798 could become a 
dead label, and then "movb $1,-40(%rbp); cmpb $1,-40(%rbp)" are now 
adjacent and deterministic, meaning the "jne .Lj1812" will never branch 
and can be removed.  But putting that aside for the moment, how can one 
track the value of -40(%rbp) through all those instructions to the jump, 
especially through a call?

Of course, it could probably be brute-forced, but that could easily be 
on the order of O(n²) if not worse.  Some conservative assumptions can 
be made to terminate a search early, like if %rbp or %rsp (whichever is 
applicable) changes value or a write to general memory occurs (unless 
it's the stack), since the register that's dereferenced could easily be 
made to equal the stack pointer.

Let theorycrafting begin!

Gareth aka. Kit

-- 
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus