[fpc-devel] Prototype optimisation... Sliding Window

Fri Feb 25 06:08:48 CET 2022

I did it!

After a good week of work and getting things wrong, I finally found a 
solution that works nicely and is extensible, at least for x86.  A bit 
of refactoring and it can be ported to other platforms.  I'm just 
running the test suites to see if I can break things now.  Honestly the 
hard part was making sure all the registers were tracked properly.

https://gitlab.com/CuriousKit/optimisations/-/commits/sliding-window

Please feel free to test it and break it!

It will likely undergo adjustments over time to refactor bits and 
pieces, add extra instructions (especially as it doesn't support SHRX 
and SHLX yet, for example) and see if I can make the sliding window more 
efficient - I increased it in size from 8 to 16 and then 32 entries, 
since even at 16, some optimisations were missed in the RTL, but this 
depends on a number of factors.

It's gotten some pretty good optimisations.  On x86_64-win64 under -O4, 
the three lazarus binaries... before:

lazarus.exe:

313,990,206
313,873,982

lazbuild.exe

59,704,391
59,725,895

startlazarus.exe

27,471,147
27,461,419

And the compiler itself, ppcx64.exe:

3,777,536
3,766,272

Remember that though I call the component a "sliding window", it's only 
because it's very similar to how LZ77 finds start points for run-length 
encoding, and the comments and variable names mention run-length 
encoding as it scans sequences of identical instructions.  However, at 
no point is it actually performing data compression, and the end-result 
is common subexpression elimination.  All the space savings are from the 
removal of redundant sequences of instructions, giving both a space and 
a speed boost.

Almost every source file in the compiler and the RTL shows some kind of 
improvement.  A lot of them are just redundant pointer deallocations, so 
this will help with cache misses and the like - that aside though, here 
are a couple of my favourites... one from dbgdwarf - 
TDebugInfoDwarf.appendsym_fieldvar_with_name_offset:

Before:

     ...
.Lj682:
     leaq    (,%r13,8),%rcx
     movq    120(%rdi),%rax
     cqto
     idivq    %rcx
     imulq    %r13,%rax
     movq    %rax,%r12
     addq    56(%rbp),%r12
     leaq    (,%r13,8),%rcx
     movq    120(%rdi),%rax
     cqto
     idivq    %rcx
     movq    %rdx,%rsi
     cmpb    $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)

After:

     ...
.Lj682:
     leaq    (,%r13,8),%rcx
     movq    120(%rdi),%rax
     cqto
     idivq    %rcx
     imulq    %r13,%rax
     movq    %rax,%r12
     addq    56(%rbp),%r12
     movq    %rdx,%rsi
     cmpb    $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)

This one has successfully removed an IDIV instruction because the 
algorithm was able to detect that the subroutine wanted the remainder in 
%edx, and it was still available from the first IDIV call because it 
hadn't been overwritten, and neither %r13 nor %rdi had changed values so 
the references are the same.

This one is from SysUtils' IsLeapYear function and is one I have 
personally wanted to optimise further ever since I first saw its 
disassembly after I implemented the fast "x mod const = 0" algorithm.

Before:

     ...
     imulw    $23593,%cx,%ax
     rorw    $2,%ax
     cmpw    $655,%ax
     ja    .Lj6979
     imulw    $23593,%cx,%ax
     rorw    $4,%ax
     cmpw    $163,%ax
     setbeb    %al
     ret
.Lj6979:
     ...

After:

     ...
     imulw    $23593,%cx,%ax
     rorw    $2,%ax
     cmpw    $655,%ax
     ja    .Lj6979
     rorw    $2,%ax
     cmpw    $163,%ax
     setbeb    %al
     ret
.Lj6979:
     ...

In this case, the RLE doesn't produce an exact match since the end 
result of %ax is different, but the important point to note is that as 
long as %cx doesn't change value (which it doesn't). the first sequence 
can be transformed into the second sequence via a "rorw $2,%ax" 
instruction, so the end result is that "imulw $23593,%cx,%ax; rorw 
$4,%ax" is transmuted into "rorw $2,%ax" based on the previous value of 
%ax, thus removing a multiplication.

Because this has been a mammoth undertaking and is quite a major 
addition to the compiler, I'm going to hold off making a merge request 
until I write a document on it, like I've done in the past with some of 
the other larger optimisations I've made.  The branch is available at 
the link at the top of this e-mail though if anyone wants to take a look 
at it.

One final note is that this optimisation is rather slow and specialist, 
so much so that I've added a new optimizer switch named 'cs_opt_asmcse' 
("Assembly CSE", to differentiate it from "Node CSE"); this is enabled 
by default with -O3, and requires the peephole optimizer to also be 
enabled in order to function.

Gareth aka. Kit

-- 
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus