[fpc-devel] Prototype optimisation... Sliding Window

Fri Mar 11 13:43:27 CET 2022

Kudos! In the end, did it make much of a difference in the compilation time?

Em sex., 25 de fev. de 2022 às 02:08, J. Gareth Moreton via fpc-devel
<fpc-devel at lists.freepascal.org> escreveu:
>
> I did it!
>
> After a good week of work and getting things wrong, I finally found a
> solution that works nicely and is extensible, at least for x86.  A bit
> of refactoring and it can be ported to other platforms.  I'm just
> running the test suites to see if I can break things now.  Honestly the
> hard part was making sure all the registers were tracked properly.
>
> https://gitlab.com/CuriousKit/optimisations/-/commits/sliding-window
>
> Please feel free to test it and break it!
>
> It will likely undergo adjustments over time to refactor bits and
> pieces, add extra instructions (especially as it doesn't support SHRX
> and SHLX yet, for example) and see if I can make the sliding window more
> efficient - I increased it in size from 8 to 16 and then 32 entries,
> since even at 16, some optimisations were missed in the RTL, but this
> depends on a number of factors.
>
> It's gotten some pretty good optimisations.  On x86_64-win64 under -O4,
> the three lazarus binaries... before:
>
> lazarus.exe:
>
> 313,990,206
> 313,873,982
>
> lazbuild.exe
>
> 59,704,391
> 59,725,895
>
> startlazarus.exe
>
> 27,471,147
> 27,461,419
>
> And the compiler itself, ppcx64.exe:
>
> 3,777,536
> 3,766,272
>
> Remember that though I call the component a "sliding window", it's only
> because it's very similar to how LZ77 finds start points for run-length
> encoding, and the comments and variable names mention run-length
> encoding as it scans sequences of identical instructions.  However, at
> no point is it actually performing data compression, and the end-result
> is common subexpression elimination.  All the space savings are from the
> removal of redundant sequences of instructions, giving both a space and
> a speed boost.
>
> Almost every source file in the compiler and the RTL shows some kind of
> improvement.  A lot of them are just redundant pointer deallocations, so
> this will help with cache misses and the like - that aside though, here
> are a couple of my favourites... one from dbgdwarf -
> TDebugInfoDwarf.appendsym_fieldvar_with_name_offset:
>
> Before:
>
>      ...
> .Lj682:
>      leaq    (,%r13,8),%rcx
>      movq    120(%rdi),%rax
>      cqto
>      idivq    %rcx
>      imulq    %r13,%rax
>      movq    %rax,%r12
>      addq    56(%rbp),%r12
>      leaq    (,%r13,8),%rcx
>      movq    120(%rdi),%rax
>      cqto
>      idivq    %rcx
>      movq    %rdx,%rsi
>      cmpb    $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)
>
> After:
>
>      ...
> .Lj682:
>      leaq    (,%r13,8),%rcx
>      movq    120(%rdi),%rax
>      cqto
>      idivq    %rcx
>      imulq    %r13,%rax
>      movq    %rax,%r12
>      addq    56(%rbp),%r12
>      movq    %rdx,%rsi
>      cmpb    $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)
>
> This one has successfully removed an IDIV instruction because the
> algorithm was able to detect that the subroutine wanted the remainder in
> %edx, and it was still available from the first IDIV call because it
> hadn't been overwritten, and neither %r13 nor %rdi had changed values so
> the references are the same.
>
> This one is from SysUtils' IsLeapYear function and is one I have
> personally wanted to optimise further ever since I first saw its
> disassembly after I implemented the fast "x mod const = 0" algorithm.
>
> Before:
>
>      ...
>      imulw    $23593,%cx,%ax
>      rorw    $2,%ax
>      cmpw    $655,%ax
>      ja    .Lj6979
>      imulw    $23593,%cx,%ax
>      rorw    $4,%ax
>      cmpw    $163,%ax
>      setbeb    %al
>      ret
> .Lj6979:
>      ...
>
> After:
>
>      ...
>      imulw    $23593,%cx,%ax
>      rorw    $2,%ax
>      cmpw    $655,%ax
>      ja    .Lj6979
>      rorw    $2,%ax
>      cmpw    $163,%ax
>      setbeb    %al
>      ret
> .Lj6979:
>      ...
>
> In this case, the RLE doesn't produce an exact match since the end
> result of %ax is different, but the important point to note is that as
> long as %cx doesn't change value (which it doesn't). the first sequence
> can be transformed into the second sequence via a "rorw $2,%ax"
> instruction, so the end result is that "imulw $23593,%cx,%ax; rorw
> $4,%ax" is transmuted into "rorw $2,%ax" based on the previous value of
> %ax, thus removing a multiplication.
>
> Because this has been a mammoth undertaking and is quite a major
> addition to the compiler, I'm going to hold off making a merge request
> until I write a document on it, like I've done in the past with some of
> the other larger optimisations I've made.  The branch is available at
> the link at the top of this e-mail though if anyone wants to take a look
> at it.
>
> One final note is that this optimisation is rather slow and specialist,
> so much so that I've added a new optimizer switch named 'cs_opt_asmcse'
> ("Assembly CSE", to differentiate it from "Node CSE"); this is enabled
> by default with -O3, and requires the peephole optimizer to also be
> enabled in order to function.
>
> Gareth aka. Kit
>
>
> --
> This email has been checked for viruses by Avast antivirus software.
> https://www.avast.com/antivirus
>
> _______________________________________________
> fpc-devel maillist  -  fpc-devel at lists.freepascal.org
> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel