[fpc-devel] Prototype optimisation... Sliding Window

Fri Mar 11 14:33:49 CET 2022

So far I haven't managed to do very accurate timing measurements - I'll 
try to resolve that and get more accurate figures, especially as I'm 
still writing up the whitepaper explaining everything before I make a 
merge request.  Compilation time I'm not hugely concerned about since 
this sliding window feature is only enabled on -O3 and higher, but the 
faster I can get it, the better, especially as this family of Pascal 
compilers pride themselves on compilation speed.

Gareth aka. Kit

On 11/03/2022 12:43, Flávio Etrusco via fpc-devel wrote:
> Kudos! In the end, did it make much of a difference in the compilation time?
>
> Em sex., 25 de fev. de 2022 às 02:08, J. Gareth Moreton via fpc-devel
> <fpc-devel at lists.freepascal.org> escreveu:
>> I did it!
>>
>> After a good week of work and getting things wrong, I finally found a
>> solution that works nicely and is extensible, at least for x86.  A bit
>> of refactoring and it can be ported to other platforms.  I'm just
>> running the test suites to see if I can break things now.  Honestly the
>> hard part was making sure all the registers were tracked properly.
>>
>> https://gitlab.com/CuriousKit/optimisations/-/commits/sliding-window
>>
>> Please feel free to test it and break it!
>>
>> It will likely undergo adjustments over time to refactor bits and
>> pieces, add extra instructions (especially as it doesn't support SHRX
>> and SHLX yet, for example) and see if I can make the sliding window more
>> efficient - I increased it in size from 8 to 16 and then 32 entries,
>> since even at 16, some optimisations were missed in the RTL, but this
>> depends on a number of factors.
>>
>> It's gotten some pretty good optimisations.  On x86_64-win64 under -O4,
>> the three lazarus binaries... before:
>>
>> lazarus.exe:
>>
>> 313,990,206
>> 313,873,982
>>
>> lazbuild.exe
>>
>> 59,704,391
>> 59,725,895
>>
>> startlazarus.exe
>>
>> 27,471,147
>> 27,461,419
>>
>> And the compiler itself, ppcx64.exe:
>>
>> 3,777,536
>> 3,766,272
>>
>> Remember that though I call the component a "sliding window", it's only
>> because it's very similar to how LZ77 finds start points for run-length
>> encoding, and the comments and variable names mention run-length
>> encoding as it scans sequences of identical instructions.  However, at
>> no point is it actually performing data compression, and the end-result
>> is common subexpression elimination.  All the space savings are from the
>> removal of redundant sequences of instructions, giving both a space and
>> a speed boost.
>>
>> Almost every source file in the compiler and the RTL shows some kind of
>> improvement.  A lot of them are just redundant pointer deallocations, so
>> this will help with cache misses and the like - that aside though, here
>> are a couple of my favourites... one from dbgdwarf -
>> TDebugInfoDwarf.appendsym_fieldvar_with_name_offset:
>>
>> Before:
>>
>>       ...
>> .Lj682:
>>       leaq    (,%r13,8),%rcx
>>       movq    120(%rdi),%rax
>>       cqto
>>       idivq    %rcx
>>       imulq    %r13,%rax
>>       movq    %rax,%r12
>>       addq    56(%rbp),%r12
>>       leaq    (,%r13,8),%rcx
>>       movq    120(%rdi),%rax
>>       cqto
>>       idivq    %rcx
>>       movq    %rdx,%rsi
>>       cmpb    $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)
>>
>> After:
>>
>>       ...
>> .Lj682:
>>       leaq    (,%r13,8),%rcx
>>       movq    120(%rdi),%rax
>>       cqto
>>       idivq    %rcx
>>       imulq    %r13,%rax
>>       movq    %rax,%r12
>>       addq    56(%rbp),%r12
>>       movq    %rdx,%rsi
>>       cmpb    $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)
>>
>> This one has successfully removed an IDIV instruction because the
>> algorithm was able to detect that the subroutine wanted the remainder in
>> %edx, and it was still available from the first IDIV call because it
>> hadn't been overwritten, and neither %r13 nor %rdi had changed values so
>> the references are the same.
>>
>> This one is from SysUtils' IsLeapYear function and is one I have
>> personally wanted to optimise further ever since I first saw its
>> disassembly after I implemented the fast "x mod const = 0" algorithm.
>>
>> Before:
>>
>>       ...
>>       imulw    $23593,%cx,%ax
>>       rorw    $2,%ax
>>       cmpw    $655,%ax
>>       ja    .Lj6979
>>       imulw    $23593,%cx,%ax
>>       rorw    $4,%ax
>>       cmpw    $163,%ax
>>       setbeb    %al
>>       ret
>> .Lj6979:
>>       ...
>>
>> After:
>>
>>       ...
>>       imulw    $23593,%cx,%ax
>>       rorw    $2,%ax
>>       cmpw    $655,%ax
>>       ja    .Lj6979
>>       rorw    $2,%ax
>>       cmpw    $163,%ax
>>       setbeb    %al
>>       ret
>> .Lj6979:
>>       ...
>>
>> In this case, the RLE doesn't produce an exact match since the end
>> result of %ax is different, but the important point to note is that as
>> long as %cx doesn't change value (which it doesn't). the first sequence
>> can be transformed into the second sequence via a "rorw $2,%ax"
>> instruction, so the end result is that "imulw $23593,%cx,%ax; rorw
>> $4,%ax" is transmuted into "rorw $2,%ax" based on the previous value of
>> %ax, thus removing a multiplication.
>>
>> Because this has been a mammoth undertaking and is quite a major
>> addition to the compiler, I'm going to hold off making a merge request
>> until I write a document on it, like I've done in the past with some of
>> the other larger optimisations I've made.  The branch is available at
>> the link at the top of this e-mail though if anyone wants to take a look
>> at it.
>>
>> One final note is that this optimisation is rather slow and specialist,
>> so much so that I've added a new optimizer switch named 'cs_opt_asmcse'
>> ("Assembly CSE", to differentiate it from "Node CSE"); this is enabled
>> by default with -O3, and requires the peephole optimizer to also be
>> enabled in order to function.
>>
>> Gareth aka. Kit
>>
>>
>> --
>> This email has been checked for viruses by Avast antivirus software.
>> https://www.avast.com/antivirus
>>
>> _______________________________________________
>> fpc-devel maillist  -  fpc-devel at lists.freepascal.org
>> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
> _______________________________________________
> fpc-devel maillist  -  fpc-devel at lists.freepascal.org
> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
>

-- 
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus