[fpc-devel] Prototype optimisation... Sliding Window
J. Gareth Moreton
gareth at moreton-family.com
Fri Feb 25 06:08:48 CET 2022
I did it!
After a good week of work and getting things wrong, I finally found a
solution that works nicely and is extensible, at least for x86. A bit
of refactoring and it can be ported to other platforms. I'm just
running the test suites to see if I can break things now. Honestly the
hard part was making sure all the registers were tracked properly.
https://gitlab.com/CuriousKit/optimisations/-/commits/sliding-window
Please feel free to test it and break it!
It will likely undergo adjustments over time to refactor bits and
pieces, add extra instructions (especially as it doesn't support SHRX
and SHLX yet, for example) and see if I can make the sliding window more
efficient - I increased it in size from 8 to 16 and then 32 entries,
since even at 16, some optimisations were missed in the RTL, but this
depends on a number of factors.
It's gotten some pretty good optimisations. On x86_64-win64 under -O4,
the three lazarus binaries... before:
lazarus.exe:
313,990,206
313,873,982
lazbuild.exe
59,704,391
59,725,895
startlazarus.exe
27,471,147
27,461,419
And the compiler itself, ppcx64.exe:
3,777,536
3,766,272
Remember that though I call the component a "sliding window", it's only
because it's very similar to how LZ77 finds start points for run-length
encoding, and the comments and variable names mention run-length
encoding as it scans sequences of identical instructions. However, at
no point is it actually performing data compression, and the end-result
is common subexpression elimination. All the space savings are from the
removal of redundant sequences of instructions, giving both a space and
a speed boost.
Almost every source file in the compiler and the RTL shows some kind of
improvement. A lot of them are just redundant pointer deallocations, so
this will help with cache misses and the like - that aside though, here
are a couple of my favourites... one from dbgdwarf -
TDebugInfoDwarf.appendsym_fieldvar_with_name_offset:
Before:
...
.Lj682:
leaq (,%r13,8),%rcx
movq 120(%rdi),%rax
cqto
idivq %rcx
imulq %r13,%rax
movq %rax,%r12
addq 56(%rbp),%r12
leaq (,%r13,8),%rcx
movq 120(%rdi),%rax
cqto
idivq %rcx
movq %rdx,%rsi
cmpb $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)
After:
...
.Lj682:
leaq (,%r13,8),%rcx
movq 120(%rdi),%rax
cqto
idivq %rcx
imulq %r13,%rax
movq %rax,%r12
addq 56(%rbp),%r12
movq %rdx,%rsi
cmpb $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)
This one has successfully removed an IDIV instruction because the
algorithm was able to detect that the subroutine wanted the remainder in
%edx, and it was still available from the first IDIV call because it
hadn't been overwritten, and neither %r13 nor %rdi had changed values so
the references are the same.
This one is from SysUtils' IsLeapYear function and is one I have
personally wanted to optimise further ever since I first saw its
disassembly after I implemented the fast "x mod const = 0" algorithm.
Before:
...
imulw $23593,%cx,%ax
rorw $2,%ax
cmpw $655,%ax
ja .Lj6979
imulw $23593,%cx,%ax
rorw $4,%ax
cmpw $163,%ax
setbeb %al
ret
.Lj6979:
...
After:
...
imulw $23593,%cx,%ax
rorw $2,%ax
cmpw $655,%ax
ja .Lj6979
rorw $2,%ax
cmpw $163,%ax
setbeb %al
ret
.Lj6979:
...
In this case, the RLE doesn't produce an exact match since the end
result of %ax is different, but the important point to note is that as
long as %cx doesn't change value (which it doesn't). the first sequence
can be transformed into the second sequence via a "rorw $2,%ax"
instruction, so the end result is that "imulw $23593,%cx,%ax; rorw
$4,%ax" is transmuted into "rorw $2,%ax" based on the previous value of
%ax, thus removing a multiplication.
Because this has been a mammoth undertaking and is quite a major
addition to the compiler, I'm going to hold off making a merge request
until I write a document on it, like I've done in the past with some of
the other larger optimisations I've made. The branch is available at
the link at the top of this e-mail though if anyone wants to take a look
at it.
One final note is that this optimisation is rather slow and specialist,
so much so that I've added a new optimizer switch named 'cs_opt_asmcse'
("Assembly CSE", to differentiate it from "Node CSE"); this is enabled
by default with -O3, and requires the peephole optimizer to also be
enabled in order to function.
Gareth aka. Kit
--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus
More information about the fpc-devel
mailing list