[fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt
J. Gareth Moreton
gareth at moreton-family.com
Tue Jan 4 01:06:52 CET 2022
Prepare for a lot of technical rambling!
This is just an analysis of the compilation of utf8lentest.lpr, not any
of the System units. Notably, POPCNT isn't called directly, but instead
goes through the System unit via "call fpc_popcnt_qword" on both 3.2.x
and 3.3.1. A future study of "fpc_popcnt_qword" might yield some
interesting information.
From what I've seen so far, "for i := 1 to cnt do" is implemented
differently, although not in a way that makes it slower, and there are a
few extra MOV instructions that shouldn't cause a speed decrease thanks
to parallel execution. Similarly, alignment hints now take the form of
".p2align 4,,10; .p2align 3" rather than ".balign 8,0x90", which aligns
code to a 16-byte boundary rather than an 8-byte boundary and uses
multi-byte NOP instructions rather than a large number of single-byte
NOP instructions (machine code 0x90), although there's every chance I
configured things incorrectly.
The difficult part is seeing through the different registers, since
3.3.1 is assigning registers differently to 3.2.x. I've attached the
two dumps if you want to compare them yourself (built under -O4).
From what I can observe, the performance loss seems to be due to
changes in code generation causing more pipeline stalls rather than any
particular peephole optimisation. There is room for improvement though
- for example, both compilers have snippets of code akin to the following:
movq %r12,%r10
shrq $8,%r10
movq $71777214294589695,%r11
andq %r11,%r10
movq %r12,%r11
movq $71777214294589695,%r13
andq %r13,%r11
leaq (%r10,%r11),%r12
In this situation, one can replace %r11 with %r13 in the 3rd and 4th
instructions so the massive constant isn't assigned twice:
movq %r12,%r10
shrq $8,%r10
movq $71777214294589695,%r13
andq %r13,%r10
movq %r12,%r11
andq %r13,%r11
leaq (%r10,%r11),%r12
And if %r13 isn't used again in its current form (in this particular
snippet, it isn't... it gets assigned a new constant a few instructions
later), one can do some clever rearranging:
movq %r12,%r10
shrq $8,%r10
movq $71777214294589695,%r11
andq %r11,%r10
andq %r12,%r11
leaq (%r10,%r11),%r12
Granted this is finding new peephole optimisations rather than fixing
the pipeline stalls, but I can see quite a few places where a
rearrangement of registers can produce smaller and more efficient code
by removing redundant transfers.
There is a slight inefficiency in a nested function prologue - under
3.2.x, it's this:
.section .text.n_p$program$_$testasmutf8length_$$_fin$00000008,"x"
.balign 16,0x90
P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008:
.seh_proc P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008
pushq %rbp
.seh_pushreg %rbp
movq %rcx,%rbp
leaq -32(%rsp),%rsp
.seh_stackalloc 32
.seh_endprologue
leaq -8(%rbp),%rcx
call fpc_ansistr_decr_ref
leaq -24(%rbp),%rcx
call fpc_ansistr_decr_ref
... whereas under 3.3.1...
.section .text.n_p$program$_$testasmutf8length_$$_fin$00000008,"ax"
.balign 16,0x90
.globl P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008
P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008:
.seh_proc P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008
pushq %rbp
.seh_pushreg %rbp
movq %rcx,%rbp
leaq -32(%rsp),%rsp
.seh_stackalloc 32
.seh_endprologue
subq $24,%rcx
call fpc_ansistr_decr_ref
leaq -8(%rbp),%rcx
call fpc_ansistr_decr_ref
In this case, the one in 3.2.x is slightly better and could be improved
by removing "movq %rcx,%rbp" completely, since the value is never used
and is not recorded in the SEH directives (also, the string objects are
decremented in the opposite order, with 3.2.x doing -8(%rbp) first, but
3.3.1 doing -24(%rbp) first instead). I think this may be due to me
putting in code in the peephole optimizer that avoids playing around
with the function prologue. (3.2.x doesn't have the .globl symbol though)
For actual reduction of pipeline stalls, I'm not quite sure how smart
the Intel CPU firmware is and if rearranging instructions would actually
improve throughput or not. For example, on 3.2.x, there are these four
instructions that configure parameters and preserve %rax:
movq %rax,%rsi
movl -12(%rbp),%edx
movl $2,%r8d
leaq -24(%rbp),%rcx
All's okay so far, but on 3.3.1, the ordering is different:
movq %rax,%rsi
movl -12(%rbp),%edx
leaq -24(%rbp),%rcx
movl $2,%r8d
If one assumes the instructions are executed in-order, there's a greater
risk of a pipeline stall if there's only one AGU (Address Generation
Unit) available, in which case, rearranging the instructions to the
following might produce better results in regards to port allocation:
movl -12(%rbp),%edx
movq %rax,%rsi
leaq -24(%rbp),%rcx
movl $2,%r8d
Other times, instruction rearrangement can aid the processor when it's
waiting for a register to be ready. For example:
movslq -12(%rbp),%rax
subq %rax,%rdx
leaq -24(%rbp),%rcx
movl $2,%r8d
Making the "subq %rax,%rdx" as late as possible will minimise the
pipeline stall:
movslq -12(%rbp),%rax
leaq -24(%rbp),%rcx
movl $2,%r8d
subq %rax,%rdx
Then one can do port rebalancing again like before:
movslq -12(%rbp),%rax
movl $2,%r8d
leaq -24(%rbp),%rcx
subq %rax,%rdx
---
TL;DR: The code itself doesn't appear to be any worse than before, but
different register assignments seem to create more redundant MOV
instructions. This one will be interesting to experiment on. I may
have to develop a register renaming system for the peephole optimizer
and a means of deallocating registers when things get moved around
(generally, deallocating registers is a little unsafe, so care has to be
taken). Also, POPCNT is handled in the System unit and is not inlined,
mainly because of how an input of zero is handled (the function has to
return 255 in the result, whereas Intel's POPCNT instruction sets the
zero flag and leaves the result undefined).
Gareth aka. Kit
--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus
-------------- next part --------------
A non-text attachment was scrubbed...
Name: utf8lentest-comparison.zip
Type: application/x-zip-compressed
Size: 11849 bytes
Desc: not available
URL: <http://lists.freepascal.org/pipermail/fpc-devel/attachments/20220104/c48d8b75/attachment-0001.bin>
More information about the fpc-devel
mailing list