[fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

J. Gareth Moreton gareth at moreton-family.com
Tue Jan 4 01:06:52 CET 2022


Prepare for a lot of technical rambling!

This is just an analysis of the compilation of utf8lentest.lpr, not any 
of the System units.  Notably, POPCNT isn't called directly, but instead 
goes through the System unit via "call fpc_popcnt_qword" on both 3.2.x 
and 3.3.1.  A future study of "fpc_popcnt_qword" might yield some 
interesting information.
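
For reference, a call of that shape can be reproduced with something as
small as the following - this is my own minimal sketch (hypothetical
file name, not taken from utf8lentest.lpr), just to show the kind of
PopCnt call that ends up going through the System unit:

    program popcnttest;  { hypothetical name, not part of the test program }

    var
      x: QWord;
    begin
      x := $0123456789ABCDEF;
      { PopCnt comes from the System unit; per the dumps above, calls
        like this are compiled as "call fpc_popcnt_qword" rather than
        an inline popcnt instruction, on both 3.2.x and 3.3.1. }
      WriteLn(PopCnt(x));
    end.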

From what I've seen so far, "for i := 1 to cnt do" is implemented 
differently, although not in a way that makes it slower, and there are a 
few extra MOV instructions that shouldn't cause a speed decrease thanks 
to parallel execution.  Similarly, alignment hints now take the form of 
".p2align 4,,10; .p2align 3" rather than ".balign 8,0x90": code is now 
aligned to a 16-byte boundary rather than an 8-byte one, and the padding 
uses multi-byte NOP instructions rather than a long run of single-byte 
NOPs (machine code 0x90) - although there's every chance I configured 
things incorrectly.

The difficult part is seeing past the different register choices, since 
3.3.1 assigns registers differently from 3.2.x.  I've attached the 
two dumps if you want to compare them yourself (both built with -O4).

From what I can observe, the performance loss seems to be due to 
changes in code generation causing more pipeline stalls rather than any 
particular peephole optimisation.  There is room for improvement though 
- for example, both compilers have snippets of code akin to the following:

     movq    %r12,%r10
     shrq    $8,%r10
     movq    $71777214294589695,%r11
     andq    %r11,%r10
     movq    %r12,%r11
     movq    $71777214294589695,%r13
     andq    %r13,%r11
     leaq    (%r10,%r11),%r12

In this situation, one can replace %r11 with %r13 in the 3rd and 4th 
instructions, which makes the second load of the massive constant 
redundant so it can be dropped:

     movq    %r12,%r10
     shrq    $8,%r10
     movq    $71777214294589695,%r13
     andq    %r13,%r10
     movq    %r12,%r11
     andq    %r13,%r11
     leaq    (%r10,%r11),%r12

And if %r13 isn't needed again with its current value (in this 
particular snippet, it isn't... it gets assigned a new constant a few 
instructions later), one can rearrange further and drop %r13 
altogether, ANDing %r12 directly into %r11:

     movq    %r12,%r10
     shrq    $8,%r10
     movq    $71777214294589695,%r11
     andq    %r11,%r10
     andq    %r12,%r11
     leaq    (%r10,%r11),%r12

Granted this is finding new peephole optimisations rather than fixing 
the pipeline stalls, but I can see quite a few places where a 
rearrangement of registers can produce smaller and more efficient code 
by removing redundant transfers.
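
As an aside, 71777214294589695 is $00FF00FF00FF00FF, and the 
shift/mask/add pattern above matches the byte-pair summation step of a 
SWAR-style bit count, so it presumably comes from one of the 
hand-written counting routines being benchmarked.  A rough Pascal 
sketch of that step (my own reconstruction, not the actual source):

    { Reconstruction only - not taken from utf8lentest.lpr.  Adds each
      pair of adjacent byte counters into a 16-bit lane, which is what
      the shr/and/and/lea sequence above computes. }
    function SumBytePairs(x: QWord): QWord; inline;
    const
      ByteMask = QWord($00FF00FF00FF00FF);  { = 71777214294589695 }
    begin
      Result := ((x shr 8) and ByteMask) + (x and ByteMask);
    end;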

There is a slight inefficiency in a nested function prologue - under 
3.2.x, it's this:

.section .text.n_p$program$_$testasmutf8length_$$_fin$00000008,"x"
     .balign 16,0x90
P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008:
.seh_proc P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008
     pushq    %rbp
.seh_pushreg %rbp
     movq    %rcx,%rbp
     leaq    -32(%rsp),%rsp
.seh_stackalloc 32
.seh_endprologue
     leaq    -8(%rbp),%rcx
     call    fpc_ansistr_decr_ref
     leaq    -24(%rbp),%rcx
     call    fpc_ansistr_decr_ref

... whereas under 3.3.1...

.section .text.n_p$program$_$testasmutf8length_$$_fin$00000008,"ax"
     .balign 16,0x90
.globl    P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008
P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008:
.seh_proc P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008
     pushq    %rbp
.seh_pushreg %rbp
     movq    %rcx,%rbp
     leaq    -32(%rsp),%rsp
.seh_stackalloc 32
.seh_endprologue
     subq    $24,%rcx
     call    fpc_ansistr_decr_ref
     leaq    -8(%rbp),%rcx
     call    fpc_ansistr_decr_ref

In this case, the one in 3.2.x is slightly better, and it could be 
improved further by removing "movq %rcx,%rbp" completely, since the 
value is never used and is not recorded in the SEH directives (also, 
the string objects are decremented in the opposite order: 3.2.x does 
-8(%rbp) first, while 3.3.1 does -24(%rbp) first).  I think this may 
be because I added code to the peephole optimizer that avoids touching 
the function prologue.  (3.2.x doesn't emit the .globl directive, though.)

For actual reduction of pipeline stalls, I'm not quite sure how smart 
the out-of-order execution on modern Intel CPUs is, and whether 
rearranging instructions would actually improve throughput.  For 
example, on 3.2.x, there are these four instructions that set up the 
call parameters and preserve %rax:

     movq    %rax,%rsi
     movl    -12(%rbp),%edx
     movl    $2,%r8d
     leaq    -24(%rbp),%rcx

All's okay so far, but on 3.3.1, the ordering is different:

     movq    %rax,%rsi
     movl    -12(%rbp),%edx
     leaq    -24(%rbp),%rcx
     movl    $2,%r8d

If one assumes the instructions are executed in-order, there's a greater 
risk of a pipeline stall if only one AGU (Address Generation Unit) is 
available, in which case rearranging the instructions as follows might 
produce better results in terms of port allocation:

     movl    -12(%rbp),%edx
     movq    %rax,%rsi
     leaq    -24(%rbp),%rcx
     movl    $2,%r8d

Other times, instruction rearrangement can aid the processor when it's 
waiting for a register to be ready.  For example:

     movslq    -12(%rbp),%rax
     subq    %rax,%rdx
     leaq    -24(%rbp),%rcx
     movl    $2,%r8d

Scheduling the "subq %rax,%rdx" as late as possible minimises the 
pipeline stall, since the subtraction has to wait for the load into 
%rax to complete:

     movslq    -12(%rbp),%rax
     leaq    -24(%rbp),%rcx
     movl    $2,%r8d
     subq    %rax,%rdx

Then one can do port rebalancing again like before:

     movslq    -12(%rbp),%rax
     movl    $2,%r8d
     leaq    -24(%rbp),%rcx
     subq    %rax,%rdx

---

TL;DR: The code itself doesn't appear to be any worse than before, but 
different register assignments seem to create more redundant MOV 
instructions.  This one will be interesting to experiment on.  I may 
have to develop a register renaming system for the peephole optimizer 
and a means of deallocating registers when things get moved around 
(generally, deallocating registers is a little unsafe, so care has to be 
taken).  Also, POPCNT is handled in the System unit and is not inlined, 
mainly because of how an input of zero is handled (the function has to 
return 255 in the result, whereas Intel's POPCNT instruction sets the 
zero flag and leaves the result undefined).

Gareth aka. Kit


-------------- next part --------------
A non-text attachment was scrubbed...
Name: utf8lentest-comparison.zip
Type: application/x-zip-compressed
Size: 11849 bytes
Desc: not available
URL: <http://lists.freepascal.org/pipermail/fpc-devel/attachments/20220104/c48d8b75/attachment-0001.bin>

