[fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

Tue Jan 4 10:31:05 CET 2022

On 4-1-2022 01:06, J. Gareth Moreton via fpc-devel wrote:
> Prepare for a lot of technical rambling!
>
> This is just an analysis of the compilation of utf8lentest.lpr, not 
> any of the System units.  Notably, POPCNT isn't called directly, but 
> instead goes through the System unit via "call fpc_popcnt_qword" on 
> both 3.2.x and 3.3.1.  A future study of "fpc_popcnt_qword" might 
> yield some interesting information.

Did you pass -Cpcoreavx    ?

# [381] Result += PopCnt(qword(nx));
     popcntq    %r10,%r8
     addq    %r8,%rbx

>
> In this situation, one can replace %r11 with %r13 in the 3rd and 4th 
> instructions so the massive constant isn't assigned twice:

The large constant(s) should be hoisted out of the loop. This is done in 
the assembler code:

.Lj11:
     movq    (%rcx),%r8
     movq    %r8,%r10
     andq    %rdx,%r10
     notq    %r8
     shrq    $7,%r10
     shrq    $6,%r8
     andq    %r8,%r10
     subq    %r8,%rax               // subtract from rax=length(s). Iow 
algorithm is slightly different. don't sum

                                                 // correction but 
directly subtract from length. Saves bytecount from being a live value 
till the end

      popcntq    %r10,%r8
     addq    $8,%rcx
     decq    %r11
     jne    .Lj11

As far as I can see the assembler version is about 25-33% faster for low 
counts (40 bytes) . For larger strings, the SSE2 code kicks in and that 
is a solid factor 2.  (270 vs 538 for the fastest "add", see table in 
the comments of the source)

Note that I use pop/fast because they are simpler and I only use for low 
counts. The SSE2 version is more like the ADD variant.

> For actual reduction of pipeline stalls, I'm not quite sure how smart 
> the Intel CPU firmware is and if rearranging instructions would 
> actually improve throughput or not.  For example, on 3.2.x, there are 
> these four instructions that configure parameters and preserve %rax:
>
I tried for above code on godbolt, attached is the analysis. (both for 
the asm loop, the popcnt loop from my code and the popcnt with the 
movabs removed to test hoisting)

Note my 3.3.1 (from 29 dec) doesn't seem to generate two loads. With my 
limited assembler abilities I'd say it is not the hoisting that makes 
the manual asm loop faster as I first thought, but moving the NOT up to 
where it is.

> Also, POPCNT is handled in the System unit and is not inlined, mainly 
> because of how an input of zero is handled (the function has to return 
> 255 in the result, whereas Intel's POPCNT instruction sets the zero 
> flag and leaves the result undefined).

Weird as mine is inlined with -Cpcoreavx -O4, with no special handling 
for 0. But that does put some things on shaky ground. Maybe zero the 
result before hand?

-------------- next part --------------
manual asm :
--------------------------------------

.Lj47:
		add	rsi,1
		mov	r8,qword [rcx]
		mov	r10,r8
		mov	rdi,-9187201950435737472
		and	r10,rdi
		shr	r10,7
		not	r8
		shr	r8,6
		and	r10,r8
		mov	r11,r10
		popcnt	r8,r10
		add	rbx,r8
		add	rcx,8
		cmp	rsi,r9
		jnge	.Lj47

llvm-mca: -mcpu=ivybridge

Iterations:        100
Instructions:      1200
Total Cycles:      378
Total uOps:        1200

Dispatch Width:    4
uOps Per Cycle:    3.17
IPC:               3.17
Block RThroughput: 3.0

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      5     0.50    *                   mov   r8, qword ptr [rcx]
 1      1     0.33                        mov   r10, r8
 1      1     0.33                        and   r10, rdx
 1      1     0.33                        not   r8
 1      1     0.50                        shr   r10, 7
 1      1     0.50                        shr   r8, 6
 1      1     0.33                        and   r10, r8
 1      1     0.33                        sub   rax, r8
 1      3     1.00                        popcnt        r8, r10
 1      1     0.33                        add   rcx, 8
 1      1     0.33                        dec   r11
 1      1     1.00                        jne   .Lrestqwords

Resources:
[0]   - SBDivider
[1]   - SBFPDivider
[2]   - SBPort0
[3]   - SBPort1
[4]   - SBPort4
[5]   - SBPort5
[6.0] - SBPort23
[6.1] - SBPort23

Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6.0]  [6.1]  
 -      -     3.66   3.67    -     3.67   0.50   0.50   

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6.0]  [6.1]  Instructions:
 -      -      -      -      -      -     0.50   0.50   mov     r8, qword ptr [rcx]
 -      -     0.21   0.63    -     0.16    -      -     mov     r10, r8
 -      -     0.35   0.61    -     0.04    -      -     and     r10, rdx
 -      -     0.50   0.16    -     0.34    -      -     not     r8
 -      -     0.52    -      -     0.48    -      -     shr     r10, 7
 -      -     0.51    -      -     0.49    -      -     shr     r8, 6
 -      -     0.87   0.08    -     0.05    -      -     and     r10, r8
 -      -     0.03   0.10    -     0.87    -      -     sub     rax, r8
 -      -      -     1.00    -      -      -      -     popcnt  r8, r10
 -      -     0.47   0.36    -     0.17    -      -     add     rcx, 8
 -      -     0.20   0.73    -     0.07    -      -     dec     r11
 -      -      -      -      -     1.00    -      -     jne     .Lrestqwords

popcnt  trunk -O4 -Cpcoreavx

.Lj47:
		add	rsi,1
		mov	r8,qword ptr [rcx]
		mov	r10,r8
		mov	rdi,-9187201950435737472
		and	r10,rdi
		shr	r10,7
		not	r8
		shr	r8,6
		and	r10,r8
		mov	r11,r10
		popcnt	r8,r10
		add	rbx,r8
		add	rcx,8
		cmp	rsi,r9
		jnge	.Lj47

llvm-mca:

Iterations:        100
Instructions:      1500
Total Cycles:      477
Total uOps:        1500

Dispatch Width:    4
uOps Per Cycle:    3.14
IPC:               3.14
Block RThroughput: 3.8

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.33                        add   rsi, 1
 1      5     0.50    *                   mov   r8, qword ptr [rcx]
 1      1     0.33                        mov   r10, r8
 1      1     0.33                        movabs        rdi, -9187201950435737472
 1      1     0.33                        and   r10, rdi
 1      1     0.50                        shr   r10, 7
 1      1     0.33                        not   r8
 1      1     0.50                        shr   r8, 6
 1      1     0.33                        and   r10, r8
 1      1     0.33                        mov   r11, r10
 1      3     1.00                        popcnt        r8, r10
 1      1     0.33                        add   rbx, r8
 1      1     0.33                        add   rcx, 8
 1      1     0.33                        cmp   rsi, r9
 1      1     1.00                        jl    .Lj47

Resources:
[0]   - SBDivider
[1]   - SBFPDivider
[2]   - SBPort0
[3]   - SBPort1
[4]   - SBPort4
[5]   - SBPort5
[6.0] - SBPort23
[6.1] - SBPort23

Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6.0]  [6.1]  
 -      -     4.66   4.67    -     4.67   0.50   0.50   

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6.0]  [6.1]  Instructions:
 -      -     0.09   0.67    -     0.24    -      -     add     rsi, 1
 -      -      -      -      -      -     0.50   0.50   mov     r8, qword ptr [rcx]
 -      -     0.08   0.51    -     0.41    -      -     mov     r10, r8
 -      -     0.25   0.74    -     0.01    -      -     movabs  rdi, -9187201950435737472
 -      -     0.59   0.16    -     0.25    -      -     and     r10, rdi
 -      -     0.83    -      -     0.17    -      -     shr     r10, 7
 -      -     0.01   0.24    -     0.75    -      -     not     r8
 -      -     0.33    -      -     0.67    -      -     shr     r8, 6
 -      -     0.40    -      -     0.60    -      -     and     r10, r8
 -      -     0.91   0.01    -     0.08    -      -     mov     r11, r10
 -      -      -     1.00    -      -      -      -     popcnt  r8, r10
 -      -     1.00    -      -      -      -      -     add     rbx, r8
 -      -     0.09   0.75    -     0.16    -      -     add     rcx, 8
 -      -     0.08   0.59    -     0.33    -      -     cmp     rsi, r9
 -      -      -      -      -     1.00    -      -     jl      .Lj47

when removing/hoisting the movabs:

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6.0]  [6.1]  Instructions:
 -      -     0.16   0.32    -     0.52    -      -     add     rsi, 1
 -      -      -      -      -      -     0.50   0.50   mov     r8, qword ptr [rcx]
 -      -     0.51   0.01    -     0.48    -      -     mov     r10, r8
 -      -     0.67   0.33    -      -      -      -     and     r10, rdi
 -      -     1.00    -      -      -      -      -     shr     r10, 7
 -      -     0.01    -      -     0.99    -      -     not     r8
 -      -     0.33    -      -     0.67    -      -     shr     r8, 6
 -      -     0.32   0.68    -      -      -      -     and     r10, r8
 -      -     0.64   0.36    -      -      -      -     mov     r11, r10
 -      -      -     1.00    -      -      -      -     popcnt  r8, r10
 -      -     0.01   0.98    -     0.01    -      -     add     rbx, r8
 -      -     0.18   0.49    -     0.33    -      -     add     rcx, 8
 -      -     0.50   0.17    -     0.33    -      -     cmp     rsi, r9
 -      -      -      -      -     1.00    -      -     jl      .Lj47

and block rthroughput decreases from 3.8 to 3.5. The asm has 3.0, probably because the NOT r8 
is moved up between the R10 operations