[fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt
Marco van de Voort
fpc at pascalprogramming.org
Tue Jan 4 10:31:05 CET 2022
On 4-1-2022 01:06, J. Gareth Moreton via fpc-devel wrote:
> Prepare for a lot of technical rambling!
>
> This is just an analysis of the compilation of utf8lentest.lpr, not
> any of the System units. Notably, POPCNT isn't called directly, but
> instead goes through the System unit via "call fpc_popcnt_qword" on
> both 3.2.x and 3.3.1. A future study of "fpc_popcnt_qword" might
> yield some interesting information.
Did you pass -Cpcoreavx ?
# [381] Result += PopCnt(qword(nx));
popcntq %r10,%r8
addq %r8,%rbx
>
> In this situation, one can replace %r11 with %r13 in the 3rd and 4th
> instructions so the massive constant isn't assigned twice:
The large constant(s) should be hoisted out of the loop. This is done in
the assembler code:
.Lj11:
movq (%rcx),%r8
movq %r8,%r10
andq %rdx,%r10
notq %r8
shrq $7,%r10
shrq $6,%r8
andq %r8,%r10
subq %r8,%rax // subtract from rax=length(s). Iow
algorithm is slightly different. don't sum
// correction but
directly subtract from length. Saves bytecount from being a live value
till the end
popcntq %r10,%r8
addq $8,%rcx
decq %r11
jne .Lj11
As far as I can see the assembler version is about 25-33% faster for low
counts (40 bytes) . For larger strings, the SSE2 code kicks in and that
is a solid factor 2. (270 vs 538 for the fastest "add", see table in
the comments of the source)
Note that I use pop/fast because they are simpler and I only use for low
counts. The SSE2 version is more like the ADD variant.
> For actual reduction of pipeline stalls, I'm not quite sure how smart
> the Intel CPU firmware is and if rearranging instructions would
> actually improve throughput or not. For example, on 3.2.x, there are
> these four instructions that configure parameters and preserve %rax:
>
I tried for above code on godbolt, attached is the analysis. (both for
the asm loop, the popcnt loop from my code and the popcnt with the
movabs removed to test hoisting)
Note my 3.3.1 (from 29 dec) doesn't seem to generate two loads. With my
limited assembler abilities I'd say it is not the hoisting that makes
the manual asm loop faster as I first thought, but moving the NOT up to
where it is.
> Also, POPCNT is handled in the System unit and is not inlined, mainly
> because of how an input of zero is handled (the function has to return
> 255 in the result, whereas Intel's POPCNT instruction sets the zero
> flag and leaves the result undefined).
Weird as mine is inlined with -Cpcoreavx -O4, with no special handling
for 0. But that does put some things on shaky ground. Maybe zero the
result before hand?
-------------- next part --------------
manual asm :
--------------------------------------
.Lj47:
add rsi,1
mov r8,qword [rcx]
mov r10,r8
mov rdi,-9187201950435737472
and r10,rdi
shr r10,7
not r8
shr r8,6
and r10,r8
mov r11,r10
popcnt r8,r10
add rbx,r8
add rcx,8
cmp rsi,r9
jnge .Lj47
llvm-mca: -mcpu=ivybridge
Iterations: 100
Instructions: 1200
Total Cycles: 378
Total uOps: 1200
Dispatch Width: 4
uOps Per Cycle: 3.17
IPC: 3.17
Block RThroughput: 3.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
1 5 0.50 * mov r8, qword ptr [rcx]
1 1 0.33 mov r10, r8
1 1 0.33 and r10, rdx
1 1 0.33 not r8
1 1 0.50 shr r10, 7
1 1 0.50 shr r8, 6
1 1 0.33 and r10, r8
1 1 0.33 sub rax, r8
1 3 1.00 popcnt r8, r10
1 1 0.33 add rcx, 8
1 1 0.33 dec r11
1 1 1.00 jne .Lrestqwords
Resources:
[0] - SBDivider
[1] - SBFPDivider
[2] - SBPort0
[3] - SBPort1
[4] - SBPort4
[5] - SBPort5
[6.0] - SBPort23
[6.1] - SBPort23
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6.0] [6.1]
- - 3.66 3.67 - 3.67 0.50 0.50
Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6.0] [6.1] Instructions:
- - - - - - 0.50 0.50 mov r8, qword ptr [rcx]
- - 0.21 0.63 - 0.16 - - mov r10, r8
- - 0.35 0.61 - 0.04 - - and r10, rdx
- - 0.50 0.16 - 0.34 - - not r8
- - 0.52 - - 0.48 - - shr r10, 7
- - 0.51 - - 0.49 - - shr r8, 6
- - 0.87 0.08 - 0.05 - - and r10, r8
- - 0.03 0.10 - 0.87 - - sub rax, r8
- - - 1.00 - - - - popcnt r8, r10
- - 0.47 0.36 - 0.17 - - add rcx, 8
- - 0.20 0.73 - 0.07 - - dec r11
- - - - - 1.00 - - jne .Lrestqwords
popcnt trunk -O4 -Cpcoreavx
.Lj47:
add rsi,1
mov r8,qword ptr [rcx]
mov r10,r8
mov rdi,-9187201950435737472
and r10,rdi
shr r10,7
not r8
shr r8,6
and r10,r8
mov r11,r10
popcnt r8,r10
add rbx,r8
add rcx,8
cmp rsi,r9
jnge .Lj47
llvm-mca:
Iterations: 100
Instructions: 1500
Total Cycles: 477
Total uOps: 1500
Dispatch Width: 4
uOps Per Cycle: 3.14
IPC: 3.14
Block RThroughput: 3.8
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
1 1 0.33 add rsi, 1
1 5 0.50 * mov r8, qword ptr [rcx]
1 1 0.33 mov r10, r8
1 1 0.33 movabs rdi, -9187201950435737472
1 1 0.33 and r10, rdi
1 1 0.50 shr r10, 7
1 1 0.33 not r8
1 1 0.50 shr r8, 6
1 1 0.33 and r10, r8
1 1 0.33 mov r11, r10
1 3 1.00 popcnt r8, r10
1 1 0.33 add rbx, r8
1 1 0.33 add rcx, 8
1 1 0.33 cmp rsi, r9
1 1 1.00 jl .Lj47
Resources:
[0] - SBDivider
[1] - SBFPDivider
[2] - SBPort0
[3] - SBPort1
[4] - SBPort4
[5] - SBPort5
[6.0] - SBPort23
[6.1] - SBPort23
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6.0] [6.1]
- - 4.66 4.67 - 4.67 0.50 0.50
Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6.0] [6.1] Instructions:
- - 0.09 0.67 - 0.24 - - add rsi, 1
- - - - - - 0.50 0.50 mov r8, qword ptr [rcx]
- - 0.08 0.51 - 0.41 - - mov r10, r8
- - 0.25 0.74 - 0.01 - - movabs rdi, -9187201950435737472
- - 0.59 0.16 - 0.25 - - and r10, rdi
- - 0.83 - - 0.17 - - shr r10, 7
- - 0.01 0.24 - 0.75 - - not r8
- - 0.33 - - 0.67 - - shr r8, 6
- - 0.40 - - 0.60 - - and r10, r8
- - 0.91 0.01 - 0.08 - - mov r11, r10
- - - 1.00 - - - - popcnt r8, r10
- - 1.00 - - - - - add rbx, r8
- - 0.09 0.75 - 0.16 - - add rcx, 8
- - 0.08 0.59 - 0.33 - - cmp rsi, r9
- - - - - 1.00 - - jl .Lj47
when removing/hoisting the movabs:
Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6.0] [6.1] Instructions:
- - 0.16 0.32 - 0.52 - - add rsi, 1
- - - - - - 0.50 0.50 mov r8, qword ptr [rcx]
- - 0.51 0.01 - 0.48 - - mov r10, r8
- - 0.67 0.33 - - - - and r10, rdi
- - 1.00 - - - - - shr r10, 7
- - 0.01 - - 0.99 - - not r8
- - 0.33 - - 0.67 - - shr r8, 6
- - 0.32 0.68 - - - - and r10, r8
- - 0.64 0.36 - - - - mov r11, r10
- - - 1.00 - - - - popcnt r8, r10
- - 0.01 0.98 - 0.01 - - add rbx, r8
- - 0.18 0.49 - 0.33 - - add rcx, 8
- - 0.50 0.17 - 0.33 - - cmp rsi, r9
- - - - - 1.00 - - jl .Lj47
and block rthroughput decreases from 3.8 to 3.5. The asm has 3.0, probably because the NOT r8
is moved up between the R10 operations
More information about the fpc-devel
mailing list