[fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

Martin Frb lazarus at mfriebe.de
Tue Jan 4 16:31:42 CET 2022

On 04/01/2022 10:31, Marco van de Voort via fpc-devel wrote:
> Weird as mine is inlined with -Cpcoreavx -O4, with no special handling 
> for 0. But that does put some things on shaky ground. Maybe zero the 
> result before hand?

Same here.

About UTF8LengthFast()

Well, before I get to this, I noted something weird.....

Two runs, compiled with the same compiler (3.2.3) and the same 
settings; the only difference was -gw3 or no -gw3.
=> And the speed differed: 600 (with dwarf) vs 700 (no dwarf).

So then
=> I compiled the app with the same 3.2.3 (WITHOUT -a, though with -al 
I get the same results).
=> I compiled once with -gw3, and once without dwarf info.
=> I used objdump to disassemble the exe.
=> I diffed the output (and searched for 0x101010...., which is only 
used in the ...Fast code).
--> _*The assembler is identical.*_
Yet one is faster. (the one WITH dwarf)
The calling code in the main body is the same too (as far as I could 
see), except that the address of the callee is different (but it is just 
20 calls per measurement)

I did those runs OUTSIDE the IDE.
So no debugger in the background.

Win64 / 64 bit
Core I7 8600K

Using 3.3.1 the speed is equal, regardless of whether dwarf is generated.
(I did not compare the asm for that...)

Clutching at straws, there is (maybe) one difference between the 3.2.3 
assembler with and without dwarf.
*** I am totally out of my depth here ****

Alignment. 16 vs 32 byte. Can that make a difference?
According to: 
>     The Decoded ICache consists of 32 sets. Each set contains eight 
> Ways. Each Way can hold up to six micro-ops.
>     All micro-ops in a Way represent instructions which are statically 
> contiguous in the code and have their EIPs within _*the same aligned 
> 32-byte region*_.

So the alignment of the two procedures differs by 16 bytes.
The proc entry is at:
With dwarf     100001870
Without dwarf  100001860  // actually this is 32-byte aligned (but slower)
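
The alignment of the two entry points can be checked by masking the low 
five bits of each address. A minimal sketch (OffsetIn32 is a helper name 
made up for this illustration, not something from the RTL):

```pascal
program AlignCheck;
{$mode objfpc}

// Offset of an address inside its aligned 32-byte window
// (0 means the address itself is 32-byte aligned).
function OffsetIn32(Addr: QWord): Integer;
begin
  Result := Integer(Addr and 31);
end;

begin
  // Procedure entry points quoted above.
  WriteLn('with dwarf:    ', OffsetIn32(QWord($100001870)));
  WriteLn('without dwarf: ', OffsetIn32(QWord($100001860)));
end.
```

This should print 16 for the dwarf build and 0 for the non-dwarf build, 
i.e. the slower build is the one that starts on a 32-byte boundary.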

Yet, maybe it matters which statements in the big loop happen to fall 
into the same 32-byte block?

The loop starting with
    for i := 1 to (ByteCount-cnt) div sizeof(PtrInt) do

With DWARF (faster):
    1000018f0:    49 83 c2 01              add    $0x1,%r10
    1000018f4:    4c 8b 19                 mov    (%rcx),%r11
    1000018f7:    4d 89 d8                 mov    %r11,%r8
    1000018fa:    48 bf 80 80 80 80 80     movabs $0x8080808080808080,%rdi
    100001901:    80 80 80
    100001904:    49 21 f8                 and    %rdi,%r8
    100001907:    49 c1 e8 07              shr    $0x7,%r8
    10000190b:    49 f7 d3                 not    %r11
    10000190e:    49 c1 eb 06              shr    $0x6,%r11
    100001912:    4d 21 d8                 and    %r11,%r8
    100001915:    4c 89 c3                 mov    %r8,%rbx
    100001918:    49 bb 01 01 01 01 01     movabs $0x101010101010101,%r11
    10000191f:    01 01 01
    100001922:    4d 0f af c3              imul   %r11,%r8
    100001926:    49 c1 e8 38              shr    $0x38,%r8
    10000192a:    4c 01 c6                 add    %r8,%rsi
    10000192d:    48 83 c1 08              add    $0x8,%rcx
    100001931:    4d 39 d1                 cmp    %r10,%r9
    100001934:    7f ba                    jg     1000018f0 

Without DWARF (slower):
    1000018e0:    49 83 c2 01              add    $0x1,%r10
    1000018e4:    4c 8b 19                 mov    (%rcx),%r11
    1000018e7:    4d 89 d8                 mov    %r11,%r8
    1000018ea:    48 bf 80 80 80 80 80     movabs $0x8080808080808080,%rdi
    1000018f1:    80 80 80
    1000018f4:    49 21 f8                 and    %rdi,%r8
    1000018f7:    49 c1 e8 07              shr    $0x7,%r8
    1000018fb:    49 f7 d3                 not    %r11
    1000018fe:    49 c1 eb 06              shr    $0x6,%r11
    100001902:    4d 21 d8                 and    %r11,%r8
    100001905:    4c 89 c3                 mov    %r8,%rbx
    100001908:    49 bb 01 01 01 01 01     movabs $0x101010101010101,%r11
    10000190f:    01 01 01
    100001912:    4d 0f af c3              imul   %r11,%r8
    100001916:    49 c1 e8 38              shr    $0x38,%r8
    10000191a:    4c 01 c6                 add    %r8,%rsi
    10000191d:    48 83 c1 08              add    $0x8,%rcx
    100001921:    4d 39 d1                 cmp    %r10,%r9
    100001924:    7f ba                    jg     0x1000018e0
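
Read back into Pascal, the loop body seems to do the following per 
8-byte word: take bit 7 of every byte, AND it with the inverted bit 6, 
then add up the resulting 0/1 bytes with one multiply. That counts bytes 
of the form 10xxxxxx, i.e. UTF-8 continuation bytes. The sketch below is 
only a reading of the disassembly, not the actual Lazarus source; the 
names HighBits, Ones and ContinuationBytes are made up.

```pascal
program ContBytes;
{$mode objfpc}
{$Q-}{$R-}  // the multiply below relies on 64-bit wrap-around

const
  HighBits = QWord($8080808080808080);  // bit 7 of every byte
  Ones     = QWord($0101010101010101);  // bit 0 of every byte

// Number of bytes in W that look like 10xxxxxx.
function ContinuationBytes(W: QWord): QWord;
var
  X: QWord;
begin
  // Bit 0 of each byte of X is set if bit 7 is set and bit 6 is clear.
  X := ((W and HighBits) shr 7) and ((not W) shr 6);
  // Multiplying by Ones sums the eight 0/1 bytes into the top byte.
  Result := (X * Ones) shr 56;
end;

begin
  // Bytes 61 C3 A9 E2 82 AC 62 63 ('a', e-acute, euro sign, 'b', 'c')
  // as one little-endian word: three continuation bytes (A9, 82, AC).
  WriteLn(ContinuationBytes(QWord($6362AC82E2A9C361)));
end.
```

For that sample word it should print 3, so 8 bytes minus 3 continuation 
bytes gives 5 codepoints.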

*** written before I got into the above......

About UTF8LengthFast()

I notice 3 differences.
Though I only compare the O4 result, and have no idea what is 
pre-peephole and post-peephole.

1) As you say: different registers (but the same statements, in the same 
order). No idea if that affects the CPU.

2) One extra statement in 3.3.1

     movq    %r10,%r8 //// <<<<<<<<<<<<<<< not in 3.2.3
     movq    $72340172838076673,%r10
     imulq    %r10,%r8
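
The decimal immediate here is the same multiplier that objdump prints as 
0x101010101010101 in the 3.2.3 loop. A quick check, assuming nothing 
beyond HexStr from the System unit:

```pascal
program OnesConst;
{$mode objfpc}
begin
  // Decimal immediate from the 3.3.1 listing, shown as 16 hex digits.
  WriteLn(HexStr(Int64(72340172838076673), 16));
end.
```

This should print 0101010101010101, so only the notation differs 
between the compiler's .s output and the objdump listing.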

3) I do have "dwarf" enabled, even though at O4 that is not expected to 
do any good.
I noted that in 3.3.1 this leads to way more asm labels than in 3.2.3.
Those labels are only referred to by the dwarf line info (some asm 
statements are reported to be in the "begin" line, even though they 
clearly are not).
- It could be a result of the peephole opt.
- But it could also be that the peephole is affected by the presence of 
those labels.
