[fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

Tue Jan 4 17:03:42 CET 2022

On 4-1-2022 16:31, Martin Frb via fpc-devel wrote:
>
>> Weird as mine is inlined with -Cpcoreavx -O4, with no special 
>> handling for 0. But that does put some things on shaky ground. Maybe 
>> zero the result beforehand?
>
> Same here.

I looked up popcnt and found nothing about the destination not being set 
when the source is zero. (E.g. https://www.felixcloutier.com/x86/popcnt )
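
For reference, a minimal sketch in C (not the code under discussion) of 
the behaviour one would expect for a zero input: a population count of 0 
is simply 0, with no special case. The names popcount64_ref and 
popcount64 are made up for this illustration, and __builtin_popcountll is 
a GCC/Clang builtin, so the second function assumes one of those 
compilers (it falls back to the reference loop otherwise).

```c
#include <stdint.h>

/* Reference bit count: clears the lowest set bit until none remain.
   For x == 0 the loop body never runs, so the result is 0. */
static unsigned popcount64_ref(uint64_t x)
{
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;   /* drop the lowest set bit */
        n++;
    }
    return n;
}

/* Assumption: GCC or Clang. The builtin may compile to a single popcnt
   instruction when the target allows it (e.g. -mpopcnt). */
static unsigned popcount64(uint64_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    return (unsigned)__builtin_popcountll(x);
#else
    return popcount64_ref(x);
#endif
}
```

Comparing popcount64(0) with popcount64_ref(0) is a cheap way to check 
that a zero input needs no pre-zeroing of the result.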

Meanwhile I also ran it on my Ryzen 4800H laptop and updated the version 
on the web with the stats. The stats for the long string are about as 
fast as on my i7-3770 (laptop vs desktop memory bandwidth? The Ryzen 
should be faster in every way?!?), but the short one (40 bytes) is 
significantly faster. What I don't get is why the assembler version 
seems systematically faster even for the short code. The generated asm 
is nearly the same.

Also notable is that on this machine, with popcnt enabled (-Cpcoreavx), 
the popcnt version is as fast as the add function within error margins, 
so the popcnt instruction is probably faster (lower latency) and thus 
less of a bottleneck on this machine. Note that the POP() function is 
half the size, which makes it the better choice for newer machines.
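
To make the popcnt-vs-add comparison concrete, here is a rough sketch in 
C of the general technique: count the UTF-8 continuation bytes 
(10xxxxxx) eight at a time and subtract them from the byte length. This 
is not the Pascal code from the thread nor Lazarus' UTF8LengthFast; all 
names (continuation_mask, count_bits_popcnt, count_bits_add, 
utf8_length) are hypothetical, and the popcount variant assumes a 
GCC/Clang builtin.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES 0x0101010101010101ULL

/* Bit 0 of each byte lane is set where that byte is 10xxxxxx
   (bit 7 set, bit 6 clear), i.e. a UTF-8 continuation byte. */
static uint64_t continuation_mask(uint64_t w)
{
    return (w >> 7) & (~w >> 6) & ONES;
}

/* Variant 1: count the marked lanes with a population count.
   Assumption: GCC or Clang; otherwise reuse the add variant's trick. */
static unsigned count_bits_popcnt(uint64_t m)
{
#if defined(__GNUC__) || defined(__clang__)
    return (unsigned)__builtin_popcountll(m);
#else
    return (unsigned)((m * ONES) >> 56);
#endif
}

/* Variant 2: horizontal add. Each lane holds 0 or 1, so multiplying by
   ONES sums all eight lanes into the top byte without overflow. */
static unsigned count_bits_add(uint64_t m)
{
    return (unsigned)((m * ONES) >> 56);
}

/* Number of code points in s[0..len), assuming well-formed UTF-8. */
static size_t utf8_length(const char *s, size_t len, int use_popcnt)
{
    size_t cont = 0;
    size_t i = 0;

    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);            /* unaligned-safe 8-byte load */
        uint64_t m = continuation_mask(w);
        cont += use_popcnt ? count_bits_popcnt(m) : count_bits_add(m);
    }
    for (; i < len; i++)                 /* tail, one byte at a time */
        cont += ((unsigned char)s[i] & 0xC0) == 0x80;

    return len - cont;
}
```

Both variants return the same count; they differ only in how the marked 
lanes are summed, which is where the instruction latency mentioned above 
would show up.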

---------

Note that I test on Windows, so the "two times load" might be a 
difference caused by different code generation on Windows.

>
> ----------------------------------------
> About UTF8LengthFast()
>
> Well, before I get to this, I noted something weird.....
>
> 2 runs, compiled with the same compiler ( 3.2.3 ), and the same 
> settings, with the only difference: -gw3 or not -gw3
> => And the speed differed.  600 (with dwarf)  vs 700 (no dwarf) / 
> reproducible.

I have also seen this while working on the code, and indeed mainly with 
the "fast" one. It also explains why the assembler version is always 
consistent: it suffers less from small code changes, e.g. when I update 
FPC from git, and the different alignment that results (assuming that 
the section starts are always aligned).

> Alignment. 16 vs 32 byte. Can that make a difference?
> According to: 
> https://stackoverflow.com/questions/61016077/32-byte-aligned-routine-does-not-fit-the-uops-cache

This seems to be a problem of Skylake and later architectures, which I 
no longer have. The i7 is too old, and the others are AMD.