[fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt
Marco van de Voort
fpc at pascalprogramming.org
Tue Jan 4 17:03:42 CET 2022
On 4-1-2022 16:31, Martin Frb via fpc-devel wrote:
>
>> Weird as mine is inlined with -Cpcoreavx -O4, with no special
>> handling for 0. But that does put some things on shaky ground. Maybe
>> zero the result before hand?
>
> Same here.
I looked up popcnt and found nothing about it leaving the result unset
when the input is zero (see e.g. https://www.felixcloutier.com/x86/popcnt ).
Meanwhile I also ran it on my Ryzen 4800H laptop and updated the version
on the web with the stats. For the long string the numbers are about the
same as on my i7-3770 (laptop vs desktop memory bandwidth? I would have
expected the Ryzen to be faster either way), but the short one (40 bytes)
is significantly faster. What I don't get is why the assembler version
seems systematically faster even in the short case, since the generated
asm is nearly the same.
Also notable is that on this machine, compiled with popcnt enabled
(-Cpcoreavx), the popcnt version is as fast as the add function within
error margins, so the popcnt instruction is probably faster (lower
latency) here and thus less of a bottleneck. Note that the POP() function
is half the size, which makes it the better choice for newer machines.
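
For readers without the sources at hand, here is a minimal sketch in C of
the two counting strategies being compared. It is NOT the code from this
thread: the function names are made up for illustration, and
__builtin_popcountll is a GCC/Clang builtin used as a stand-in for the
popcnt instruction. Both variants count the UTF-8 continuation bytes
(10xxxxxx) eight bytes at a time and subtract that count from the byte
length.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Put 0x01 in every byte lane of w that holds a UTF-8 continuation
   byte (10xxxxxx): bit 7 set and bit 6 clear. */
static uint64_t continuation_mask(uint64_t w)
{
    return ((w >> 7) & (~w >> 6)) & 0x0101010101010101ULL;
}

/* Hypothetical "popcnt" variant: count the marked lanes with popcount. */
static size_t utf8_length_pop(const char *s, size_t n)
{
    size_t cont = 0, i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);              /* safe unaligned load */
        cont += (size_t)__builtin_popcountll(continuation_mask(w));
    }
    for (; i < n; i++)                     /* remaining 0..7 bytes */
        cont += ((unsigned char)s[i] & 0xC0) == 0x80;
    return n - cont;
}

/* Hypothetical "add" variant: sum the eight lanes with a multiply;
   the total (at most 8) ends up in the top byte of the product. */
static size_t utf8_length_add(const char *s, size_t n)
{
    size_t cont = 0, i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);
        cont += (size_t)((continuation_mask(w) * 0x0101010101010101ULL) >> 56);
    }
    for (; i < n; i++)
        cont += ((unsigned char)s[i] & 0xC0) == 0x80;
    return n - cont;
}
```

Both functions assume valid UTF-8 and take the length in bytes. Whether
the popcount actually becomes a single popcnt instruction depends on the
target flags given to the compiler, much like -Cpcoreavx does for FPC.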
---------
Note that I test on Windows, so the "two times load" might somehow be
due to different code generation on Windows.
>
> ----------------------------------------
> About UTF8LengthFast()
>
> Well, before I get to this, I noted something weird.....
>
> 2 runs, compiled with the same compiler ( 3.2.3 ), and the same
> settings, with the only difference: -gw3 or not -gw3
> => And the speed differed. 600 (with dwarf) vs 700 (no dwarf) /
> reproducible.
I have also seen this while working on the code, and indeed mainly with
the "fast" one. It also explains why the assembler version is always
consistent: it suffers less from small code changes (e.g. when I update
FPC from git) and the different alignment that results, assuming that
the section starts are always aligned.
> Alignment. 16 vs 32 byte. Can that make a difference?
> According to:
> https://stackoverflow.com/questions/61016077/32-byte-aligned-routine-does-not-fit-the-uops-cache
That seems to be a problem of Skylake and later architectures, which I
no longer have. The i7 is too old, and the others are AMD.