[fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt
J. Gareth Moreton
gareth at moreton-family.com
Tue Jan 4 17:15:50 CET 2022
I neglected to include -Cpcoreavx; that was my bad. I'll try again.
According to the Intel® 64 and IA-32 Architectures Software Developer's
Manual, Vol. 2B, page 4-391, the zero flag is set if the source is zero
and cleared otherwise. Regarding an undefined result, I got confused
with the BSF and BSR instructions, sorry. I guess I was more tired than I
thought! POPCNT returns zero for a zero input.
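
To make the distinction concrete, here is a minimal C sketch of the two
behaviours. It is only an illustration: the function names are made up for
this mail, and neither routine is what FPC generates. The bit-counting
routine is the usual portable SWAR fallback, and the bit-scan routine
handles zero explicitly because BSF/BSR leave the destination undefined in
that case.

```c
#include <stdint.h>

/* Portable population count (SWAR). Like POPCNT, it is well defined for a
   zero input and simply returns 0. Hypothetical helper name. */
static unsigned popcount64(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);                           /* 2-bit sums */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL); /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                           /* 8-bit sums */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);                 /* add the bytes */
}

/* Index of the lowest set bit. BSF leaves its destination undefined when
   the source is zero, so this sketch returns -1 for that case instead. */
static int lowest_set_bit(uint64_t x)
{
    int i = 0;
    if (x == 0)
        return -1;
    while ((x & 1) == 0) {
        x >>= 1;
        ++i;
    }
    return i;
}
```

So a caller of the first routine needs no special case for zero, while a
caller of a raw bit scan does.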
Gareth aka. Kit
On 04/01/2022 16:03, Marco van de Voort via fpc-devel wrote:
> On 4-1-2022 16:31, Martin Frb via fpc-devel wrote:
>>
>>> Weird, as mine is inlined with -Cpcoreavx -O4, with no special
>>> handling for 0. But that does put some things on shaky ground. Maybe
>>> zero the result beforehand?
>>
>> Same here.
>
> I looked up popcnt and found nothing about the result not being set
> when the input is zero. (E.g. https://www.felixcloutier.com/x86/popcnt )
>
> Meanwhile I also ran it on my Ryzen 4800H laptop and updated the version
> on the web with the stats. The stats for the long string are about as
> fast as on my i7-3770 (laptop vs desktop memory bandwidth? The Ryzen
> should be faster in every way?!?), but the short one (40 bytes) is
> significantly faster. What I don't get is why the assembler version
> seems systematically faster even for the short code. The generated asm
> is nearly the same.
>
> Also notable is that on this machine with popcnt (-Cpcoreavx), the
> popcnt version is as fast as the add function within error margins, so
> the popcnt instruction is probably faster (lower latency) and thus less
> of a bottleneck on this machine. Note that the POP() function is half
> the size, so that makes it better for newer machines.
>
> ---------
>
> Note that I test on Windows, so it might be that the "two times load"
> is a difference somehow due to different code generation on Windows.
>
>>
>> ----------------------------------------
>> About UTF8LengthFast()
>>
>> Well, before I get to this, I noted something weird.....
>>
>> 2 runs, compiled with the same compiler ( 3.2.3 ), and the same
>> settings, with the only difference: -gw3 or not -gw3
>> => And the speed differed. 600 (with dwarf) vs 700 (no dwarf) /
>> reproducible.
>
> I have also seen this while working on the code, and indeed mainly
> with the "fast" one. It also explains why the assembler is always
> consistent: it suffers less from detail code changes, e.g. when I
> update FPC from git, and thus from different alignment (assuming that
> the section starts are always aligned).
>
>> Alignment: 16 vs 32 bytes. Can that make a difference?
>> According to:
>> https://stackoverflow.com/questions/61016077/32-byte-aligned-routine-does-not-fit-the-uops-cache
>
> Seems to be a problem of the Skylake and later archs, which I no
> longer have. The i7 is too old, and the others are AMD.
>
>
>
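
For readers without the test program at hand, the popcnt versus add
comparison quoted above can be sketched roughly as follows. This is an
assumption-laden C illustration, not the Lazarus UTF8LengthFast code or
Marco's benchmark: all names are hypothetical. Both routines count UTF-8
code points by subtracting the number of continuation bytes (10xxxxxx)
from the byte length, eight bytes at a time; they differ only in how the
per-word count is reduced.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 0x80 in every byte of w that is a continuation byte:
   bit 7 set and bit 6 clear. */
static uint64_t continuation_mask(uint64_t w)
{
    return w & (~w << 1) & 0x8080808080808080ULL;
}

/* Portable stand-in for the POPCNT instruction. */
static unsigned utf8_popcount64(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}

/* Variant 1: count the marker bits of each word with a population count. */
static size_t utf8_length_popcnt(const char *s, size_t n)
{
    size_t cont = 0, i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);            /* safe unaligned load */
        cont += utf8_popcount64(continuation_mask(w));
    }
    for (; i < n; ++i)                   /* tail, byte by byte */
        cont += ((unsigned char)s[i] & 0xC0) == 0x80;
    return n - cont;
}

/* Variant 2: turn each marker into 0x01 and add the eight bytes together. */
static size_t utf8_length_add(const char *s, size_t n)
{
    size_t cont = 0, i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);
        uint64_t ones = continuation_mask(w) >> 7;   /* 0x01 per hit */
        cont += (size_t)((ones * 0x0101010101010101ULL) >> 56);
    }
    for (; i < n; ++i)
        cont += ((unsigned char)s[i] & 0xC0) == 0x80;
    return n - cont;
}
```

Whether variant 1 ends up as a single popcnt instruction depends on the
compiler and target options, which is exactly the part the timings in
this thread are about.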