[fpc-devel] Attn: J. Gareth // 3.3.1 opt = slower // Fwd: [Lazarus] Faster than popcnt

J. Gareth Moreton gareth at moreton-family.com
Tue Jan 4 17:15:50 CET 2022


I neglected to include -Cpcoreavx; that was my mistake.  I'll try again.

According to the Intel® 64 and IA-32 Architectures Software Developer’s 
Manual, Vol. 2B, page 4-391, the zero flag is set if the source is zero 
and cleared otherwise.  As for the undefined result, I confused POPCNT 
with the BSF and BSR instructions, sorry.  I guess I was more tired than 
I thought!  POPCNT returns zero for a zero input.
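
To make the distinction concrete, here is a minimal sketch in C (not the 
FPC code from this thread).  It assumes GCC or Clang, whose 
__builtin_popcountll and __builtin_ctzll builtins stand in for POPCNT and 
BSF; the function names and the "return 64 for zero" convention are my 
own, purely for illustration.

```c
#include <stdint.h>

/* Portable software popcount: clears the lowest set bit each pass.
   For x == 0 the loop never runs, so the result is 0. */
unsigned popcount64_sw(uint64_t x) {
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;   /* drop the lowest set bit */
        n++;
    }
    return n;
}

/* Compiler builtin (GCC/Clang assumption). With a suitable target
   option it can compile to a single POPCNT; it is well defined for 0. */
unsigned popcount64(uint64_t x) {
    return (unsigned)__builtin_popcountll(x);
}

/* Bit-scan-forward analogue. __builtin_ctzll is undefined for 0,
   so a zero input needs an explicit guard. Returning 64 here is a
   hypothetical convention, not something mandated by the hardware. */
unsigned lowest_set_bit(uint64_t x) {
    return x != 0 ? (unsigned)__builtin_ctzll(x) : 64u;
}
```

So a population count needs no special case for zero, while a bit scan 
does.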

Gareth aka. Kit

On 04/01/2022 16:03, Marco van de Voort via fpc-devel wrote:
> On 4-1-2022 16:31, Martin Frb via fpc-devel wrote:
>>
>>> Weird, as mine is inlined with -Cpcoreavx -O4, with no special 
>>> handling for 0. But that does put some things on shaky ground. Maybe 
>>> zero the result beforehand?
>>
>> Same here.
>
> I looked up popcnt and found nothing about the result not being set 
> when the input is zero. (E.g. https://www.felixcloutier.com/x86/popcnt )
>
> Meanwhile I also ran it on my Ryzen 4800H laptop and updated the 
> version on the web with the stats. The stats for the long string are 
> about as fast as on my i7-3770 (laptop vs desktop memory bandwidth? The 
> Ryzen should be faster in every way?!?), but the short one (40 bytes) 
> is significantly faster. What I don't get is why the assembler version 
> seems systematically faster even for the short code. The generated asm 
> is nearly the same.
>
> Also notable is that on this machine with popcnt (-Cpcoreavx), the 
> popcnt version is as fast as the add function within error margins, so 
> the popcnt instruction is probably faster (lower latency) and thus less 
> of a bottleneck on this machine.  Note that the POP() function is half 
> the size, which makes it the better choice for newer machines.
>
> ---------
>
> Note that I test on Windows, so the "two times load" might be a 
> difference due to different code generation on Windows.
>
>>
>> ----------------------------------------
>> About UTF8LengthFast()
>>
>> Well, before I get to this, I noted something weird.....
>>
>> 2 runs, compiled with the same compiler ( 3.2.3 ), and the same 
>> settings, with the only difference: -gw3 or not -gw3
>> => And the speed differed.  600 (with dwarf)  vs 700 (no dwarf) / 
>> reproducible.
>
> I have also seen this while working on the code, and indeed mainly 
> with the "fast" one. It also explains why the assembler version is 
> always consistent: it suffers less from small code changes, e.g. when 
> I update FPC from git, and thus from different alignment (assuming 
> that the section starts are always aligned).
>
>> Alignment. 16 vs 32 byte. Can that make a difference?
>> According to: 
>> https://stackoverflow.com/questions/61016077/32-byte-aligned-routine-does-not-fit-the-uops-cache
>
> Seems to be a problem of the Skylake and later archs, which I no 
> longer have. The i7 is too old, and the others are AMD.



