[fpc-devel] LEA instruction speed

Wed Oct 11 02:47:56 CEST 2023

On 2023-10-10 13:24, J. Gareth Moreton via fpc-devel wrote:
> I'm all for receiving results for all kinds of processor, as it helps
> me to make more informed choices on flags as well as confirming that
> Agner Fog's instruction tables are correct. Also, results for older
> processors can be hard to come by sometimes.
> 
> Currently, most architectures have a fast LEA, and the default
> "Athlon" option lines up with this.  Of the Intel architectures, the
> speed slows down on COREAVX onwards (COREI is fine), so I added a new
> COREX (for 10th generation Core) option between ZEN2 and ZEN3 to mark
> the point where LEA is fast again (its 16-bit version is also fast,
> unlike Zen 3).
> 
> In the meantime I'll be looking at the benchmarking code that Stefan
> provided to see if it can and should be integrated.
> 
> Thanks again everyone for the results you're giving.

Alright, fine (I modified your test to include the CPU name as well if 
possible and added an IFDEFed distinction of 32-bits versus 64-bits):

32-bits:
CPU = AMD A9-9425 RADEON R5, 5 COMPUTE CORES 2C+3G
-----------------------------------------------------
    Pascal control case: 0.85 ns/call
  Using LEA instruction: 0.56 ns/call
Using ADD instructions: 0.84 ns/call

64-bits:
CPU = AMD A9-9425 RADEON R5, 5 COMPUTE CORES 2C+3G
-----------------------------------------------------
    Pascal control case: 0.85 ns/call
  Using LEA instruction: 0.56 ns/call
Using ADD instructions: 0.85 ns/call

32-bits:
CPU = AMD Athlon(tm) Processor
------------------------------
    Pascal control case: 6.10 ns/call
  Using LEA instruction: 3.40 ns/call
Using ADD instructions: 3.40 ns/call

32-bits:
(AMD DX4 100 MHz - no CPUID name)
    Pascal control case: 123 ns/call
  Using LEA instruction: 72 ns/call
Using ADD instructions: 73 ns/call

Tomas

> 
> On 10/10/2023 11:54, Tomas Hajny via fpc-devel wrote:
>> On 2023-10-10 12:19, Marco van de Voort via fpc-devel wrote:
>>> Op 10-10-2023 om 11:13 schreef J. Gareth Moreton via fpc-devel:
>>>> Thanks Tomas,
>>>> 
>>>> Nothing is broken, but the timing measurement isn't precise enough.
>>>> 
>>>> Normally I have a much higher iteration count (e.g. 1,000,000), but
>>>> I had reduced it to 10,000 because 1,000,000, coupled with the 1,000
>>>> iterations in the subroutines themselves, would have led to
>>>> 1,000,000,000 passes and hence would take in the region of five to
>>>> ten minutes to complete on a 16 MHz 386, for example.  Rika's
>>>> suggestion of running as many iterations as needed until, say, 5
>>>> seconds elapse would help, but the timing measurements would cause
>>>> a lot of latency and would be imprecise on very slow routines.
>>>> Still, let's see if 100,000 gives better results for you.
>>>> 
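For reference, a minimal sketch of the time-budget idea mentioned above, 
with the clock read only once per batch of calls so that the timer 
overhead is spread over many calls. It is written in C rather than 
Pascal and is not the benchmark used in this thread. The names 
routine_under_test, bench_for_budget and ns_per_call, as well as the 
batch size, are hypothetical and chosen for illustration only. It also 
assumes that clock(), which reports processor time in units of 
1/CLOCKS_PER_SEC, is fine-grained enough on the target platform.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for one of the routines being measured.
   The volatile sink keeps the compiler from removing the work. */
static volatile uint32_t sink;

static void routine_under_test(void)
{
    sink = sink * 3u + 1u;
}

typedef struct {
    uint64_t calls;      /* total number of calls made */
    double   elapsed_ns; /* processor time consumed, in nanoseconds */
} bench_result;

/* Call fn in batches of `batch` calls until at least `budget_s` seconds
   of processor time have been used. The clock is read once per batch,
   not once per call, so its cost is amortised over the whole batch. */
static bench_result bench_for_budget(void (*fn)(void), uint32_t batch,
                                     double budget_s)
{
    bench_result r = {0, 0.0};
    clock_t start, now, limit;
    uint32_t i;

    assert(fn != NULL && batch > 0);

    start = clock();
    if (start == (clock_t)-1)
        return r; /* processor time not available on this platform */

    limit = (clock_t)(budget_s * CLOCKS_PER_SEC);
    do {
        for (i = 0; i < batch; i++)
            fn();
        r.calls += batch;
        now = clock();
    } while (now - start < limit);

    r.elapsed_ns = (double)(now - start) * 1e9 / CLOCKS_PER_SEC;
    return r;
}

/* Average cost of one call in nanoseconds (0 if nothing was run). */
static double ns_per_call(bench_result r)
{
    return r.calls ? r.elapsed_ns / (double)r.calls : 0.0;
}
```

Calling bench_for_budget(routine_under_test, 1000, 5.0) and passing the 
result to ns_per_call would give a figure in the same form as the 
ns/call numbers in this thread. A larger batch reduces the share of 
timer overhead, at the cost of overshooting the budget by up to one 
batch on a slow machine.
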
>>> I had the same problem, and now it is stable. Ryzen 5700X (ZEN3):
>>> 
>>>    Pascal control case: 0.7 ns/call
>>>  Using LEA instruction: 0.4 ns/call
>>> Using ADD instructions: 0.7 ns/call
>> 
>> Indeed, it's much more consistent now; I have attached a new log for both
>> 32-bit and 64-bit versions from the Intel machine with Windows.
>> Apparently, ADD is still somewhat faster on such "newer" Intel 
>> machines (at least if not considering the potential parallelism of LEA 
>> discussed previously). I can try this version on my AMD machines later 
>> tonight if considered useful - please, let me know which results would 
>> be relevant for you in that case (out of the ancient AMD DX4, only 
>> slightly less ancient AMD Athlon 1 GHz and the still rather reasonable 
>> AMD A9).
>> 
>> Tomas
>> 
>> _______________________________________________
>> fpc-devel maillist  -  fpc-devel at lists.freepascal.org
>> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel