[fpc-devel] LEA instruction speed

Tue Oct 10 13:24:24 CEST 2023

I'm all for receiving results for all kinds of processor, as it helps me 
to make more informed choices on flags as well as confirming that Agner 
Fog's instruction tables are correct. Also, results for older 
processors can be hard to come by sometimes.

Currently, most architectures have a fast LEA, and the default "Athlon" 
option lines up with this.  Of the Intel architectures, the speed slows 
down on COREAVX onwards (COREI is fine), so I added a new COREX (for 
10th generation Core) option between ZEN2 and ZEN3 to mark the point 
where LEA is fast again (its 16-bit version is also fast, unlike Zen 3).

In the meantime I'll be looking at the benchmarking code that Stefan 
provided to see if it can and should be integrated.
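
As a rough illustration of the iteration-count trade-off discussed in the quoted thread below, here is a minimal C sketch (not Stefan's code and not anything in FPC) that doubles the batch size until one batch takes at least a minimum time, reading the clock only once before and once after each batch so that the timing calls themselves add little overhead. The names ns_per_call, bench_fn and sample_routine are hypothetical, and the starting batch size of 1000 is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical signature for a routine under test. */
typedef uint32_t (*bench_fn)(uint32_t);

/* Hypothetical stand-in for one of the routines being timed. */
uint32_t sample_routine(uint32_t x)
{
    return x + x * 4 + 8;
}

/* Returns the average cost of one call to fn, in nanoseconds.
   The batch size doubles until a single batch takes at least
   min_seconds of processor time, as reported by clock(). */
double ns_per_call(bench_fn fn, double min_seconds)
{
    uint64_t iterations = 1000;   /* arbitrary starting batch size */

    assert(fn != NULL && min_seconds > 0.0);
    for (;;) {
        volatile uint32_t sink = 0;   /* keeps the calls from being dropped */
        uint64_t i;
        clock_t start, stop;
        double elapsed;

        start = clock();
        for (i = 0; i < iterations; i++)
            sink = fn(sink + (uint32_t)i);
        stop = clock();

        elapsed = (double)(stop - start) / CLOCKS_PER_SEC;
        if (elapsed >= min_seconds)
            return elapsed * 1e9 / (double)iterations;
        iterations *= 2;              /* too short to trust: try a bigger batch */
    }
}
```

Calling ns_per_call(sample_routine, 5.0) would approximate the five-second budget. A slow machine simply settles on a smaller batch, though the total run time can overshoot the budget by a small constant factor because of the doubling.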

Thanks again everyone for the results you're giving.

Kit

P.S. Regarding the parallelism of LEA instructions running alongside 
other arithmetic/logical operations, that will be an interesting field 
of research.  At the very least, the post-peephole stage can change an 
ADD or SUB into a LEA if using an AGU instead of an ALU appears to give 
a micro-optimisation.  It also benefits hyperthreading, as the ALUs 
tend to be very heavily used, while AGUs tend to be used one at a 
time.
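
To make the ADD/SUB-to-LEA substitution concrete, here is a small C sketch (an illustration only, not FPC's optimiser code) showing that "LEA reg, [reg+8]" and "LEA reg, [reg-8]" produce the same values as adding or subtracting the constant 8. The function names are hypothetical. The inline assembly assumes a GCC-compatible compiler targeting x86-64; on anything else the sketch falls back to plain C. Note that LEA leaves the flags untouched, so the substitution is only safe when nothing afterwards reads the flags that ADD or SUB would have set.

```c
#include <assert.h>
#include <stdint.h>

/* What "ADD reg, 8" and "SUB reg, 8" compute, written in plain C. */
uint64_t add8_alu(uint64_t x) { return x + 8; }
uint64_t sub8_alu(uint64_t x) { return x - 8; }

/* The same results via LEA, which does not modify the flags. */
uint64_t add8_lea(uint64_t x)
{
#if defined(__x86_64__) && defined(__GNUC__)
    uint64_t r;
    __asm__("leaq 8(%1), %0" : "=r"(r) : "r"(x));
    return r;
#else
    return x + 8;   /* fallback on other targets: no LEA available */
#endif
}

uint64_t sub8_lea(uint64_t x)
{
#if defined(__x86_64__) && defined(__GNUC__)
    uint64_t r;
    __asm__("leaq -8(%1), %0" : "=r"(r) : "r"(x));
    return r;
#else
    return x - 8;   /* fallback on other targets: no LEA available */
#endif
}

/* Checks that both forms agree for one input value. */
void check_equivalence(uint64_t x)
{
    assert(add8_lea(x) == add8_alu(x));
    assert(sub8_lea(x) == sub8_alu(x));
}
```

Whether the LEA form is actually faster depends on the processor, which is what the collected benchmark results are meant to establish.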

On 10/10/2023 11:54, Tomas Hajny via fpc-devel wrote:
> On 2023-10-10 12:19, Marco van de Voort via fpc-devel wrote:
>> Op 10-10-2023 om 11:13 schreef J. Gareth Moreton via fpc-devel:
>>> Thanks Tomas,
>>>
>>> Nothing is broken, but the timing measurement isn't precise enough.
>>>
>>> Normally I have a much higher iteration count (e.g. 1,000,000), but 
>>> I had reduced it to 10,000 because the higher count, coupled with 
>>> the 1,000 iterations in the subroutines themselves, would have led 
>>> to 1,000,000,000 passes and hence would take in the region of five 
>>> to ten minutes to complete on a 16 MHz 386, for example.  Rika's 
>>> suggestion of running as many iterations as needed until, say, 5 
>>> seconds elapses, would help but the timing measurements would cause 
>>> a lot of latency and will be imprecise on very slow routines.  
>>> Still, let's see if 100,000 gives better results for you.
>>>
>> I had the same problem, and now it is stable.  Ryzen 5700X (ZEN3):
>>
>>    Pascal control case: 0.7 ns/call
>>  Using LEA instruction: 0.4 ns/call
>> Using ADD instructions: 0.7 ns/call
>
> Indeed, it's much more consistent now; I've attached a new log for 
> both 32-bit and 64-bit versions from the Intel machine with Windows. 
> Apparently, ADD is still somewhat faster on such "newer" Intel 
> machines (at least if not considering the potential parallelism of LEA 
> discussed previously). I can try this version on my AMD machines later 
> tonight if considered useful - please, let me know which results would 
> be relevant for you in that case (out of the ancient AMD DX4, only 
> slightly less ancient AMD Athlon 1 GHz and the still rather reasonable 
> AMD A9).
>
> Tomas
>