[fpc-devel] LEA instruction speed
Tomas Hajny
XHajT03 at hajny.biz
Wed Oct 11 02:47:56 CEST 2023
On 2023-10-10 13:24, J. Gareth Moreton via fpc-devel wrote:
> I'm all for receiving results for all kinds of processor, as it helps
> me to make more informed choices on flags as well as confirming that
> Agner Fog's instruction tables are correct. Also, results for older
> processors can be hard to come by sometimes.
>
> Currently, most architectures have a fast LEA, and the default
> "Athlon" option lines up with this. Of the Intel architectures, the
> speed slows down on COREAVX onwards (COREI is fine), so I added a new
> COREX (for 10th generation Core) option between ZEN2 and ZEN3 to mark
> the point where LEA is fast again (its 16-bit version is also fast,
> unlike Zen 3).
>
> In the meantime I'll be looking at the benchmarking code that Stefan
> provided to see if it can and should be integrated.
>
> Thanks again everyone for the results you're giving.
Alright, fine (I modified your test to print the CPU name where
available and added an IFDEF-based distinction between the 32-bit and
64-bit builds):
32-bits:
CPU = AMD A9-9425 RADEON R5, 5 COMPUTE CORES 2C+3G
-----------------------------------------------------
Pascal control case: 0.85 ns/call
Using LEA instruction: 0.56 ns/call
Using ADD instructions: 0.84 ns/call
64-bits:
CPU = AMD A9-9425 RADEON R5, 5 COMPUTE CORES 2C+3G
-----------------------------------------------------
Pascal control case: 0.85 ns/call
Using LEA instruction: 0.56 ns/call
Using ADD instructions: 0.85 ns/call
32-bits:
CPU = AMD Athlon(tm) Processor
------------------------------
Pascal control case: 6.10 ns/call
Using LEA instruction: 3.40 ns/call
Using ADD instructions: 3.40 ns/call
32-bits:
(AMD DX4 100 MHz - no CPUID name)
Pascal control case: 123 ns/call
Using LEA instruction: 72 ns/call
Using ADD instructions: 73 ns/call
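
For anyone comparing these figures, here is a rough sketch of how a
ns/call value can be derived, and of the time-bounded variant that Rika
suggested (quoted below). This is not the actual test program: it is
written in Python purely for illustration, and the names ns_per_call
and time_bounded, the batch size, and the 5-second default are all
assumptions.

```python
import time

# Iterations inside each subroutine, as mentioned in the quoted thread.
INNER_ITERATIONS = 1_000


def ns_per_call(elapsed_seconds, outer_iterations,
                inner_iterations=INNER_ITERATIONS):
    """Average time per pass in nanoseconds for a fixed iteration count."""
    total_passes = outer_iterations * inner_iterations
    return elapsed_seconds * 1e9 / total_passes


def time_bounded(routine, budget_seconds=5.0, batch=1_000):
    """Call routine in batches until the time budget is spent.

    The clock is read once per batch rather than once per call, so the
    timing overhead stays small compared with the work being measured.
    Returns the average time per call in nanoseconds.
    """
    calls = 0
    start = time.perf_counter()
    while True:
        for _ in range(batch):
            routine()
        calls += batch
        elapsed = time.perf_counter() - start
        if elapsed >= budget_seconds:
            return elapsed * 1e9 / calls


if __name__ == "__main__":
    # 10,000 outer iterations x 1,000 inner iterations = 10,000,000 passes.
    print(f"{ns_per_call(0.0056, 10_000):.2f} ns/call")
    # Hypothetical stand-in for the routine under test.
    print(f"{time_bounded(lambda: None, budget_seconds=0.5):.2f} ns/call")
```

With a fixed count the run time scales with the speed of the machine,
which is why a count that suits the A9 is impractical on a 386 or DX4;
the time-bounded form caps the run time instead, at the cost of reading
the clock between batches.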
Tomas
>
> On 10/10/2023 11:54, Tomas Hajny via fpc-devel wrote:
>> On 2023-10-10 12:19, Marco van de Voort via fpc-devel wrote:
>>> Op 10-10-2023 om 11:13 schreef J. Gareth Moreton via fpc-devel:
>>>> Thanks Tomas,
>>>>
>>>> Nothing is broken, but the timing measurement isn't precise enough.
>>>>
>>>> Normally I have a much higher iteration count (e.g. 1,000,000), but
>>>> I had reduced it to 10,000 because the higher count, coupled with
>>>> the 1,000 iterations in the subroutines themselves, would have led
>>>> to 1,000,000,000 passes and hence would take in the region of five
>>>> to ten minutes to complete on a 16 MHz 386, for example. Rika's
>>>> suggestion of running as many iterations as needed until, say, 5
>>>> seconds elapses, would help but the timing measurements would cause
>>>> a lot of latency and will be imprecise on very slow routines.
>>>> Still, let's see if 100,000 gives better results for you.
>>>>
>>> I had the same problem, and now it is stable. Ryzen 5700X (ZEN3):
>>>
>>> Pascal control case: 0.7 ns/call
>>> Using LEA instruction: 0.4 ns/call
>>> Using ADD instructions: 0.7 ns/call
>>
>> Indeed, it's much more consistent now; I've attached a new log for
>> both the 32-bit and 64-bit versions from the Intel machine with Windows.
>> Apparently, ADD is still somewhat faster on such "newer" Intel
>> machines (at least if not considering the potential parallelism of LEA
>> discussed previously). I can try this version on my AMD machines later
>> tonight if considered useful - please, let me know which results would
>> be relevant for you in that case (out of the ancient AMD DX4, only
>> slightly less ancient AMD Athlon 1 GHz and the still rather reasonable
>> AMD A9).
>>
>> Tomas
>>
>> _______________________________________________
>> fpc-devel maillist - fpc-devel at lists.freepascal.org
>> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel