[fpc-devel] LEA instruction speed

Sat Oct 7 20:03:20 CEST 2023

On 2023-10-07 18:09, J. Gareth Moreton via fpc-devel wrote:
> That's interesting; I am interested to see the assembly output for the
> Pascal control cases.  As for the 64-bit version, that was my fault
> since the assembly language is for Microsoft's ABI rather than the
> System V ABI, so it was checking a register with an undefined value. 
> Find attached the fixed test.
> 
> Kit
> 
> P.S. Results on my Intel(R) Core(TM) i7-10750H
> 
>    Pascal control case: 2.0 ns/call
>  Using LEA instruction: 1.7 ns/call
> Using ADD instructions: 1.3 ns/call

OK. My results for the AMD A9 CPU mentioned previously, with a 32-bit 
trunk compiler (Linux), are:

    Pascal control case: 2.3 ns/call
  Using LEA instruction: 1.2 ns/call
 Using ADD instructions: 1.5 ns/call

The same machine, the same operating environment, but a 64-bit trunk 
compiler:

    Pascal control case: 3.6 ns/call
  Using LEA instruction: 0.9 ns/call
 Using ADD instructions: 1.3 ns/call
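
To put the three result sets quoted above side by side, here is a small 
illustrative sketch in Python. It is not part of the test program; the 
names RESULTS, add_over_lea and summarize are made up for this example, 
and the only data used are the ns/call figures reported in this message.

```python
# ns/call figures copied from the results quoted in this message.
# The dictionary keys and field names are hypothetical labels.
RESULTS = {
    "Intel i7-10750H": {"pascal": 2.0, "lea": 1.7, "add": 1.3},
    "AMD A9, 32-bit trunk": {"pascal": 2.3, "lea": 1.2, "add": 1.5},
    "AMD A9, 64-bit trunk": {"pascal": 3.6, "lea": 0.9, "add": 1.3},
}


def add_over_lea(row):
    """Time of the ADD variant divided by the LEA variant.

    A value above 1 means the LEA variant was the faster one.
    """
    return row["add"] / row["lea"]


def summarize(results):
    """Return one summary line per machine/compiler combination."""
    lines = []
    for name, row in results.items():
        ratio = add_over_lea(row)
        winner = "LEA" if ratio > 1 else "ADD"
        lines.append(f"{name}: ADD/LEA = {ratio:.2f} ({winner} faster)")
    return lines


if __name__ == "__main__":
    print("\n".join(summarize(RESULTS)))
```

On these figures the Intel run favours ADD while both AMD runs favour 
LEA, but with differences of a few tenths of a nanosecond the ratios 
should be read as rough indications only.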

I tried compiling and running the test with FPC 2.0.4, 2.2.4, 2.4.4, 
2.6.4, 3.0.4 and 3.2.2 on my Athlon machine and found that all results 
(for both the assembler and Pascal versions) compiled with anything 
older than 3.2.2 are an order of magnitude faster than with 3.2.2 (i.e. 
less than 1 ns/call for the older versions, compared to 8 ns/call for 
the Pascal version and 4 ns/call for the assembler versions with 3.2.2). 
This means that the comparison is obviously spoiled by something 
unrelated. Moreover, I noticed that when compiling with the highest 
optimization level, the Pascal version compiled for i386 is as fast as, 
or even a little bit faster than, the assembler version. I didn't do 
that previously, so the longer time for the older compiler version 
probably isn't relevant. From this point of view, it probably doesn't 
make sense to spend time on comparing the generated code.

Tomas

> 
> On 07/10/2023 16:51, Tomas Hajny via fpc-devel wrote:
>> On 2023-10-07 03:57, J. Gareth Moreton via fpc-devel wrote:
>> 
>> 
>> Hi Kit,
>> 
>>> Do you think this should suffice? Originally it ran for 1,000,000
>>> repetitions but I fear that will take way too long on a 486, so I
>>> reduced it to 10,000.
>> 
>> OK, I tried it now. First of all, after turning on the old machine, I 
>> realized that it wasn't an Intel but an AMD 486 DX4 - sorry for my bad 
>> memory. :-( I compiled and ran the test under OS/2 there (I was too 
>> lazy to boot it to DOS ;-) ), but I assume that it shouldn't make any 
>> substantial difference. The ADD and LEA results were basically the 
>> same there, both around 100 ns/call. The Pascal result was around 
>> twice as long. Interestingly, the Pascal result for FPC 3.2.2 was 
>> around 10% longer than the same source compiled with FPC 2.0.3 (the 
>> assembler versions were obviously the same for both FPC versions; I 
>> also tried compiling it with FPC 1.0.10, and the assembler versions 
>> were more than three times slower due to missing support for the 
>> nostackframe directive).
>> 
>> I tested it on the AMD Athlon 1 GHz machine as well and again, the 
>> results for LEA and ADD are basically equal (both 3.1 ns/call) and the 
>> result for Pascal is slightly more than twice that (7.3 ns/call). 
>> However, rather surprisingly for me, the overall test run was _much_ 
>> longer there?! Finally, I tried compiling the test on a 64-bit machine 
>> (AMD A9-9425) with Linux (compiled for 64-bit with FPC 3.2.3 built 
>> from a fresh 3.2 branch). The Pascal version shows about 4 ns/call, 
>> but the assembler version runs forever - well, certainly much longer 
>> than my patience lasts. I haven't tried to analyze the reasons, but 
>> that's what I get.
>> 
>> Tomas
>> 
>> 
>> 
>>> 
>>> On 03/10/2023 06:30, Tomas Hajny via fpc-devel wrote:
>>>> On October 3, 2023 03:32:34 +0200, "J. Gareth Moreton via fpc-devel" 
>>>> <fpc-devel at lists.freepascal.org> wrote:
>>>> 
>>>> 
>>>> Hi Kit,
>>>> 
>>>>> This is mainly to Florian, but also to anyone else who can answer 
>>>>> the question - at which point did a complex LEA instruction (using 
>>>>> all three input operands and some other specific circumstances) get 
>>>>> slow? Preliminary research suggests the 486 was when it gained 
>>>>> extra latency, and then Sandy Bridge when it got particularly bad.  
>>>>> Ice Lake seems to be the architecture where faster LEA instructions 
>>>>> are reintroduced, but I'm not sure about AMD processors.
>>>> I cannot answer your question, but if you prepare a test program, I 
>>>> can run it on an Intel 486 DX2 100 MHz and AMD Athlon 1 GHz machines 
>>>> if it helps you in any way (at least I hope the 486 DX2 machine 
>>>> should be still able to start ;-) ).
>>>> 
>>>> Tomas
>>>> 
>>>> _______________________________________________
>>>> fpc-devel maillist  -  fpc-devel at lists.freepascal.org
>>>> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
>>>> 