[fpc-devel] LEA instruction speed

J. Gareth Moreton gareth at moreton-family.com
Sat Oct 7 18:09:01 CEST 2023


That's interesting; I would like to see the assembly output for the 
Pascal control cases.  As for the 64-bit version, that was my fault: 
the assembly language was written for Microsoft's ABI rather than the 
System V ABI, so on Linux it was checking a register with an undefined 
value.  Find attached the fixed test.

Kit

P.S. Results on my Intel(R) Core(TM) i7-10750H

    Pascal control case: 2.0 ns/call
  Using LEA instruction: 1.7 ns/call
 Using ADD instructions: 1.3 ns/call

On 07/10/2023 16:51, Tomas Hajny via fpc-devel wrote:
> On 2023-10-07 03:57, J. Gareth Moreton via fpc-devel wrote:
>
>
> Hi Kit,
>
>> Do you think this should suffice? Originally it ran for 1,000,000
>> repetitions but I fear that will take way too long on a 486, so I
>> reduced it to 10,000.
>
> OK, I tried it now. First of all, after turning on the old machine, I 
> realized that it wasn't an Intel but an AMD 486 DX4 - sorry for my bad 
> memory. :-( I compiled and ran the test under OS/2 there (I was too 
> lazy to boot it into DOS ;-) ), but I assume that it shouldn't make any 
> substantial difference. The ADD and LEA results were basically the 
> same there, both around 100 ns / call. The Pascal version took around 
> twice as long. Interestingly, the Pascal version compiled with FPC 3.2.2 
> took around 10% longer than the same source compiled with FPC 2.0.3 (the 
> assembler versions were obviously the same for both FPC versions; I 
> also tried compiling it with FPC 1.0.10, and the assembler versions 
> were more than three times slower due to missing support for the 
> nostackframe directive).
>
> I tested it on the AMD Athlon 1 GHz machine as well and again, the 
> results for LEA and ADD are basically equal (both 3.1 ns/call) and the 
> result for Pascal is slightly more than twice that (7.3 ns/call). 
> However, rather surprisingly for me, the overall test run was _much_ 
> longer there?! Finally, I tried compiling the test on a 64-bit machine 
> (AMD A9-9425) with Linux (compiled for 64 bits with FPC 3.2.3 built 
> from a fresh 3.2 branch). The Pascal version shows about 4 ns/call, but 
> the assembler version runs forever - well, certainly much longer than 
> my patience lasts. I haven't tried to analyze the reasons, but that's 
> what I get.
>
> Tomas
>
>
>
>>
>> On 03/10/2023 06:30, Tomas Hajny via fpc-devel wrote:
>>> On October 3, 2023 03:32:34 +0200, "J. Gareth Moreton via fpc-devel" 
>>> <fpc-devel at lists.freepascal.org> wrote:
>>>
>>>
>>> Hi Kit,
>>>
>>>> This is mainly to Florian, but also to anyone else who can answer 
>>>> the question - at which point did a complex LEA instruction (using 
>>>> all three input operands and some other specific circumstances) get 
>>>> slow? Preliminary research suggests the 486 was when it gained 
>>>> extra latency, and then Sandy Bridge when it got particularly bad.  
>>>> Ice Lake seems to be the architecture where faster LEA instructions 
>>>> are reintroduced, but I'm not sure about AMD processors.
>>> I cannot answer your question, but if you prepare a test program, I 
>>> can run it on an Intel 486 DX2 100 MHz and AMD Athlon 1 GHz machines 
>>> if it helps you in any way (at least I hope the 486 DX2 machine 
>>> should still be able to start ;-) ).
>>>
>>> Tomas
>>>
>>> _______________________________________________
>>> fpc-devel maillist  -  fpc-devel at lists.freepascal.org
>>> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
>>>
-------------- next part --------------
program leatest;
{$MODE OBJFPC}
{$ASMMODE Intel}

uses
  SysUtils;
  
type
  TBenchmarkProc = function(const Input, X, Y: LongWord): LongWord;
 

function Checksum_PAS(const Input, X, Y: LongWord): LongWord;
var
  Counter: LongWord;
begin
  Result := Input;
  Counter := Y;
  while (Counter > 0) do
    begin
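      { Mirror the assembler loops: accumulate X and the constant, then mix in the counter }
      Result := Result + X + $87654321;
      Result := Result xor Counter;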
      Dec(Counter);
    end;
end;

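function Checksum_ADD(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
{ Registers holding Input, X, Y:
    Win64:    ECX, EDX, R8D
    System V: EDI, ESI, EDX
    i386:     EAX, EDX, ECX (register calling convention) }
asm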
@Loop1:
{$ifdef CPUX86_64}
  {$ifdef MSWINDOWS}
  ADD ECX, $87654321
  ADD ECX, EDX
  XOR ECX, R8D
  DEC R8D
  JNZ @Loop1
  MOV EAX, ECX
  {$else MSWINDOWS}
  ADD EDI, $87654321
  ADD EDI, ESI
  XOR EDI, EDX
  DEC EDX
  JNZ @Loop1
  MOV EAX, EDI
  {$endif MSWINDOWS}
{$else CPUX86_64}
  ADD EAX, $87654321
  ADD EAX, EDX
  XOR EAX, ECX
  DEC ECX
  JNZ @Loop1
{$endif CPUX86_64}
end;

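function Checksum_LEA(const Input, X, Y: LongWord): LongWord; assembler; nostackframe;
{ Same loop as Checksum_ADD, but the two additions are folded into a
  single LEA using base + index + displacement }
asm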
@Loop2:
{$ifdef CPUX86_64}
  {$ifdef MSWINDOWS}
  LEA ECX, [ECX + EDX + $87654321]
  XOR ECX, R8D
  DEC R8D
  JNZ @Loop2
  MOV EAX, ECX
  {$else MSWINDOWS}
  LEA EDI, [EDI + ESI + $87654321]
  XOR EDI, EDX
  DEC EDX
  JNZ @Loop2
  MOV EAX, EDI
  {$endif MSWINDOWS}
{$else CPUX86_64}
  LEA EAX, [EAX + EDX + $87654321]
  XOR EAX, ECX
  DEC ECX
  JNZ @Loop2
{$endif CPUX86_64}
end;

function Benchmark(const name: string; proc: TBenchmarkProc; Z, X: LongWord): LongWord;
const
  internal_reps = 1000;
var
  start: TDateTime;
  time: double;
  reps: cardinal;
begin
  Result := Z;
  reps := 0;
  start := Now;
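  repeat
    inc(reps);
    { Chain the checksum so the final value can be compared between implementations }
    Result := proc(Result, X, internal_reps);
  until (reps >= 10000);
  { Read the clock once, after the loop, so calls to Now are not part of the timed work }
  time := (Now - start) * SecsPerDay;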
  time := time / reps / internal_reps * 1e9;
  writeln(name, ': ', time:0:ord(time < 10), ' ns/call');
end;

var
  Results: array[0..2] of LongWord;
  FailureCode: Integer;
begin
  Results[0] := Benchmark('   Pascal control case', @Checksum_PAS, 5000000, 1000);
  Results[1] := Benchmark(' Using LEA instruction', @Checksum_LEA, 5000000, 1000);
  Results[2] := Benchmark('Using ADD instructions', @Checksum_ADD, 5000000, 1000);
  
  FailureCode := 0;

  if (Results[0] <> Results[1]) then
    begin
      WriteLn('ERROR: Checksum_LEA doesn''t match control case');
      FailureCode := FailureCode or 1;
    end;
  if (Results[0] <> Results[2]) then
    begin
      WriteLn('ERROR: Checksum_ADD doesn''t match control case');
      FailureCode := FailureCode or 2
    end;
    
  if FailureCode <> 0 then
    Halt(FailureCode);
end.

