[fpc-pascal] Floating Point Performance on Intel

Mon Mar 28 17:19:19 CEST 2005

Hi All,

My name is Peter Dove, and I am new to FPC and Lazarus. I come from a
mainly Delphi background, but I use C, C++ and assembler as needed to
improve performance on the imaging app we are working on.

Like Delphi, FPC optimises floating point code poorly in comparison
to C compilers building equivalent code. For instance, the following
code in Pascal

     A := 0;
     B := 0.9;
     For X := 0 to 10000000 do
     begin
          A := A + X;
          A := A * B;
     end;

takes some 220ms to run. The major problems are poor loop optimisation
and register usage, along with wasted loads from and stores to memory.
Below is the assembler output from FPC, with all optimisations
enabled.

# Var A located at ebp-4
# Var B located at ebp-8
# Var X located at ebp-12

// A and B are set up before here - it's the loop that's interesting

# [44] For X := 0 to 10000000 do
        movl    $0,-12(%ebp)
        decl    -12(%ebp)
        .balign 4
.L31:
        incl    -12(%ebp)
# [46] A := A + X;
        flds    -4(%ebp)
        fildl   -12(%ebp)
        faddp   %st,%st(1)
        fstps   -4(%ebp)
# [47] A := A * B;
        flds    -8(%ebp)
        fmuls   -4(%ebp)
        fstps   -4(%ebp)
        cmpl    $10000000,-12(%ebp)
        jl      .L31

My comments on this are that

a) The loop counter is incremented and compared in memory rather than
in a register = slow
b) There are some unnecessary loads from memory occurring = slow

The above code takes about 210ms to run on my machine. Below is my
own assembler, which takes about 100ms (apologies, it is in a slightly
different format).

asm
   mov eax, 0; //Set up loop counter
   @StartOfLoop:
   mov dword ptr[x], eax; // Move its value into X ( on stack )
   FILD dword ptr[x]; //Load into floating point
   FADD dword ptr[A]; // Add A ( on Stack ) to it
   FMUL dword ptr[B]; //Multiply by B ( on Stack )
   FSTP dword ptr[A]; // Pop into A
   add eax, 1; //Inc loop counter
   cmp eax, 10000000; // Test Jump condition
   jle @StartOfLoop; // jle so X = 10000000 is included, as in the For loop
end;

My question is: what needs to be done to the compiler to make it
optimise as well as C compilers? Or am I perhaps missing some compiler
switches?
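
In case anyone wants to reproduce the Pascal figure, below is a minimal
sketch of a test program for the loop. Some of it is assumption rather
than the exact code measured above: A and B are declared as Single and X
as Longint (inferred from the flds/fstps and fildl instructions in the
listing), the procedure name RunLoop is made up for illustration, and
the timing uses Now from SysUtils, which is coarse and may not match how
the numbers above were taken.

```pascal
program FloatLoop;

{$mode objfpc}

uses
  SysUtils;

// Hypothetical wrapper so that A, B and X are locals on the stack,
// as they are in the assembler listing.
function RunLoop: Single;
var
  A, B: Single;   // assumed Single: the listing uses flds/fstps
  X: Longint;     // assumed Longint: the listing uses fildl
begin
  A := 0;
  B := 0.9;
  for X := 0 to 10000000 do
  begin
    A := A + X;
    A := A * B;
  end;
  Result := A;
end;

var
  T0, T1: TDateTime;
  R: Single;
begin
  T0 := Now;
  R := RunLoop;
  T1 := Now;
  // Print the result so the loop's work is actually used.
  WriteLn('A = ', R:0:2);
  // TDateTime is measured in days; convert the difference to milliseconds.
  WriteLn('Elapsed: ', Round((T1 - T0) * MSecsPerDay), ' ms');
end.
```

Since Now has limited resolution on some platforms, running the loop
several times and averaging would probably give a steadier figure.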