[fpc-pascal] FPC Graphics options?
Jonas Maebe
jonas at freepascal.org
Sat May 20 21:34:34 CEST 2017
On 19/05/17 02:54, Ryan Joseph wrote:
>> On May 18, 2017, at 10:40 PM, Jon Foster<jon-lists at jfpossibilities.com> wrote:
>>
>> 62.44 1.33 1.33 fpc_frac_real
>> 26.76 1.90 0.57 MATH_$$_FLOOR$EXTENDED$$LONGINT
>> 10.33 2.12 0.22 FPC_DIV_INT64
> Thanks for profiling this.
>
> Floor is there as I expected and 26% is pretty extreme but the others are floating point division?
> How does Java handle this so much better than FPC and what are the work arounds?
The Pascal test program that was benchmarked here contains a number of
bugs/wrong translations from the C code (some stem from the original
version, another one was added):
1) casting a floating point number to an int in C does not round, but
truncates (I think this may have been mentioned earlier in the thread, I
didn't read everything)
2) The usage of floor in the test program is wrong. C's floor takes a
floating point number and returns one. The math unit's floor function
takes a floating point number and returns an integer. In the Pascal
version, this integer is then converted back to a floating point number
because the rest of that expression also uses floating point.
3) The Pascal version uses longword instead of int32 for a number of
variables (that are "int" in the C version). This results in one
expression getting evaluated as 64 bit on 32 bit systems, which is where
the FPC_DIV_INT64 calls come from (that's a routine to perform 64 bit
*integer* divisions on 32 bit platforms)
4) frac() is only used to get a monotonically increasing value as part of
the data input for the test program. The C code (and original Pascal
version) uses a tick count and multiplies/divides that, which is much
faster.
Then, there's one thing that can be done to optimize the Pascal version
(after removing the bugs above):
1) Compile with SSE3 or higher, in particular because SSE3 can be used
to implement trunc() with a single instruction (otherwise we pass via a
helper that uses the x87 fpu, which moreover has to reconfigure it to
change the rounding mode and restore it afterwards). However, there does
seem to be a bug in FPC 3.0.2 whereby compiling this program for -O2
-Cfsse3 causes it to crash, because then it loads data from an 8-byte
aligned location on the stack. It works fine when compiled with trunk
and -O2 -Cfsse3 though (at least for 64 bit).
There's at least one minor twist of the classic "C compiler evaluates
constant stuff at compile time":
1) oy and oz are constant. The "floor" function is a standard C library
function, and hence C compilers know what it does and can evaluate it at
compile time. Therefore, the oy-floor(oy) and oz-floor(oz) expressions
are (equal) constants for C compilers.
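
A rough sketch of what that looks like in C follows. The values of oy and
oz are hypothetical (the actual constants from the test program are not
repeated in this mail); they are chosen only so that both fractional parts
come out equal.

```c
#include <math.h>

/* Hypothetical values: the real oy/oz constants are not shown in this mail. */
static const double oy = 1.5;
static const double oz = 7.5;

/* Because floor() is a standard library function with known behaviour,
   an optimizing C compiler is allowed to evaluate these expressions at
   compile time, so no floor call needs to remain in the generated code. */
double frac_oy(void) {
    return oy - floor(oy);
}

double frac_oz(void) {
    return oz - floor(oz);
}
```

Whether a given compiler actually folds these depends on the compiler and
the optimization level. In a language implementation that does not fold
them, the same effect can be had by computing the two values once, outside
the hot loop.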
Finally, there are two things FPC definitely is missing:
1) an SSE version of the int() function (which is the basis of a
floating point version of floor()) (fairly specific to this program)
2) SSA support in loops (to make better use of SSE registers; related to
Florian's note about the calling conventions). However, without the
previous changes, even FPC code compiled to LLVM IR and then compiled to
machine code with Clang (and hence with full SSA support) results in
even worse performance than the code directly compiled with FPC.
There are definitely more things (as I did not manage to get FPC's LLVM
IR to compile to a version that's equally fast as the LLVM IR generated
from the C program), but I already spent more time than is reasonable on
this. I hope the "the sky is falling" comments will stop though.
In summary, as has been mentioned by several people in this thread: you
(not directed at you personally, Ryan) always have to check where
your program's slowness comes from, otherwise your test/benchmark is
worse than useless (because it just creates confusion, and wastes other
people's time when they get tired of the mailing list getting flooded by the
same information-less statements over and over again).
Also in summary, very little was learned from this. We have known for a
long time that FPC needs SSA for better code generation for loops (and
Florian has been working on it for a long time too).
Jonas