[fpc-pascal] FPC Graphics options?

Sat May 20 21:34:34 CEST 2017

On 19/05/17 02:54, Ryan Joseph wrote:
>> On May 18, 2017, at 10:40 PM, Jon Foster<jon-lists at jfpossibilities.com>  wrote:
>>
>> 62.44      1.33     1.33                             fpc_frac_real
>> 26.76      1.90     0.57                             MATH_$$_FLOOR$EXTENDED$$LONGINT
>> 10.33      2.12     0.22                             FPC_DIV_INT64
> Thanks for profiling this.
> 
> Floor is there as I expected and 26% is pretty extreme but the others are floating point division?
> How does Java handle this so much better than FPC and what are the work arounds?
The Pascal test program that was benchmarked here contains a number of 
bugs/wrong translations from the C code (some stem from the original 
version, another one was added):
1) casting a floating point number to an int in C does not round, but 
truncates (I think this may have been mentioned earlier in the thread, I 
didn't read everything)
2) The usage of floor in the test program is wrong. C's floor takes a 
floating point number and returns one. The math unit's floor function 
takes a floating point number and returns an integer. In the Pascal 
version, this integer is then converted back to a floating point number 
because the rest of that expression also uses floating point.
3) The Pascal version uses longword instead of int32 for a number of 
variables (that are "int" in the C version). This results in one 
expression getting evaluated as 64 bit on 32 bit systems, which is where 
the FPC_DIV_INT64 calls come from (that's a routine to perform 64 bit 
*integer* divisions on 32 bit platforms)
4) frac() is only used to get a monotonically increasing value as part of 
the data input for the test program. The C code (and original Pascal 
version) uses a tick count and multiplies/divides that, which is much 
faster.
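
Points 1) to 3) can be illustrated with a small standalone program. This 
is only a sketch and not the benchmark itself: the program name, the 
variable names and the FloorF helper are made up for illustration.

```pascal
program TranslationPitfalls;
{$mode objfpc}

uses
  Math;

{ Hypothetical helper: a floor() that stays in floating point, like the C one. }
function FloorF(x: Double): Double;
begin
  Result := Int(x);          { Int() truncates toward zero and returns a float }
  if Result > x then
    Result := Result - 1.0;  { step down for negative non-integers }
end;

var
  x: Double;
  u: LongWord;
  s: LongInt;
begin
  { 1) A C cast to int truncates: Trunc matches it, Round does not. }
  x := 2.7;
  WriteLn(Trunc(x));   { 2 }
  WriteLn(Round(x));   { 3 }

  { 2) Math.Floor returns an integer; FloorF keeps the value in floating point. }
  x := -2.5;
  WriteLn(Floor(x));        { -3, as an integer }
  WriteLn(FloorF(x):0:1);   { -3.0, no integer round trip }

  { 3) Mixing LongWord and LongInt makes FPC evaluate the expression as
       Int64, which on 32 bit targets needs helpers for operations
       such as division. }
  u := 1;
  s := -2;
  WriteLn(u + s);
end.
```

Declaring the variables that are "int" in the C code as LongInt (or 
int32) keeps such expressions in 32 bit arithmetic.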

Then, there's one thing that can be done to optimize the Pascal version 
(after removing the bugs above):
1) Compile with SSE3 or higher, in particular because SSE3 can be used 
to implement trunc() with a single instruction (otherwise we pass via a 
helper that uses the x87 fpu, which moreover has to reconfigure it to 
change the rounding mode and restore it afterwards). However, there does 
seem to be a bug in FPC 3.0.2 whereby compiling this program with -O2 
-Cfsse3 causes it to crash, because it then loads data from a location 
on the stack that is only 8-byte aligned. It works fine when compiled 
with trunk and -O2 -Cfsse3 though (at least for 64 bit).

There's at least one minor twist on the classic "C compiler evaluates 
constant stuff at compile time":
1) oy and oz are constant. The "floor" function is a standard C library 
function, and hence C compilers know what it does and can evaluate it at 
compile time. Therefore, the oy-floor(oy) and oz-floor(oz) expressions 
are (equal) constants for C compilers.
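
On the Pascal side, the same effect can be had by hand by computing the 
fractional part once outside the loop. A minimal sketch, assuming a 
made-up value for oy and a made-up loop body (neither is taken from the 
test program):

```pascal
program HoistConstant;
{$mode objfpc}

uses
  Math;

const
  oy: Double = 1.7;   { hypothetical value, for illustration only }

var
  fy, sum: Double;
  i: Integer;
begin
  { Computed once here instead of on every iteration of the loop. }
  fy := oy - Floor(oy);

  sum := 0.0;
  for i := 1 to 1000 do
    sum := sum + fy * i;   { placeholder loop body }

  WriteLn(sum:0:3);
end.
```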

Finally, there are two things FPC definitely is missing:
1) an SSE version of the int() function (which is the basis of a 
floating point version of floor()) (fairly specific to this program)
2) SSA support in loops (to make better use of SSE registers; related to 
Florian's note about the calling conventions). However, without the 
previous changes, even FPC code compiled to LLVM IR and then compiled to 
machine code with Clang (and hence with full SSA support) results in 
even worse performance than the code directly compiled with FPC.

There are definitely more things (as I did not manage to get FPC's LLVM 
IR to compile to a version that's as fast as the LLVM IR generated 
from the C program), but I already spent more time than is reasonable on 
this. I hope the "the sky is falling" comments will stop though.

In summary, as has been mentioned by several people in this thread: you 
(not directed at you personally, Ryan) always have to check where 
your program's slowness comes from, otherwise your test/benchmark is 
worse than useless (because it just creates confusion, and wastes other 
people's time when they get tired of the mailing list getting flooded by the 
same information-less statements over and over again).

Also in summary, very little was learned from this. We have known for a 
long time that FPC needs SSA for better code generation for loops (and 
Florian has been working on it for a long time too).

Jonas