[fpc-pascal] FPC Graphics options?

Nikolay Nikolov nickysn at gmail.com
Tue May 23 02:03:40 CEST 2017



On 05/23/2017 01:20 AM, noreply at z505.com wrote:
> On 2017-05-18 19:54, Ryan Joseph wrote:
>>> On May 18, 2017, at 10:40 PM, Jon Foster 
>>> <jon-lists at jfpossibilities.com> wrote:
>>>
>>> 62.44      1.33     1.33 fpc_frac_real
>>> 26.76      1.90     0.57 MATH_$$_FLOOR$EXTENDED$$LONGINT
>>> 10.33      2.12     0.22 FPC_DIV_INT64
>>
>> Thanks for profiling this.
>>
>> Floor is there as I expected, and 26% is pretty extreme, but the others
>> are floating point division? How does Java handle this so much better
>> than FPC, and what are the workarounds? Just curious. As it stands, I
>> can only reason that I need to avoid dividing floats in FPC like the
>> plague.
>>
>
> Isn't java just a wrapper around C?
No. Java compilers generate code for a virtual machine, the JVM (Java 
Virtual Machine). They do not generate code for x86 CPUs or any other 
real CPU. The JVM is like a fictional CPU that does not exist in any 
silicon implementation but is implemented in software only.

C compilers usually generate native code for real CPUs, just like FPC does.

Why does it matter? The x86 instruction set architecture has gone 
through quite a long evolution, and many instruction set extensions 
were added along the way: the 32-bit extensions (x86 originally started 
as 16-bit), the x87 FPU instructions (this was a separate coprocessor 
in the beginning, but became integrated into the main CPU from the 
486DX onward), MMX, SSE, SSE2, the 64-bit extensions (x86_64), SSE3, 
AVX, etc.

There are generally two ways to do floating point on the x86:
   - the x87 FPU - this is used by default by the FPC compiler on 32-bit 
(and 16-bit) x86
   - the SSE2 instruction set extension - this can replace the FPU and 
generally works faster on modern CPUs. This is used by default by the 
64-bit FPC compiler. That's because all 64-bit x86 CPUs support this 
extension.

There is one disadvantage to using SSE2 instead of the x87 FPU: the 
SSE2 instructions don't support the 80-bit extended precision float 
type, and neither does any of the later x86 instruction set extensions. 
If you need 80-bit precision, the x87 FPU is the only way to go, even 
on x86_64.
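A quick way to see which precision you are actually getting is to print 
the sizes of the floating point types. This is just an illustrative 
sketch; note that on targets where FPC has no x87 support (e.g. Win64), 
Extended is simply an alias for Double:

```pascal
program floatsizes;
{$mode objfpc}
begin
  { Single and Double are 4- and 8-byte IEEE 754 types on every target }
  writeln('Single:   ', SizeOf(Single), ' bytes');
  writeln('Double:   ', SizeOf(Double), ' bytes');
  { 10 bytes where the x87 FPU is used (e.g. 32-bit x86);
    an alias for Double (8 bytes) on e.g. Win64, which is SSE2-only }
  writeln('Extended: ', SizeOf(Extended), ' bytes');
end.
```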

There's another disadvantage to using SSE2 by default on 32-bit x86: 
programs compiled for SSE2 will not run on older CPUs that don't 
support SSE2. There's simply no way around that. Therefore, we cannot 
use SSE2 by default without sacrificing backwards compatibility. The 
only exception are certain RTL routines, like Move() or FillChar(), 
which do take advantage of the SSE2 extensions: they check the CPU 
capabilities at runtime and internally dispatch to several different 
implementations, for different CPU types, which are all compiled and 
linked in. But you simply cannot take this approach for every FPU 
operation, because if you do a CPU check on every floating point 
calculation, the overhead of all the checks will make your program 
slower than it would be if you simply used the x87 FPU instructions.
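The runtime dispatch idea can be sketched like this. This is not the 
actual RTL code, only an illustration: a procedure variable is bound 
once, at startup, after a capability check (stubbed out here), so each 
call afterwards pays only for a single indirect call:

```pascal
program dispatchsketch;
{$mode objfpc}
{$assertions on}

type
  { same shape as a fill routine: destination, length, fill value }
  TFillProc = procedure(var Dest; Count: SizeInt; Value: Byte);

procedure FillGeneric(var Dest; Count: SizeInt; Value: Byte);
var
  P: PByte;
  I: SizeInt;
begin
  P := @Dest;
  for I := 1 to Count do
  begin
    P^ := Value;
    Inc(P);
  end;
end;

procedure FillSSE2(var Dest; Count: SizeInt; Value: Byte);
begin
  { a real implementation would use SSE2 instructions here;
    this sketch just delegates, so it stays portable }
  FillGeneric(Dest, Count, Value);
end;

function CPUHasSSE2: Boolean;
begin
  { placeholder: the RTL inspects CPUID instead }
  Result := True;
end;

var
  MyFill: TFillProc;  { the capability check runs once, not per call }
  Buf: array[0..7] of Byte;
  I: Integer;
begin
  if CPUHasSSE2 then
    MyFill := @FillSSE2
  else
    MyFill := @FillGeneric;

  MyFill(Buf, SizeOf(Buf), $FF);
  for I := Low(Buf) to High(Buf) do
    Assert(Buf[I] = $FF);
  writeln('ok');
end.
```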

Virtual machines like the JVM don't have this problem: they can always 
take advantage of newer instruction set extensions without sacrificing 
backward compatibility with older CPUs. Why? Because the JVM bytecode 
has nothing to do with any physical processor at all. When you run your 
program, the JVM bytecode is converted ("just-in-time" compiled) to 
native code for the CPU the user has. So, if the user runs your Java 
program on a CPU that has SSE3, the JIT compiler knows it can use SSE2 
and SSE3 instructions. If another person runs it on an older CPU 
without SSE2, the JIT compiler compiles it to use x87 FPU instructions. 
Sounds great, so are there any disadvantages to this approach? Of 
course there are: since the program is essentially recompiled every 
time the user runs it, Java programs take a long time to start. There 
is also a limit on how much time the JIT compiler can spend on 
optimization (otherwise programs would start even slower). There are 
ways to combat that, by using some sort of cache (.NET has the global 
assembly cache), but they are far from perfect either: these caches eat 
a lot of disk space, and then either program installation or the first 
run (when the JIT-compiled assembly hasn't been added to the cache yet) 
becomes slow. In general, native programs (FPC and C programs) feel a 
lot snappier to most users, because they start fast. But in the highly 
specific case of heavy floating point code (where SSE2 vs. x87 FPU 
instruction sets matter), a native program (C or Pascal) compiled for 
the x87 FPU will be slower than the JVM, because the JVM will use SSE2 
and SSE3 on modern CPUs.

Does this mean that it's always better to use the JVM? No. I mean, if it 
suits you, go ahead and use it, there's nothing wrong with it (even FPC 
supports it as a target: http://wiki.freepascal.org/FPC_JVM ), but there 
are a lot of options for using native code as well:
- if SSE2 and SSE3 make a huge performance difference for your program, 
and you don't need to support old CPUs (e.g. your users are fine with 
that, or your program would be too slow to be usable on those CPUs 
anyway, since it needs a lot of CPU performance), then enable {$fputype 
sse3} and probably recompile the RTL with it, to take full advantage of 
it.
- if SSE2 and SSE3 (or AVX, or whatever newer extension) make a huge 
performance difference, but old CPU support is still valuable to your 
users, then compile and provide two executables: one for old CPUs and 
one for new ones.
- if SSE2 and SSE3 don't make a difference, then you're not writing 
floating point heavy code and you're happy with the default settings :) 
Compatibility with older CPUs is just a bonus in this case and isn't 
hurting your performance on new CPUs.
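The first option boils down to one directive. A minimal sketch (the 
{$fputype} directive is real FPC syntax; whether SSE actually pays off 
for your code is something you have to measure):

```pascal
program ssedemo;
{ compile-time selection of the FPU instruction set; ideally the RTL
  is rebuilt with the same setting, as suggested above }
{$if defined(CPUI386) or defined(CPUX86_64)}
  {$fputype sse3}   { the binary now requires an SSE3-capable CPU }
{$endif}
var
  A, B: Double;
begin
  A := 10.0;
  B := 4.0;
  { with $fputype sse3 active, this division compiles to SSE,
    not x87, instructions }
  writeln(A / B : 0 : 2);  { prints: 2.50 }
end.
```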

And, of course, it is easy to give examples where a Java program would 
be a lot slower than an FPC program. I know comparing different IDEs is 
a bit of an apples-to-oranges comparison (because they may have 
different features and vastly different implementation details), but 
compare the speed of e.g. Lazarus to any IDE written in Java, even the 
fastest one. :)

Anyhow, enough ranting, already. Just remember the golden rule of 
optimization: never assume.

Always measure and try to understand why something is slow. In 99% of 
the cases it's not what people initially think.

Nikolay


