[fpc-devel] Future development plans
J. Gareth Moreton
gareth at moreton-family.com
Tue Apr 21 20:29:20 CEST 2020
Hi everyone,
I hope this doesn't become a monthly podcast for me or something, but
during my bursts of motivation, inspiration and creativity, I start to
plan and research things. There are a few things I'd like to develop
for FPC, mostly together because there's a lot of interdependency
between them.
* SSE/AVX intrinsics
Most of the node types for the SSE instructions have been implemented,
as well as some wrapper functions that are disabled by default while
their format is finalised. The nodes that the compiler generates would
be useful when it comes to vectorisation, since a lot of things like
parameters and type checks will be already handled by them. There are
some gaps, though. For example, AVX introduced more powerful 'mask move'
instructions that allow you to read as well as write partial vectors,
which would be very useful when it comes to, say, optimising algorithms
that deal with 3-component vectors (very common, because a 3-component
vector can represent a 3D Cartesian coordinate or an RGB triplet, for
example).
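To make the mask-move behaviour concrete, here is a minimal scalar sketch in C of what an AVX masked load (VMASKMOVPS) does per lane. The function name is my own, and this models the instruction rather than invoking it:

```c
#include <stdint.h>

/* Scalar model of AVX's VMASKMOVPS load behaviour: each of the four
   lanes is loaded only if the sign bit of the corresponding mask lane
   is set; masked-off lanes are zeroed and their memory is not read. */
static void maskmove_load_ps(float dst[4], const float src[4],
                             const int32_t mask[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = (mask[i] < 0) ? src[i] : 0.0f; /* sign bit selects */
}
```

With a mask of {-1, -1, -1, 0}, this reads a 3-component vector without touching the element behind the 4th lane, which is exactly the partial-vector access described above.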
* Vectorisation
I think this is probably the next big iteration for the compiler and
optimiser. Besides the obvious loop unrolling vectorisation, there are
a number of common algorithms that are logically easy to vectorise but
which may take some careful analysis to actually detect. One of my test
cases is the classic dot product. In raybench.pas, a 3-dimensional dot
product appears as part of a function that returns a vector's length:
Sqrt(V.X*V.X + V.Y*V.Y + V.Z*V.Z). Under AVX, the expression inside the
square root can be optimised into a mask move (so only the first 3
components of an XMM register are loaded with the fields of V and the
4th component is set to zero), after which all the additions and
multiplications are performed with a single instruction: VDPPS XMM0,
XMM0, XMM0, $71. (The $71 immediate specifically says 'only multiply
and horizontally add the first three components, then store the result
only in the 1st component'. $FF will still work, since the 4th
component is zero and only the 1st component is read for the result,
but it is a little clumsier in my opinion.)
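For illustration, the immediate encoding of DPPS can be modelled in scalar C like this (the function name is mine and this is a model, not an actual intrinsic): the high nibble of imm8 selects which lanes enter the multiply-and-add, and the low nibble selects which destination lanes receive the sum.

```c
#include <stdint.h>

/* Scalar model of SSE4.1 DPPS: bits 4-7 of imm8 choose the lanes that
   are multiplied and summed; bits 0-3 choose the destination lanes
   that receive that sum (all other lanes are zeroed). */
static void dpps_model(float dst[4], const float a[4], const float b[4],
                       uint8_t imm8)
{
    float sum = 0.0f;
    for (int i = 0; i < 4; i++)
        if (imm8 & (0x10u << i))
            sum += a[i] * b[i];
    for (int i = 0; i < 4; i++)
        dst[i] = (imm8 & (1u << i)) ? sum : 0.0f;
}
```

With imm8 = $71 and both operands set to V, lane 0 ends up holding V.X*V.X + V.Y*V.Y + V.Z*V.Z, matching the VDPPS usage above.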
My intention, at least for these kinds of algorithms, is to make use of
the new intrinsic nodes for specific SSE and AVX instructions, although
there are some intrinsics missing, like the aforementioned mask move.
* Pure functions
It might be overly ambitious, but I seek to make the SSE/AVX intrinsics
much easier to use (they easily become inefficient in C++ if you haven't
got data alignments correct). One example I came up with is using masks
in SSE/AVX instructions. If you want to call, say, x86_vmaskmovps (an
intrinsic for VMASKMOVPS), you would have to set up an additional _m128
store and load in a custom-made mask (e.g. const M128Mask: _m128 =
(-1.0; -1.0; -1.0; 0.0); ... x86_vmaskmovps(DestAddr, M128Data,
x86_movaps(M128Mask));). This becomes more problematic if you need to
specifically represent $80000000 or $FFFFFFFF in one of the
floating-point fields (the former is negative zero, and the latter is
one of many thousands of quiet NaN representations). An example of a
much cleaner solution could be x86_vmaskmovps(DestAddr, M128Data,
[True, True, True, False]);, with an explicit typecast/assignment
operator that converts an array of Booleans into a mask, defined and
implemented somewhere in the RTL. Normally, this would be a
prohibitively slow function to execute, but if the typecast/assignment
operator were defined as a pure function, then it could be evaluated at
compile time and the resultant _m128 stored as an implicit constant
that is loaded directly into an MM register when needed, without
tasking the programmer with floating-point bit manipulation in order to
create said constant in the code.
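The conversion itself could look something like this scalar C sketch (the function name and mapping are mine): True becomes $FFFFFFFF, i.e. -1, which has the sign bit set that the mask-move instructions test, and False becomes zero. The point of making it a pure function is that the compiler could fold the whole call into a ready-made _m128 constant at compile time.

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of the Boolean-array-to-mask conversion: True maps to
   all-ones ($FFFFFFFF, sign bit set), False to all-zero bits.  As a
   pure function, the compiler could evaluate this once at compile
   time and emit the result as a 128-bit constant. */
static void bools_to_mask(int32_t mask[4], const bool lanes[4])
{
    for (int i = 0; i < 4; i++)
        mask[i] = lanes[i] ? -1 : 0;
}
```
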
* Aligned Allocation
This couples with SSE and AVX specifically, but has other uses such as
with paging, for example. Following in the footsteps of C11, I would
like to propose a couple of new intrinsic operations: GetMemAligned and
ReallocMemAligned, that allow you to reserve memory with an alignment of
your choice (with the constraint that it has to be a power of 2 and at
least the size of a Pointer). Having such intrinsics will also allow the
FPC language itself to better support aligned dynamic arrays, for example.
C11's "aligned_alloc" is compatible with "free", while Microsoft's own
"_aligned_malloc" is not and requires its own "_aligned_free" call to
properly release the memory. Ideally, I'd rather find a solution where
GetMemAligned and ReallocMemAligned work with FreeMem without
unpredictable effects. This would be quite an undertaking, though,
since it would involve deep research into the memory manager and
ensuring all platforms have a means of supporting it.
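For reference, the over-allocation trick behind Microsoft's _aligned_malloc can be sketched in C as follows (function names are mine). The stashed raw pointer is exactly why such a block cannot be handed to a plain free, which is the incompatibility a FreeMem-compatible GetMemAligned would have to avoid:

```c
#include <stdlib.h>
#include <stdint.h>

/* Sketch of aligned allocation layered on a plain allocator:
   over-allocate, round the address up to the requested alignment
   (assumed to be a power of 2), and stash the raw pointer just before
   the aligned block so the matching free can recover it. */
static void *getmem_aligned(size_t size, size_t alignment)
{
    void *raw = malloc(size + alignment + sizeof(void *));
    if (raw == NULL)
        return NULL;
    uintptr_t aligned = ((uintptr_t)raw + sizeof(void *) + alignment - 1)
                        & ~(uintptr_t)(alignment - 1);
    ((void **)aligned)[-1] = raw;   /* remember the raw pointer */
    return (void *)aligned;
}

static void freemem_aligned(void *p)
{
    if (p != NULL)
        free(((void **)p)[-1]);     /* release the original block */
}
```
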
----
I haven't fully organised myself with this yet. Looking at these
proposals as a dependency graph, I feel that pure functions are the
feature that doesn't depend on anything else, so I should focus my
efforts there first. I'll be writing up design specifications so that
hopefully everyone else can understand what's going on and can either
throw in suggestions, note where performance can be improved, or plain
shoot something down if it's a very bad idea.
My personal vision... I would like to see Free Pascal being relatively
easy to use while still allowing access to powerful features like
intrinsics and having a powerful optimising compiler so games and
scientific programming can greatly benefit.
What are everyone's thoughts?
Gareth aka. Kit