[fpc-devel] Future development plans
J. Gareth Moreton
gareth at moreton-family.com
Tue Apr 21 20:29:20 CEST 2020
Hi everyone,
I hope this doesn't become a monthly podcast for me or something, but
during my bursts of motivation, inspiration and creativity, I start to
plan and research things. There are a few things I'd like to develop
for FPC, mostly together because there's a lot of interdependency
between them.
* SSE/AVX intrinsics
Most of the node types for the SSE instructions have been implemented,
as well as some wrapper functions that are disabled by default while
their format is finalised. The nodes that the compiler generates would
be useful when it comes to vectorisation, since a lot of things like
parameters and type checks will be already handled by them. There are
some gaps, though. For example, AVX introduced more powerful 'mask move'
instructions that allow you to read as well as write partial vectors,
which would be very useful when it comes to, say, optimising algorithms
that deal with 3-component vectors (very common, because a 3-component
vector can represent a 3D Cartesian coordinate or an RGB triplet, for
example).
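To make the mask-move behaviour concrete, here is a minimal scalar sketch in C of what an AVX masked load (VMASKMOVPS) does per lane. The function name is my own, and this models the instruction rather than invoking it:

```c
#include <stdint.h>

/* Scalar model of AVX's VMASKMOVPS load behaviour: each of the four
   lanes is loaded only if the sign bit of the corresponding mask lane
   is set; masked-off lanes are zeroed and their memory is not read. */
static void maskmove_load_ps(float dst[4], const float src[4],
                             const int32_t mask[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = (mask[i] < 0) ? src[i] : 0.0f; /* sign bit selects */
}
```

With a mask of {-1, -1, -1, 0}, this reads a 3-component vector without touching the element behind the 4th lane, which is exactly the partial-vector access described above.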
* Vectorisation
I think this is probably the next big iteration for the compiler and
optimiser. Besides the obvious loop unrolling vectorisation, there are
a number of common algorithms that are logically easy to vectorise but
which may take some careful analysis to actually detect. One of my test
cases is the classic dot product. In raybench.pas, a 3-dimensional dot
product appears as part of a function that returns a vector's length:
Sqrt(V.X*V.X + V.Y*V.Y + V.Z*V.Z). Under AVX, the expression inside the
square root can be optimised into a mask move (so only the first 3
components of an XMM register are loaded with the fields of V and the
4th component is set to zero), after which all the additions and
multiplications are performed with a single instruction: VDPPS XMM0,
XMM0, XMM0, $71. (The $71 immediate specifically says 'only multiply
and horizontally add the first three components, then store the result
only in the 1st component'. $FF will still work, since the 4th
component is zero and only the 1st component is read for the result,
but it is a little clumsier in my opinion.)
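For illustration, the immediate encoding of DPPS can be modelled in scalar C like this (the function name is mine and this is a model, not an actual intrinsic): the high nibble of imm8 selects which lanes enter the multiply-and-add, and the low nibble selects which destination lanes receive the sum.

```c
#include <stdint.h>

/* Scalar model of SSE4.1 DPPS: bits 4-7 of imm8 choose the lanes that
   are multiplied and summed; bits 0-3 choose the destination lanes
   that receive that sum (all other lanes are zeroed). */
static void dpps_model(float dst[4], const float a[4], const float b[4],
                       uint8_t imm8)
{
    float sum = 0.0f;
    for (int i = 0; i < 4; i++)
        if (imm8 & (0x10u << i))
            sum += a[i] * b[i];
    for (int i = 0; i < 4; i++)
        dst[i] = (imm8 & (1u << i)) ? sum : 0.0f;
}
```

With imm8 = $71 and both operands set to V, lane 0 ends up holding V.X*V.X + V.Y*V.Y + V.Z*V.Z, matching the VDPPS usage above.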
My intention, at least for these kinds of algorithms, is to make use of
the new intrinsic nodes for specific SSE and AVX instructions, although
there are some intrinsics missing, like the aforementioned mask move.
* Pure functions
It might be overly ambitious, but I seek to make the SSE/AVX intrinsics
much easier to use (they easily become inefficient in C++ if you haven't
got data alignments correct). One example I came up with is using masks
in SSE/AVX instructions. If you want to call, say, x86_vmaskmovps (an
intrinsic for VMASKMOVPS), you would have to set up an additional _m128
store and load in a custom-made mask (e.g. const M128Mask: _m128 =
(-1.0; -1.0; -1.0; 0.0); ... x86_vmaskmovps(DestAddr, M128Data,
x86_movaps(M128Mask));). This becomes more problematic if you need to
specifically represent $80000000 or $FFFFFFFF in one of the
floating-point fields (the former is negative zero, and the latter is
one of many thousands of quiet NaN representations). An example of a
much cleaner solution could be x86_vmaskmovps(DestAddr, M128Data,
[True, True, True, False]);, with an explicit typecast/assignment
operator that converts an array of Booleans into a mask, defined and
implemented somewhere in the RTL. Normally, this would be a
prohibitively slow function to execute, but if the typecast/assignment
operator were defined as a pure function, then it could be evaluated at
compile time and the resultant _m128 stored as an implicit constant
that is loaded directly into an MM register when needed, without
tasking the programmer with floating-point bit manipulation in order to
create said constant in the code.
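The conversion itself could look something like this scalar C sketch (the function name and mapping are mine): True becomes $FFFFFFFF, i.e. -1, which has the sign bit set that the mask-move instructions test, and False becomes zero. The point of making it a pure function is that the compiler could fold the whole call into a ready-made _m128 constant at compile time.

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of the Boolean-array-to-mask conversion: True maps to
   all-ones ($FFFFFFFF, sign bit set), False to all-zero bits.  As a
   pure function, the compiler could evaluate this once at compile
   time and emit the result as a 128-bit constant. */
static void bools_to_mask(int32_t mask[4], const bool lanes[4])
{
    for (int i = 0; i < 4; i++)
        mask[i] = lanes[i] ? -1 : 0;
}
```
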
* Aligned Allocation
This couples with SSE and AVX specifically, but has other uses such as
with paging, for example. Following in the footsteps of C11, I would
like to propose a couple of new intrinsic operations: GetMemAligned and
ReallocMemAligned, that allow you to reserve memory with an alignment of
your choice (with the constraint that it has to be a power of 2 and at
least the size of a Pointer). Having such intrinsics will also allow the
FPC language itself to better support aligned dynamic arrays, for example.
C11's "aligned_alloc" is compatible with "free", while Microsoft's own
"_aligned_malloc" is not and requires its own "_aligned_free" call to
properly release the memory. Ideally, I'd rather find a solution where
GetMemAligned and ReallocMemAligned work with FreeMem without
unpredictable effects. This would be quite an undertaking, though,
since it would involve deep research into the memory manager and
ensuring all platforms have a means of supporting it.
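For reference, the over-allocation trick behind Microsoft's _aligned_malloc can be sketched in C as follows (function names are mine). The stashed raw pointer is exactly why such a block cannot be handed to a plain free, which is the incompatibility a FreeMem-compatible GetMemAligned would have to avoid:

```c
#include <stdlib.h>
#include <stdint.h>

/* Sketch of aligned allocation layered on a plain allocator:
   over-allocate, round the address up to the requested alignment
   (assumed to be a power of 2), and stash the raw pointer just before
   the aligned block so the matching free can recover it. */
static void *getmem_aligned(size_t size, size_t alignment)
{
    void *raw = malloc(size + alignment + sizeof(void *));
    if (raw == NULL)
        return NULL;
    uintptr_t aligned = ((uintptr_t)raw + sizeof(void *) + alignment - 1)
                        & ~(uintptr_t)(alignment - 1);
    ((void **)aligned)[-1] = raw;   /* remember the raw pointer */
    return (void *)aligned;
}

static void freemem_aligned(void *p)
{
    if (p != NULL)
        free(((void **)p)[-1]);     /* release the original block */
}
```
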
----
I haven't fully organised myself with this yet. Looking at these
proposals as a dependency graph, I feel that pure functions are the
feature that doesn't depend on anything else, so I should focus my
efforts there first. I'll be writing up design specifications so that
hopefully everyone else can understand what's going on and can either
throw in suggestions, note where performance can be improved, or plain
shoot something down if it's a very bad idea.
My personal vision... I would like to see Free Pascal being relatively
easy to use while still allowing access to powerful features like
intrinsics and having a powerful optimising compiler so games and
scientific programming can greatly benefit.
What are everyone's thoughts?
Gareth aka. Kit