[fpc-devel] Future development plans

Tue Apr 21 20:44:30 CEST 2020

One issue with the current state of the intrinsics is that they don't 
really follow the style common in other languages, and there's no 
consensus yet about which path to take. Implementing the more common 
style would be a lot of work, though, compared to the current 
autogenerated approach.

Adding the AVX/AVX2 intrinsics isn't hard. I think I have done it on a 
branch somewhere including a bunch of fixes.

On 4/21/20 8:29 PM, J. Gareth Moreton wrote:
> Hi everyone,
>
> I hope this doesn't become a monthly podcast for me or something, but 
> during my bursts of motivation, inspiration and creativity, I start to 
> plan and research things.  There are a few things I'd like to develop 
> for FPC, mostly together because there's a lot of interdependency.
>
> * SSE/AVX intrinsics
>
> Most of the node types for the SSE instructions have been implemented, 
> as well as some wrapper functions that are disabled by default while 
> their format is finalised.  The nodes that the compiler generates 
> would be useful when it comes to vectorisation, since a lot of things 
> like parameters and type checks will be already handled by them.  
> There are some gaps though.  For example, AVX introduced more powerful 
> 'mask move' instructions that allow you to read as well as write 
> partial vectors, which would be very useful when it comes to, say, 
> optimising algorithms that deal with 3-component vectors (very common 
> because 3-component vectors could represent 3D Cartesian coordinates 
> or an RGB triplet, for example).
>
> * Vectorisation
>
> I think this is probably the next big iteration for the compiler and 
> optimiser.  Besides the obvious loop unrolling vectorisation, there 
> are a number of common algorithms that are logically easy to vectorise 
> but which may take some careful analysis to actually detect.  One of 
> my test cases is the classic dot product.  In raybench.pas, a 
> 3-dimensional dot product appears as part of a function that returns a 
> vector's length - Sqrt(V.X*V.X + V.Y*V.Y + V.Z*V.Z) - under AVX, the 
> expression inside the square root can be optimised into a mask move 
> (so only the first 3 components of an XMM register are loaded with the 
> fields of V and the 4th component set to zero) and then all the 
> additions and multiplications are performed with a single instruction: 
> VDPPS XMM0, XMM0, XMM0, $71 ($71 specifically says 'only multiply 
> and horizontally add the first three components, and then store the 
> result only in the 1st component'; $FF will still work since the 4th 
> component is equal to zero and only the 1st component is read for the 
> result, but is a little more clumsy in my opinion).
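
To make the meaning of the immediate concrete, here is a minimal scalar 
sketch in C of how the VDPPS immediate behaves on one 128-bit lane. The 
name dpps_model is made up for illustration; it is not an FPC intrinsic 
or a real library call, only a model of the instruction's documented 
behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the (V)DPPS immediate for one 128-bit lane.
 * Bits 4..7 of imm8 choose which products a[i]*b[i] enter the sum;
 * bits 0..3 choose which result elements receive the sum (the rest
 * are set to zero). dpps_model is a hypothetical name. */
static void dpps_model(const float a[4], const float b[4],
                       uint8_t imm8, float out[4])
{
    float prod[4];
    for (int i = 0; i < 4; i++) {
        /* Products that are not selected contribute 0.0 */
        prod[i] = (imm8 & (1u << (4 + i))) ? a[i] * b[i] : 0.0f;
    }

    /* Pairwise summation: (p0 + p1) + (p2 + p3) */
    float sum = (prod[0] + prod[1]) + (prod[2] + prod[3]);

    for (int i = 0; i < 4; i++) {
        out[i] = (imm8 & (1u << i)) ? sum : 0.0f;
    }
}
```

With $71 the 4th component never enters the sum, whatever it holds; 
with $FF the result is only right if the 4th component really is zero, 
which is what the mask move is meant to guarantee.
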
>
> My intention, at least for these kinds of algorithms, is to make use 
> of the new intrinsic nodes for specific SSE and AVX instructions, 
> although there are some intrinsics missing, like the aforementioned 
> mask move.
>
> * Pure functions
>
> It might be overly ambitious, but I seek to make the SSE/AVX 
> intrinsics much easier to use (it easily becomes inefficient in C++ if 
> you haven't got data alignments correct).  One example I came up with 
> is using masks in SSE/AVX instructions.  If you want to call, say, 
> x86_vmaskmovps (an intrinsic for VMASKMOVPS), you would have to set up 
> an additional _m128 store and load in a custom-made mask (e.g. const 
> M128Mask: _m128 = (-1.0; -1.0; -1.0; 0.0); ... 
> x86_vmaskmovps(DestAddr, M128Data, x86_movaps(M128Mask));).  This 
> becomes more problematic if you need to specifically represent 
> $80000000 or $FFFFFFFF in one of the floating-point fields (the former 
> is negative zero, and the latter is one of millions of quiet NaN 
> representations). An example of a much cleaner solution could be 
> x86_vmaskmovps(DestAddr, M128Data, [True, True, True, False]);, with 
> an explicit typecast/assignment operator that converts an array of 
> Booleans into a mask that could be defined and implemented somewhere 
> in the RTL.  Normally, this would be a prohibitively slow function to 
> execute, but if the typecast/assignment operator was defined as a pure 
> function, then it could be evaluated at compile time and the resultant 
> _m128 stored as an implicit constant that is loaded directly into an 
> XMM register when needed, without tasking the programmer with 
> floating-point bit manipulation in order to create said constant in 
> the code.
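
As a rough illustration of what such a conversion would compute, here 
is a small sketch in C. Both mask_from_bools and float_from_bits are 
hypothetical helpers invented for this example, not RTL or compiler 
routines. It relies on the fact that VMASKMOVPS only tests the sign 
bit (bit 31) of each mask element, and it shows why the bit patterns 
are awkward to write as floating-point literals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: turn four Booleans into the four 32-bit
 * elements of a mask. VMASKMOVPS only tests bit 31 of each element,
 * so all-ones selects an element and all-zeros skips it. */
static void mask_from_bools(const bool select[4], uint32_t lanes[4])
{
    for (int i = 0; i < 4; i++) {
        lanes[i] = select[i] ? UINT32_C(0xFFFFFFFF) : UINT32_C(0);
    }
}

/* Hypothetical helper: reinterpret a 32-bit pattern as a float.
 * memcpy avoids breaking C's aliasing rules. */
static float float_from_bits(uint32_t bits)
{
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Seen as floats, $80000000 compares equal to zero and $FFFFFFFF is a 
NaN, so neither is pleasant to spell out in a typed constant. A pure 
function would let the compiler fold a call like this into a constant 
instead of running the loop at run time.
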
>
> * Aligned Allocation
>
> This couples with SSE and AVX specifically, but has other uses such as 
> with paging, for example.  Following in the footsteps of C11, I would 
> like to propose a couple of new intrinsic operations: GetMemAligned 
> and ReallocMemAligned, that allow you to reserve memory with an 
> alignment of your choice (with the constraint that it has to be a 
> power of 2 and at least the size of a Pointer). Having such intrinsics 
> will also allow the FPC language itself to better support aligned 
> dynamic arrays, for example.
>
> C11's "aligned_alloc" is compatible with "free", while Microsoft's own 
> "_aligned_malloc" is not compatible with "free" and requires its own 
> "_aligned_free" call to properly release. Ideally I would rather find a 
> solution where GetMemAligned and ReallocMemAligned will work with 
> FreeMem without having unpredictable effects.  This would be quite an 
> undertaking though since it would involve deep research into the 
> memory manager and ensuring all platforms have a means with which to 
> support it.
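
For comparison, this is roughly what the C11 route looks like. It is 
only a sketch: get_mem_aligned is a hypothetical name standing in for 
the proposed GetMemAligned, and it assumes a C library that provides 
aligned_alloc (Microsoft's runtime does not). It says nothing about how 
FPC's own memory manager would have to implement this.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical analogue of the proposed GetMemAligned, built on C11
 * aligned_alloc. Returns NULL if alignment is not a power of two or
 * is smaller than a pointer, or if the allocation fails. */
static void *get_mem_aligned(size_t size, size_t alignment)
{
    if (alignment < sizeof(void *) ||
        (alignment & (alignment - 1)) != 0) {
        return NULL;
    }

    /* Guard against overflow when rounding the size up */
    if (size > SIZE_MAX - (alignment - 1)) {
        return NULL;
    }

    /* C11 asks for size to be a multiple of alignment, so round up */
    size_t rounded = (size + alignment - 1) & ~(alignment - 1);
    if (rounded == 0) {
        rounded = alignment; /* avoid a zero-sized request */
    }

    return aligned_alloc(alignment, rounded);
}
```

The block that comes back is released with plain free(), which is the 
property being asked for with FreeMem above.
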
>
> ----
>
> I haven't fully organised myself with this yet.  Looking at these 
> proposals as a dependency graph, I feel that pure functions is the 
> feature that doesn't depend on anything else and I should focus my 
> efforts here first.  I'll be writing up design specifications so 
> hopefully everyone else can understand what's going on and either 
> throw in suggestions, note where performance can be improved or plain 
> shoot something down if it's a very bad idea.
>
> My personal vision... I would like to see Free Pascal being relatively 
> easy to use while still allowing access to powerful features like 
> intrinsics and having a powerful optimising compiler so games and 
> scientific programming can greatly benefit.
>
> What are everyone's thoughts?
>
> Gareth aka. Kit
>
>