[fpc-devel] Future development plans

J. Gareth Moreton gareth at moreton-family.com
Wed Apr 22 12:10:11 CEST 2020

We don't have to follow other languages exactly.  For example, I prefer 
"TM128" or "M128" over the C-like "_m128".

Personally, I like the way you've named the intrinsics because it's very 
clear which instruction an intrinsic maps to, and which architecture it 
belongs to.  The operand order can be confusing sometimes, especially as 
I was brought up on Intel ordering, but I think this can easily be 
addressed with meaningful parameter names.

Gareth aka. Kit

On 21/04/2020 19:44, Jeppe Johansen wrote:
> One issue with the current state of the intrinsics is that they don't 
> really follow the common style among other languages, and there's no 
> consensus yet about what path to take. Implementing the more common 
> style would be a lot more work than the current autogenerated 
> approach, though.
> Adding the AVX/AVX2 intrinsics isn't hard. I think I have done it on a 
> branch somewhere including a bunch of fixes.
> On 4/21/20 8:29 PM, J. Gareth Moreton wrote:
>> Hi everyone,
>> I hope this doesn't become a monthly podcast for me or something, but 
>> during my bursts of motivation, inspiration and creativity, I start 
>> to plan and research things.  There are a few things I'd like to 
>> develop for FPC, mostly together because there's a lot of 
>> interdependency.
>> * SSE/AVX intrinsics
>> Most of the node types for the SSE instructions have been 
>> implemented, as well as some wrapper functions that are disabled by 
>> default while their format is finalised.  The nodes that the compiler 
>> generates would be useful when it comes to vectorisation, since a lot 
>> of things like parameters and type checks will be already handled by 
>> them.  There are some gaps though.  For example, AVX introduced more 
>> powerful 'mask move' instructions that allow you to read as well as 
>> write partial vectors, which would be very useful when it comes to, 
>> say, optimising algorithms that deal with 3-component vectors (very 
>> common because 3-component vectors could represent 3D Cartesian 
>> coordinates or an RGB triplet, for example).
>> * Vectorisation
>> I think this is probably the next big iteration for the compiler and 
>> optimiser.  Besides the obvious loop unrolling vectorisation, there 
>> are a number of common algorithms that are logically easy to 
>> vectorise but which may take some careful analysis to actually 
>> detect.  One of my test cases is the classic dot product.  In 
>> raybench.pas, a 3-dimensional dot product appears as part of a 
>> function that returns a vector's length - Sqrt(V.X*V.X + V.Y*V.Y + 
>> V.Z*V.Z) - under AVX, the expression inside the square root can be 
>> optimised into a mask move (so only the first 3 components of an XMM 
>> register are loaded with the fields of V and the 4th component set to 
>> zero) and then all the additions and multiplications are performed 
>> with a single instruction: VDPPS XMM0, XMM0, XMM0, $71.  The 
>> immediate $71 says 'multiply and horizontally add only the first 
>> three components, then store the result only in the 1st component'. 
>> $FF would also work, since the 4th component is zero and only the 
>> 1st component is read for the result, but it is a little more clumsy 
>> in my opinion.
>> My intention, at least for these kinds of algorithms, is to make use 
>> of the new intrinsic nodes for specific SSE and AVX instructions, 
>> although there are some intrinsics missing, like the aforementioned 
>> mask move.
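
To make the $71 immediate concrete, here is a minimal scalar model in plain Pascal of what VDPPS computes. It is only a sketch: TVec4 and DotMasked are names made up for illustration and are not part of FPC or its intrinsics. The real instruction adds the products in a fixed pairwise order, so rounding can differ slightly from this simple loop.

```pascal
program DppsModel;
{$mode objfpc}

type
  // Hypothetical stand-in for the four Single lanes of an XMM register.
  TVec4 = array[0..3] of Single;

// Scalar model of (V)DPPS: bits 4..7 of Imm choose which lane products
// are summed, bits 0..3 choose which result lanes receive the sum.
// Unselected result lanes are set to zero.
function DotMasked(const A, B: TVec4; Imm: Byte): TVec4;
var
  I: Integer;
  Sum: Single;
begin
  Sum := 0.0;
  for I := 0 to 3 do
    if (Imm and (1 shl (I + 4))) <> 0 then
      Sum := Sum + A[I] * B[I];
  for I := 0 to 3 do
    if (Imm and (1 shl I)) <> 0 then
      Result[I] := Sum
    else
      Result[I] := 0.0;
end;

const
  // X, Y, Z plus a zeroed 4th lane, as a mask move would load it.
  V: TVec4 = (1.0, 2.0, 2.0, 0.0);
var
  R: TVec4;
begin
  // $71 = %0111_0001: multiply lanes 0..2, store the sum in lane 0 only.
  R := DotMasked(V, V, $71);
  WriteLn(Sqrt(R[0]):0:1);  // vector length; prints 3.0
end.
```

With the 4th lane zeroed, passing $FF instead gives the same value in lane 0, which matches the remark above about $FF still working.
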
>> * Pure functions
>> It might be overly ambitious, but I seek to make the SSE/AVX 
>> intrinsics much easier to use (it easily becomes inefficient in C++ 
>> if you haven't got data alignments correct).  One example I came up 
>> with is using masks in SSE/AVX instructions.  If you want to call, 
>> say, x86_vmaskmovps (an intrinsic for VMASKMOVPS), you would have to 
>> set up an additional _m128 store and load in a custom-made mask (e.g. 
>> const M128Mask: _m128 = (-1.0; -1.0; -1.0; 0.0); ... 
>> x86_vmaskmovps(DestAddr, M128Data, x86_movaps(M128Mask));).  This 
>> becomes more problematic if you need to specifically represent 
>> $80000000 or $FFFFFFFF in one of the floating-point fields (the 
>> former is negative zero, and the latter is one of many thousands of 
>> quiet NaN representations). An example of a much cleaner solution 
>> could be x86_vmaskmovps(DestAddr, M128Data, [True, True, True, 
>> False]);, with an explicit typecast/assignment operator that converts 
>> an array of Booleans into a mask that could be defined and 
>> implemented somewhere in the RTL.  Normally, this would be a 
>> prohibitively slow function to execute, but if the 
>> typecast/assignment operator was defined as a pure function, then it 
>> could be evaluated at compile time and the resultant _m128 stored as 
>> an implicit constant that is loaded directly into an XMM register when 
>> needed, without tasking the programmer with floating-point bit 
>> manipulation in order to create said constant in the code.
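
For illustration, the Boolean-array conversion could look roughly like this in plain Pascal. TLaneMask and MakeLaneMask are hypothetical names; the array of four LongWords merely stands in for an _m128, and nothing here is evaluated at compile time, since that is exactly what pure functions would add. VMASKMOVPS only tests the top bit of each 32-bit lane, so all-ones and all-zeros lanes are one safe choice of encoding.

```pascal
program BoolMaskDemo;
{$mode objfpc}

type
  // Hypothetical stand-in for an _m128 viewed as four 32-bit lanes.
  TLaneMask = array[0..3] of LongWord;

// Turns up to four Booleans into lane masks: True becomes $FFFFFFFF,
// False (or a missing entry) becomes 0.
function MakeLaneMask(const Flags: array of Boolean): TLaneMask;
var
  I: Integer;
begin
  for I := 0 to 3 do
    if (I <= High(Flags)) and Flags[I] then
      Result[I] := $FFFFFFFF
    else
      Result[I] := 0;
end;

var
  M: TLaneMask;
  I: Integer;
begin
  M := MakeLaneMask([True, True, True, False]);
  // Prints FFFFFFFF three times, then 00000000.
  for I := 0 to 3 do
    WriteLn(HexStr(Int64(M[I]), 8));
end.
```

Working on the integer bit patterns sidesteps the problem of spelling $80000000 or $FFFFFFFF as floating-point literals.
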
>> * Aligned Allocation
>> This couples with SSE and AVX specifically, but has other uses such 
>> as with paging, for example.  Following in the footsteps of C11, I 
>> would like to propose a couple of new intrinsic operations: 
>> GetMemAligned and ReallocMemAligned, that allow you to reserve memory 
>> with an alignment of your choice (with the constraint that it has to 
>> be a power of 2 and at least the size of a Pointer). Having such 
>> intrinsics will also allow the FPC language itself to better support 
>> aligned dynamic arrays, for example.
>> C11's "aligned_alloc" is compatible with "free", while Microsoft's 
>> own "_aligned_malloc" is not compatible with "free" and requires its 
>> own "_aligned_free" call to properly release. Ideally I would rather 
>> find a solution where GetMemAligned and ReallocMemAligned will work with 
>> FreeMem without having unpredictable effects.  This would be quite an 
>> undertaking though since it would involve deep research into the 
>> memory manager and ensuring all platforms have a means with which to 
>> support it.
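
One well-known way to get aligned memory on top of an unmodified memory manager is to over-allocate and stash the original pointer just below the aligned block. The sketch below assumes that approach; GetMemAlignedSketch and FreeMemAlignedSketch are made-up names, not proposed RTL API. It also shows why FreeMem compatibility is the hard part: the pointer handed out is not the one GetMem returned, so it needs its own release routine.

```pascal
program AlignedAllocSketch;
{$mode objfpc}

// Returns a block of Size bytes whose address is a multiple of Alignment,
// or nil if Alignment is not a power of two of at least SizeOf(Pointer).
function GetMemAlignedSketch(Size, Alignment: PtrUInt): Pointer;
var
  Raw: Pointer;
  Aligned, Slot: PtrUInt;
begin
  Result := nil;
  if (Alignment < SizeOf(Pointer)) or
     ((Alignment and (Alignment - 1)) <> 0) then
    Exit;
  // Room for the payload, worst-case padding and one hidden pointer.
  Raw := GetMem(Size + Alignment + SizeOf(Pointer));
  if Raw = nil then
    Exit;
  // Round up past the hidden pointer slot to the next aligned address.
  Aligned := (PtrUInt(Raw) + SizeOf(Pointer) + Alignment - 1)
             and not (Alignment - 1);
  // Remember the address GetMem actually returned, just below the block.
  Slot := Aligned - SizeOf(Pointer);
  PPointer(Slot)^ := Raw;
  Result := Pointer(Aligned);
end;

// Releases a block obtained from GetMemAlignedSketch. Passing such a
// block straight to FreeMem would hand it an address it never issued.
procedure FreeMemAlignedSketch(P: Pointer);
var
  Slot: PtrUInt;
begin
  if P = nil then
    Exit;
  Slot := PtrUInt(P) - SizeOf(Pointer);
  FreeMem(PPointer(Slot)^);
end;

var
  P: Pointer;
begin
  P := GetMemAlignedSketch(48, 32);  // e.g. room for one and a half YMM registers
  if P <> nil then
    WriteLn(PtrUInt(P) mod 32);      // prints 0
  FreeMemAlignedSketch(P);
end.
```

Making plain FreeMem accept such pointers would need support inside the memory manager itself, which is the research mentioned above.
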
>> ----
>> I haven't fully organised myself with this yet.  Looking at these 
>> proposals as a dependency graph, I feel that pure functions is the 
>> feature that doesn't depend on anything else and I should focus my 
>> efforts here first.  I'll be writing up design specifications so 
>> hopefully everyone else can understand what's going on and either 
>> throw in suggestions, note where performance can be improved or plain 
>> shoot something down if it's a very bad idea.
>> My personal vision... I would like to see Free Pascal being 
>> relatively easy to use while still allowing access to powerful 
>> features like intrinsics and having a powerful optimising compiler so 
>> games and scientific programming can greatly benefit.
>> What are everyone's thoughts?
>> Gareth aka. Kit
