[fpc-devel] State of SSE/AVX intrinsics

Tue Apr 21 04:46:13 CEST 2020

Hi everyone,

So to start the story, I'm planning to make use of the nodes that were 
introduced for the SSE and AVX intrinsics as part of some vectorisation 
code, since they can help manage the code generation and contain some 
sanity checks.  I noticed though that some intrinsics are missing; for 
example, the VMASKMOV instructions, which were introduced with AVX and 
don't have a direct SSE equivalent (MASKMOVDQU and MASKMOVQ can only 
write to memory and work on integers rather than floating-point values, 
and mixing MM integer and floating-point instructions incurs a CPU state 
switch penalty).

Instructions like VMASKMOV are very useful because they would allow 
vectorisation of 3-component arrays (e.g. Cartesian coordinates), so I 
plan to look at introducing nodes for these instructions. Would this be 
okay to do?

On another note, I'm wondering if the RTL could benefit from some 
initializers for the MM types for ease of use.  For example, one of the 
registers used in VMASKMOV is a mask, and for a programmer using 
intrinsics, being able to write something like "mmval := 
x86_vmaskmovps(Coord, [True, True, True, False]);" would be very 
convenient.  Granted, I have to think about performance, since all those 
Boolean constants should ideally be merged into a single 128-bit memory 
block (as FFFFFFFFFFFFFFFFFFFFFFFF00000000) that is loaded into an XMM 
register with VMOVAPS (or VMOVUPS if the block isn't aligned).
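
To illustrate what I mean by merging the Booleans, here is a rough 
sketch in plain C rather than Pascal.  The names mask128 and make_mask 
are made up for this example and are not existing RTL or compiler 
identifiers.  It just expands four Booleans into four 32-bit lanes, with 
lane 0 at the lowest address; called with constants, a compiler can fold 
the whole thing into one 16-byte constant.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 128-bit mask type: four 32-bit lanes, lane 0 at the
   lowest address. */
typedef struct { uint32_t lane[4]; } mask128;

/* Expand four Booleans into all-ones / all-zero lanes.  VMASKMOVPS only
   tests the top bit of each lane, so all-ones is just a convenient
   choice of "true" value. */
mask128 make_mask(bool b0, bool b1, bool b2, bool b3)
{
    mask128 m;
    m.lane[0] = b0 ? 0xFFFFFFFFu : 0u;
    m.lane[1] = b1 ? 0xFFFFFFFFu : 0u;
    m.lane[2] = b2 ? 0xFFFFFFFFu : 0u;
    m.lane[3] = b3 ? 0xFFFFFFFFu : 0u;
    return m;
}
```

make_mask(true, true, true, false) gives three all-ones lanes followed 
by a zero lane, which is the memory block described above.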

What would you suggest? I'm just speaking a bit from experience in that 
using C++ intrinsics can get a little cumbersome sometimes, and they are easy to 
get wrong (at least as far as performance and alignment are concerned, 
for example), and having the FPC ones be friendlier would make a world 
of difference.

I'm still working out quite a few things and experimenting a lot.  I'll 
be sure to write plenty of documentation.  However, any help or 
insight into the current design practices for intrinsics and their 
respective nodes will be greatly appreciated.

Gareth aka. Kit

P.S. Regarding vectorisation challenges, I'm looking at sequences like 
"V.X*V.X + V.Y*V.Y + V.Z*V.Z" (squared length of a 3-dimensional vector), 
which I would love to be able to naturally compile into:

VMOVAPS XMM1, Mask_1110
VMASKMOVPS XMM0, XMM1, V
VMULPS XMM0, XMM0, XMM0
VHADDPS XMM0, XMM0, XMM0
VHADDPS XMM0, XMM0, XMM0

And then maybe take it further to produce:

VMOVAPS XMM1, Mask_1110
VMASKMOVPS XMM0, XMM1, V
VDPPS XMM0, XMM0, XMM0, $71 { 01110001b }

(This could be an optimisation at the node level rather than a peephole 
optimisation, although if it doesn't know exactly what VMASKMOVPS is 
doing, then the immediate in (V)DPPS will be forced to be $FF)
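
To make the immediate concrete, here is a rough reference model of the 
three instructions in plain C, written from the instruction descriptions 
as I understand them.  The names vec4, maskload_model, haddps_model and 
dpps_model are made up for this sketch; they are not FPC intrinsics or 
compiler nodes, and the model ignores NaN and exception behaviour.

```c
#include <stdint.h>

typedef struct { float f[4]; } vec4;

/* Model of a VMASKMOVPS load: a lane is read only if the top bit of its
   mask element is set; other lanes become 0.0 and their memory is not
   touched. */
vec4 maskload_model(const float *src, const uint32_t mask[4])
{
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.f[i] = (mask[i] & 0x80000000u) ? src[i] : 0.0f;
    return r;
}

/* Model of (V)HADDPS: pairwise sums of a fill lanes 0-1, pairwise sums
   of b fill lanes 2-3. */
vec4 haddps_model(vec4 a, vec4 b)
{
    vec4 r = {{ a.f[0] + a.f[1], a.f[2] + a.f[3],
                b.f[0] + b.f[1], b.f[2] + b.f[3] }};
    return r;
}

/* Model of (V)DPPS: bits 4-7 of imm select which products are summed,
   bits 0-3 select which result lanes receive the sum (others are 0.0). */
vec4 dpps_model(vec4 a, vec4 b, uint8_t imm)
{
    float p[4];
    for (int i = 0; i < 4; i++)
        p[i] = (imm & (0x10 << i)) ? a.f[i] * b.f[i] : 0.0f;
    float sum = (p[0] + p[1]) + (p[2] + p[3]);
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.f[i] = (imm & (1 << i)) ? sum : 0.0f;
    return r;
}
```

With V = (1, 2, 2) and the mask from above, squaring the loaded lanes 
and applying haddps_model twice, or calling dpps_model with $71 
directly, both leave 9 in lane 0.  With $71 the fourth lane is excluded 
by the immediate itself; with $FF the result is only right if the fourth 
lane is known to be zero, which is what the masked load guarantees.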
