[fpc-devel] State of SSE/AVX intrinsics
J. Gareth Moreton
gareth at moreton-family.com
Tue Apr 21 04:46:13 CEST 2020
Hi everyone,
So to start the story, I'm planning to make use of the nodes that were
introduced for the SSE and AVX intrinsics as part of some vectorisation
code, since they can help manage the code generation and contain some
sanity checks. I noticed though that some intrinsics are missing; for
example, the VMASKMOV instructions, which were introduced with AVX and
don't have a direct SSE equivalent (MASKMOVDQU and MASKMOVQ can only
write to memory and work on integers rather than floating-point values,
and mixing MM integer and floating-point instructions incurs a CPU state
switch penalty).
Instructions like VMASKMOV are very useful because they would allow
vectorisation of 3-component arrays (e.g. Cartesian coordinates), so I
plan to look at introducing nodes for these instructions. Would this be
okay to do?
On another note, I'm wondering if the RTL could benefit from some
initializers for the MM types for ease of use. For example, one of the
operands of VMASKMOV is a mask, and for a programmer using intrinsics,
it would be convenient to be able to write something like "mmval :=
x86_vmaskmovps(Coord, [True, True, True, False]);". Granted, I have to
think about performance, since all those Boolean constants should ideally
be merged into a single 128-bit memory block (as
FFFFFFFFFFFFFFFFFFFFFFFF00000000) that is loaded into an XMM register
with VMOVAPS.
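To make the initialiser idea concrete, here is a rough scalar model in C of what such a Boolean-array mask would expand to, and of what the masked load then does with it. This is only a sketch: pack_mask4 and masked_load4 are hypothetical names used for illustration, not existing RTL or compiler routines, and lane 0 is assumed to be the lowest address, so the constant above is read in memory order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: expand four Booleans into the four 32-bit lanes of a
   128-bit mask. Lane 0 is the lowest address. VMASKMOVPS only tests the top
   bit of each lane; all-ones is used here to match the constant in the text. */
static void pack_mask4(const bool flags[4], uint32_t lanes[4])
{
    for (int i = 0; i < 4; i++)
        lanes[i] = flags[i] ? 0xFFFFFFFFu : 0u;
}

/* Scalar model of a masked load: a lane whose mask sign bit is clear is
   never read from memory and comes back as 0.0. */
static void masked_load4(const float *src, const uint32_t mask[4], float dst[4])
{
    assert(src != NULL);
    for (int i = 0; i < 4; i++)
        dst[i] = (mask[i] & 0x80000000u) ? src[i] : 0.0f;
}
```

With [True, True, True, False] the fourth lane is never touched, which is what makes a 3-component record safe to load even when nothing readable follows it in memory.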
What would you suggest? I'm just speaking a bit from experience, in that
C++ intrinsics can get a little cumbersome sometimes and are easy to
get wrong (as far as performance and alignment are concerned, for
example), and having the FPC ones be friendlier would make a world
of difference.
I'm still working out quite a few things and experimenting a lot. I'll
be sure to write a lot of documentation. However, any help or
insight into the current design practices for intrinsics and their
respective nodes will be greatly appreciated.
Gareth aka. Kit
P.S. Regarding vectorisation challenges, I'm looking at sequences like
"V.X*V.X + V.Y*V.Y + V.Z*V.Z" (squared length of a 3-dimensional vector),
which I would love to be able to naturally compile into:
VMOVAPS XMM1, Mask_1110
VMASKMOVPS XMM0, XMM1, V
VMULPS XMM0, XMM0, XMM0
VHADDPS XMM0, XMM0, XMM0
VHADDPS XMM0, XMM0, XMM0
And then maybe take it further to produce:
VMOVAPS XMM1, Mask_1110
VMASKMOVPS XMM0, XMM1, V
VDPPS XMM0, XMM0, XMM0, $71 { 01110001b }
(This could be an optimisation at the node level rather than a peephole
optimisation, although if it doesn't know exactly what VMASKMOVPS is
doing, then the immediate in (V)DPPS will be forced to be $FF.)
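For anyone wanting to check the arithmetic of the two sequences, here is a small scalar model in C. It is only a sketch of the instruction semantics as I understand them (128-bit forms, both sources the same register); hadd_self, dpps_model and squared_length_hadd are hypothetical names for illustration and do not correspond to any compiler node or intrinsic.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of (V)HADDPS with both sources equal to v (128-bit form). */
static void hadd_self(float v[4])
{
    float lo = v[0] + v[1];
    float hi = v[2] + v[3];
    v[0] = lo; v[1] = hi; v[2] = lo; v[3] = hi;
}

/* Scalar model of (V)DPPS: bits 4..7 of imm8 choose which lane products
   enter the sum, bits 0..3 choose which output lanes receive it. */
static void dpps_model(const float a[4], const float b[4], uint8_t imm8,
                       float out[4])
{
    float p[4];
    for (int i = 0; i < 4; i++)
        p[i] = (imm8 & (0x10u << i)) ? a[i] * b[i] : 0.0f;
    float sum = (p[0] + p[1]) + (p[2] + p[3]);
    for (int i = 0; i < 4; i++)
        out[i] = (imm8 & (1u << i)) ? sum : 0.0f;
}

/* First sequence: multiply lane-wise, then two horizontal adds. All four
   lanes take part, so lane 3 must already be zero (the masked load). */
static float squared_length_hadd(const float v[4])
{
    float t[4];
    for (int i = 0; i < 4; i++)
        t[i] = v[i] * v[i];
    hadd_self(t);
    hadd_self(t);
    return t[0];
}
```

In this model, $71 multiplies only lanes 0 to 2 and writes the sum to lane 0 alone, whereas $FF sums all four lanes and broadcasts the result, which is what the multiply followed by two horizontal adds computes.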