[fpc-devel] vmul commutative optimization?

Thu Nov 14 01:14:31 CET 2019

On 13/11/2019 16:03, Marco van de Voort wrote:
>
> Op 2019-11-12 om 20:46 schreef J. Gareth Moreton:
>>
>> The Microsoft ABI is a bit restrictive when it comes to record types; 
>> as described here 
>> <https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=vs-2019>, 
>> "Structs and unions of size 8, 16, 32, or 64 bits, and __m64 types, 
>> are passed as if they were integers of the same size." So 
>> unfortunately, a single-precision complex number is treated as a 
>> 64-bit structure and passed as an integer.  The System V ABI, on the 
>> other hand, would pass the two entries through the lower 64 bits of 
>> XMM0.  Vectorcall, theoretically, should put the two components into 
>> XMM0 and XMM1, because the complex type would be considered a 
>> "homogeneous vector aggregate" (with floats as 1-dimensional vectors).
>>
> I've found refs like 
> https://devblogs.microsoft.com/cppblog/introducing-vector-calling-convention/#comments 
> so the question is whether partial vectors (and especially two Singles 
> in 8 bytes, since there are various special SSE opcodes to deal with 
> them) take one register or not. The references I found usually talk 
> about "vector types like __m128 and __m256", but don't really specify 
> an exhaustive list.
>
>
> I guess that means testing with VS?

Testing with Visual Studio, or even GCC under Windows, is a good idea if 
you want to be sure how particular record types are passed.  The 
example given in that article has two fields of type __m128, even though 
it looks like only one of the four vector elements is used initially.  
Regardless, under the default Microsoft calling convention, that record 
would be passed by reference, just like a record of two Doubles.  A 
(packed) record of two Singles would be passed by value in an integer 
register, just to cause trouble with conversions!

But to give a clear answer, under fastcall (the default Microsoft ABI), 
a record of two Singles will be passed by value through an integer 
register (RCX if it's the first parameter), and a record of two Doubles 
will be passed by reference (pointer in RCX if it's the first 
parameter).  Under vectorcall, a record of two Singles would be treated 
as a Homogeneous Float Aggregate, with its two fields passed in XMM0 and 
XMM1, and the same thing happens with an unaligned record of two 
Doubles.  If a record of two Doubles is aligned to a 16-byte boundary 
though, or is otherwise a union with a __m128 type (with the two Doubles 
aliased to the lower and upper 64 bits respectively), then it can be 
passed in its entirety through XMM0.
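
For anyone who wants to try this with Visual Studio or GCC, here is a minimal C sketch of the three record shapes discussed above.  The struct names and the helper ms_x64_passed_in_integer_register are hypothetical names made up for illustration; the helper only encodes the documented "8, 16, 32, or 64 bits" rule for the default convention and does not query the compiler.  The two sum functions exist purely so that their disassembly can be inspected.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Record of two Singles: 8 bytes in total. */
struct complex_single { float re, im; };

/* Record of two Doubles: 16 bytes, with only natural alignment. */
struct complex_double { double re, im; };

/* The same two Doubles forced onto a 16-byte boundary. */
struct complex_double_aligned { alignas(16) double re; double im; };

static_assert(sizeof(struct complex_single) == 8, "two Singles should be 8 bytes");
static_assert(sizeof(struct complex_double) == 16, "two Doubles should be 16 bytes");
static_assert(alignof(struct complex_double_aligned) == 16, "alignment request ignored");

/* Hypothetical helper (not a compiler API): true when the default
   Microsoft x64 convention passes an aggregate of this size by value
   in an integer register, i.e. when it is 8, 16, 32 or 64 bits wide.
   Anything else is passed by reference. */
bool ms_x64_passed_in_integer_register(size_t size_in_bytes)
{
    return size_in_bytes == 1 || size_in_bytes == 2 ||
           size_in_bytes == 4 || size_in_bytes == 8;
}

/* Small by-value callees; their disassembly shows where the argument arrives. */
float sum_single(struct complex_single z) { return z.re + z.im; }
double sum_double(struct complex_double z) { return z.re + z.im; }

/* Prints a record's size and its predicted treatment under the default convention. */
void print_prediction(const char *name, size_t size_in_bytes)
{
    printf("%-24s %2zu bytes: %s\n", name, size_in_bytes,
           ms_x64_passed_in_integer_register(size_in_bytes)
               ? "by value in an integer register"
               : "by reference");
}
```

Calling print_prediction("two Singles", sizeof(struct complex_single)) and friends from main gives the predicted behaviour; compiling with an assembly listing (/FA with MSVC, -S with GCC) then lets you compare that against what the compiler really does.  The vectorcall cases are not modelled here and would still need to be checked in the disassembly.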

Some things are a little bit messy and opaque with __m128 though, and 
just making an aligned array of 4 Singles or 2 Doubles doesn't always 
work - it needs to be typecast through __m128 in some way - but I think 
that's mostly because C++ wasn't really designed with alignment in 
mind.  In Free Pascal, you have to make a bit of a messy union to ensure 
everything works; for example:

{$push}
{$codealign RECORDMIN=16}
{$PACKRECORDS C}
  { 16 bytes of filler whose only job is to force 16-byte alignment }
  type align_dummy = record
                       filler: array[0..1] of Double;
                     end;
{$pop}

  type complex = record
                   case Byte of
                   0: (
                        { never accessed; overlays re and im }
                        alignment: align_dummy;
                      );
                   1: (
                        re : real;
                        im : real;
                      );
                 end;

Trying to apply RECORDMIN=16 to complex directly just puts each field on 
a 16-byte boundary.  That's why I proposed allowing "align 16" between 
"end" and the semicolon, so that it can be done with relative ease: it's 
so easy to get wrong, and it's not obvious that it's wrong until you 
run performance benchmarks and look at the disassembly.

Gareth aka. Kit