[fpc-devel] vmul commutative optimization?
J. Gareth Moreton
gareth at moreton-family.com
Thu Nov 14 01:14:31 CET 2019
On 13/11/2019 16:03, Marco van de Voort wrote:
> Op 2019-11-12 om 20:46 schreef J. Gareth Moreton:
>> The Microsoft ABI is a bit restrictive when it comes to record types;
>> as described here
>> "Structs and unions of size 8, 16, 32, or 64 bits, and __m64 types,
>> are passed as if they were integers of the same size." So
>> unfortunately, a single-precision complex number is treated as a
>> 64-bit structure and passed as an integer. The System V ABI, on the
>> other hand, would pass the two entries through the lower 64 bits of
>> XMM0. Vectorcall, theoretically, should put the two components into
>> XMM0 and XMM1, because the complex type would be considered a
>> "homogeneous vector aggregate" (with floats as 1-dimensional vectors).
> I've found refs like
> so the question is if partial vectors (and specially 2 single 8-byte,
> since there are various special SSE opcodes to deal with them) are one
> register or not. The references I found usually talk about "vector
> types like _m128 and _m256", but don't really specify an exhaustive list.
> I guess that means testing with VS?
Testing with Visual Studio or even GCC under Windows is a good idea if
you want to be sure how particular record types are transferred. The
example given in that article has two fields of type __m128, even though
it looks like only one of the four vector elements is used initially.
Regardless, under the default Microsoft calling convention, that would
be passed by reference, just like a record of two Doubles. A (packed)
record of two Singles would be passed by value in an integer register,
just to cause trouble with conversions!
But to give a clear answer, under fastcall (the default Microsoft ABI),
a record of two Singles will be passed by value through an integer
register (RCX if it's the first parameter), and a record of two Doubles
will be passed by reference (pointer in RCX if it's the first
parameter). Under vectorcall, a record of two Singles would be treated
as a Homogeneous Float Aggregate, with its two fields passed in XMM0 and
XMM1, and the same thing happens with an unaligned record of two
Doubles. If a record of two Doubles is aligned to a 16-byte boundary
though, or is otherwise a union with a __m128 type (with the two Doubles
aliased to the lower and upper 64 bits respectively), then it can be
passed in its entirety through XMM0.
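
As a rough illustration of the size rule for the default Microsoft x64
convention, here is a minimal sketch. The type names and the helper
PassedInIntegerRegister are hypothetical, made up for this example; the
helper only models the "8, 16, 32, or 64 bits" rule quoted earlier and
says nothing about vectorcall. The two record sizes come out as 8 and 16
bytes respectively.

```pascal
program MsAbiRecordSizes;
{$mode objfpc}

type
  { Hypothetical type names, used only for this illustration }
  TComplexSingle = packed record
    re, im: Single;
  end;

  TComplexDouble = packed record
    re, im: Double;
  end;

{ Models the quoted rule: records of 8, 16, 32 or 64 bits are passed as
  integers of the same size; anything else is passed by reference.
  Assumption: default Microsoft x64 convention only, not vectorcall. }
function PassedInIntegerRegister(Size: SizeInt): Boolean;
begin
  Result := (Size = 1) or (Size = 2) or (Size = 4) or (Size = 8);
end;

begin
  WriteLn('TComplexSingle: ', SizeOf(TComplexSingle),
    ' bytes, by value in an integer register: ',
    PassedInIntegerRegister(SizeOf(TComplexSingle)));
  WriteLn('TComplexDouble: ', SizeOf(TComplexDouble),
    ' bytes, by value in an integer register: ',
    PassedInIntegerRegister(SizeOf(TComplexDouble)));
end.
```

This only predicts what the rule says should happen; checking the
disassembly of a real call is still the way to be sure.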
Some things are a little bit messy and opaque with __m128 though, and
just making an aligned array of 4 Singles or 2 Doubles doesn't always
work - it needs to be typecast through __m128 in some way - but I think
that's mostly because C++ wasn't really designed with alignment in
mind. In Free Pascal, you have to make a bit of a messy union to ensure
everything works; for example:
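{$push}
{$codealign RECORDMIN=16}
type align_dummy = record
  filler: array[0..1] of Double;
end;
{$pop}
type complex = record
  case Byte of
    0: (alignment: align_dummy); { field name assumed; forces 16-byte alignment }
    1: (re : real;
        im : real);
end;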
Trying to apply RECORDMIN=16 to complex directly just puts each field on
a 16-byte boundary. It's why I proposed allowing "align 16" between
"end" and the semicolon so it can be done with relative ease, since it's
so easy to get wrong and it's not obvious that it's wrong until you
run performance benchmarks and look at the disassembly.
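
A quick way to sanity-check the alignment without reading the
disassembly is to print the size of the record and the low four bits of
a variable's address. This is only a sketch: the variant field name
"alignment" and the use of {$codealign RECORDMIN=16} around the dummy
record are my assumptions about how the union is set up, and where the
compiler actually places a global variable may also depend on other
alignment settings.

```pascal
program AlignCheck;
{$mode objfpc}

{$push}
{$codealign RECORDMIN=16}
type
  { Dummy record whose only job is to carry a 16-byte alignment }
  align_dummy = record
    filler: array[0..1] of Double;
  end;
{$pop}

type
  complex = record
    case Byte of
      0: (alignment: align_dummy); { assumed field name }
      1: (re, im: Double);
  end;

var
  c: complex;

begin
  WriteLn('SizeOf(complex) = ', SizeOf(complex));
  { 0 here means this particular variable sits on a 16-byte boundary }
  WriteLn('address mod 16  = ', PtrUInt(@c) and 15);
end.
```

A result of 0 for one variable is not proof that every instance is
aligned, so it complements the benchmarks and disassembly rather than
replacing them.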
Gareth aka. Kit