[fpc-devel] vmul commutative optimization?

Fri Nov 15 13:01:36 CET 2019

On 14/11/2019 at 01:14, J. Gareth Moreton wrote:
>
>> I guess that means testing with VS?
>
> Testing with Visual Studio or even GCC under Windows is a good idea if 
> you want to be sure how particular record types are transferred.  The 
> example given in that article has two fields of type __m128, even 
> though it looks like only one of the four vector elements is used 
> initially.  Regardless, under the default Microsoft calling 
> convention, that would be passed by reference, just like a record of 
> two Doubles.  A (packed) record of two Singles would be passed by 
> value in an integer register, just to cause trouble with conversions!
>
To be clear: I meant testing whether a 64-bit vector of two singles is 
passed in an XMM register rather than an integer register under vectorcall.

It was meant more as a research point; I don't need it anymore. After 
realizing that I need either autovectorization or intrinsics, I simply 
started on a naive 1:1 translation to assembler (but with a complex 
stored as two singles in an XMM register). It took a bit of fiddling to 
define multiplication by j in XMM assembler (doing a NOT on one of the 
two singles), but otherwise it was simple.
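
For illustration, a minimal scalar sketch in C of the multiply-by-j step, 
assuming the "NOT" amounts to flipping the sign bit of one single. This is 
not the actual assembler: the type complex32 and both function names are 
made up for the example, and in XMM code the same idea would presumably be 
a lane swap followed by an XOR with a sign mask.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical type: one complex number stored as two singles (re, im),
   i.e. what would sit in the low 64 bits of an XMM register. */
typedef struct { float re, im; } complex32;

/* Flip the IEEE-754 sign bit of a single, the scalar equivalent of
   XOR-ing one lane with a 0x80000000 mask. */
static float flip_sign(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits ^= UINT32_C(0x80000000);
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* (re + j*im) * j = -im + j*re:
   swap the two lanes, then negate the new real part. */
static complex32 mul_j(complex32 z) {
    complex32 r;
    r.re = flip_sign(z.im);
    r.im = z.re;
    return r;
}
```

With z = 3 + 4j, mul_j(z) gives -4 + 3j, and applying it four times 
returns the original value.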

I finished the first stage (the radix functions for the radices that I 
use: 4, 5 and 10) and got things working, and both run time and 
instruction count dropped by a factor of 3 (not entirely logical, since 
the asm version has relatively more complex instructions).

> Under vectorcall, a record of two Singles would be treated as a 
> Homogeneous Float Aggregate and pass the two fields in XMM0 and XMM1

Afaik FPC doesn't do that yet; it is passed in an integer register. 
Pity; as an __m64 register it would have been nice for complex-with-singles.
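
To make "passed in an int register" concrete, here is a small C sketch of 
what the callee effectively receives: the 8-byte record as one 64-bit 
integer image, which has to be moved back into float form before any 
arithmetic. The type complex32 and the function names are hypothetical, 
and the sketch only models the bit image, not any compiler's actual code 
generation.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical record of two singles, 8 bytes in total. */
typedef struct { float re, im; } complex32;
_Static_assert(sizeof(complex32) == sizeof(uint64_t),
               "two singles must fill exactly 64 bits");

/* Model of the record travelling by value in a 64-bit integer register.
   On little-endian x86-64 the low 32 bits hold re, the high 32 bits im. */
static uint64_t to_int_register(complex32 z) {
    uint64_t image;
    memcpy(&image, &z, sizeof image);
    return image;
}

/* The callee has to reinterpret the integer image as floats again,
   which is the extra conversion step that an XMM-passed value avoids. */
static complex32 from_int_register(uint64_t image) {
    complex32 z;
    memcpy(&z, &image, sizeof z);
    return z;
}
```

The round trip is lossless; the cost is only the extra move between the 
integer and vector register files.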

> , and the same thing happens with an unaligned record of two Doubles.  
> If a record of two Doubles is aligned to a 16-byte boundary though, or 
> is otherwise a union with a __m128 type (with the two Doubles aliased 
> to the lower and upper 64 bits respectively), then it can be passed in 
> its entirety through XMM0.
>
> Some things are a little bit messy and opaque with __m128 though, and 
> just making an aligned array of 4 Singles or 2 Doubles doesn't always 
> work - it needs to be typecast through __m128 in some way - but I 
> think that's mostly because C++ wasn't really designed with alignment 
> in mind.  In Free Pascal, you have to make a bit of a messy union to 
> ensure everything works; for example:
>
I already use that union, copied from your patch but changed to 
singles. It doesn't do much, though.