[fpc-devel] vmul commutative optimization?
Marco van de Voort
core at pascalprogramming.org
Fri Nov 15 13:01:36 CET 2019
On 14/11/2019 at 01:14, J. Gareth Moreton wrote:
>> I guess that means testing with VS?
> Testing with Visual Studio or even GCC under Windows is a good idea if
> you want to be sure how particular record types are transferred. The
> example given in that article has two fields of type __m128, even
> though it looks like only one of the four vector elements are used
> initially. Regardless, under the default Microsoft calling
> convention, that would be passed by reference, just like a record of
> two Doubles. A (packed) record of two Singles would be passed by
> value in an integer register, just to cause trouble with conversions!
To be clear: I meant whether, with vectorcall, a 64-bit vector of two
singles is passed in an XMM register rather than in an integer register.
It was meant more as a research point; I don't need it anymore. After
realizing that I need either autovectorization or intrinsics, I simply
started doing a naive 1:1 translation to assembler (but with a complex
stored as two singles in an XMM register). It took a bit of fiddling to
define multiplication by j in XMM assembler (doing a NOT on one of the two
singles), but otherwise it was simple.
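A minimal sketch of the multiply-by-j step, written in C with SSE
intrinsics rather than hand-written assembler. The lane layout (real part
in lane 0, imaginary part in lane 1), the helper names and the use of an
XOR on the sign bit are all assumptions made for illustration; this is not
the actual code described above. It relies on (a + bj) * j = -b + aj, so
the two singles are swapped and the sign of the new real part is flipped.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_shuffle_ps, _mm_xor_ps, ... */

/* Hypothetical layout: lane 0 = real part, lane 1 = imaginary part,
   lanes 2 and 3 unused and set to zero. */
static __m128 cplx_make(float re, float im)
{
    return _mm_set_ps(0.0f, 0.0f, im, re);
}

/* Multiply by j: (a + bj) * j = -b + aj.
   Step 1: swap lanes 0 and 1, giving (b, a).
   Step 2: flip the sign bit of lane 0, giving (-b, a). */
static __m128 cplx_mul_j(__m128 z)
{
    __m128 swapped = _mm_shuffle_ps(z, z, _MM_SHUFFLE(3, 2, 0, 1));
    /* -0.0f has only the sign bit set, so XOR with it negates lane 0. */
    const __m128 sign_lane0 = _mm_set_ps(0.0f, 0.0f, 0.0f, -0.0f);
    return _mm_xor_ps(swapped, sign_lane0);
}

/* Extract the real part (lane 0). */
static float cplx_re(__m128 z)
{
    return _mm_cvtss_f32(z);
}

/* Extract the imaginary part (lane 1) by moving it into lane 0 first. */
static float cplx_im(__m128 z)
{
    return _mm_cvtss_f32(_mm_shuffle_ps(z, z, _MM_SHUFFLE(3, 2, 1, 1)));
}
```

With this layout, cplx_mul_j(cplx_make(3.0f, 4.0f)) holds -4 in the real
lane and 3 in the imaginary lane, and applying it four times returns the
original value. In assembler the same two steps would map to one shuffle
and one XOR against a constant mask.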
I finished the first stage (the radix functions for the radices that I use:
4, 5 and 10) and got things working, and both run time and instruction
count dropped by a factor of 3 (not entirely logical, since the asm
version has relatively more complex instructions).
> Under vectorcall, a record of two Singles would be treated as a
> Homogeneous Float Aggregate and pass the two fields in XMM0 and XMM1
Afaik FPC doesn't do that yet; it is passed in an integer register. Pity:
passed as an __m64 in a register, it would have been nice for
complex-with-singles.
> , and the same thing happens with an unaligned record of two Doubles.
> If a record of two Doubles is aligned to a 16-byte boundary though, or
> is otherwise a union with a __m128 type (with the two Doubles aliased
> to the lower and upper 64 bits respectively), then it can be passed in
> its entirety through XMM0.
> Some things are a little bit messy and opaque with __m128 though, and
> just making an aligned array of 4 Singles or 2 Doubles doesn't always
> work - it needs to be typecast through __m128 in some way - but I
> think that's mostly because C++ wasn't really designed with alignment
> in mind. In Free Pascal, you have to make a bit of a messy union to
> ensure everything works; for example:
I already use that union, copied from your patch but changed to
singles. It doesn't do much, though.