[fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!

Wed Oct 23 23:06:46 CEST 2019

Hmmm, that is unfortunate if the horizontal operations are inefficient.
I looked them up in Agner Fog's instruction tables at
https://www.agner.org/optimize/instruction_tables.pdf and you are right
that HADDPS has a surprisingly high latency (roughly, the number of
cycles before its result can be used by the next instruction).  HADDPD
isn't as bad, probably because it only deals with 2 Doubles instead of
4 Singles, and it seems mostly equivalent in speed to the multiplication
instructions.

Using just SSE2:

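mulpd %xmm0,%xmm0 { re*re in the low half, im*im in the high half }
movapd %xmm0,%xmm1 { SSE2 shufpd overwrites its destination, so copy first }
shufpd $1,%xmm1,%xmm1 { Swaps the halves: im*im is now in the low half }
addsd %xmm1,%xmm0 { Low half: re*re + im*im }
sqrtsd %xmm0,%xmm0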

Ultimately it's not much better than what you have:

movapd %xmm0,%xmm1 { Copy and shuffle only needed if both fields are in %xmm0 }
shufpd $1,%xmm1,%xmm1 { Swaps the halves: "im" is now in the low half }
mulsd %xmm0,%xmm0
mulsd %xmm1,%xmm1
addsd %xmm1,%xmm0
sqrtsd %xmm0,%xmm0

If you trace the dependencies between the instructions (the shufpd and
the first mulsd don't depend on each other and can run simultaneously,
or equivalently, the two mulsd instructions can), both sequences still
come to a chain of 4 dependent steps, i.e. 4 cycles if you assume each
instruction takes the same time to execute (which they don't, but it's
a reasonable approximation).  The subroutines are also probably too
small to time accurately on their own.  It might be something to
experiment with though - I would hope at the very least that the
horizontal operations have improved on more recent processors.
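
For reference, both sequences above compute the modulus
Sqrt(re*re + im*im).  A minimal Pascal sketch of that routine, assuming
a record with two Double fields named re and im (the names cmod and
ModulusSketch are only for illustration, and vectorcall is left out so
that it builds on any target):

```pascal
program ModulusSketch;

{$mode objfpc}

type
  { Assumed layout: two Doubles, real part first }
  complex = record
    re, im: Double;
  end;

{ Hypothetical helper: modulus of z, i.e. Sqrt(re*re + im*im) }
function cmod(const z: complex): Double;
begin
  { The two products are independent of each other; only the addition
    and the square root have to wait for earlier results }
  Result := Sqrt(z.re * z.re + z.im * z.im);
end;

var
  z: complex;
begin
  z.re := 3.0;
  z.im := 4.0;
  WriteLn(cmod(z):0:1);
end.
```

With the values above this prints 5.0.  Which of the instruction
sequences the compiler actually picks for it depends on the compiler
version and optimisation settings.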

I do know, though, that vectorising is, by and large, a net gain.  For
example, let's take the simpler case of adding two complex numbers
together:

   operator + (z1, z2 : complex) z : complex; vectorcall;
   {$ifdef TEST_INLINE}
   inline;
   {$endif TEST_INLINE}
     { addition : z := z1 + z2 }
     begin
        z.re := z1.re + z2.re;
        z.im := z1.im + z2.im;
     end;

No horizontal adds here, just a single packed addition that leaves the
result in %xmm0, as opposed to two scalar additions followed by
combining the results in whatever way the calling convention demands
(if aligned, it's all in %xmm0; if unaligned, I think %xmm0 and %xmm1
are supposed to be used).  Mind you, when TEST_INLINE is defined the
function is inlined, so the parameter passing doesn't always apply.
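
As a quick check of the operator itself, here is a small self-contained
sketch.  It assumes complex is a plain record of two Doubles; the
program name and the test values are made up, and vectorcall and the
TEST_INLINE switch are dropped so that it compiles on any target:

```pascal
program ComplexAddSketch;

{$mode objfpc}

type
  { Assumed layout: two Doubles, real part first }
  complex = record
    re, im: Double;
  end;

{ Element-wise addition: both fields are treated identically, which is
  what makes a single packed add possible when they share a register }
operator + (z1, z2: complex) z: complex;
begin
  z.re := z1.re + z2.re;
  z.im := z1.im + z2.im;
end;

var
  a, b, c: complex;
begin
  a.re := 1.5;  a.im := -2.0;
  b.re := 0.5;  b.im := 4.0;
  c := a + b;
  WriteLn(c.re:0:1, ' ', c.im:0:1);
end.
```

This prints "2.0 2.0".  Whether the compiler emits one packed addition
or two scalar ones for it depends on the version, the target and the
optimisation settings, so inspect the generated assembly (for example
with the -al option) rather than take it on trust.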

Once again though, I was surprised at how inefficient HADDPS is once you 
pointed it out.  The double-precision versions aren't nearly as bad 
though, so maybe they can still be used.

Gareth aka. Kit

P.S. As far as 128-bit aligned vector types are concerned, vectorcall
and the System V ABI can be considered equivalent.  Vectorcall can use
more XMM registers for return values and accepts more complex
aggregates as parameters, but in our examples we don't have to worry
about that yet.

On 23/10/2019 21:20, Florian Klämpfl wrote:
> On 22.10.19 at 05:01, J. Gareth Moreton wrote:
>>
>> mulpd    %xmm0, %xmm0 { Calculates "re * re" and "im * im" simultaneously }
>> haddpd    %xmm0, %xmm0 { Adds the above multiplications together (horizontal add) }
>
> Unfortunately, those horizontal operations are normally not very
> efficient IIRC.
