[fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!

Wed Oct 23 01:13:57 CEST 2019

That's definitely a marked improvement.  Under the System V ABI and 
vectorcall, both fields of a complex type would be passed through xmm0.  
Splitting it up into two separate registers would require something like:

movhlps    %xmm0,%xmm1 { Copy the high-order Double of %xmm0 into the 
low-order position of %xmm1 - the upper 64 bits of %xmm1 are left as 
they were, which is fine since we're not concerned with them }

After which your compiled code will work correctly (since it looks like 
%xmm1 was undefined before):

mulsd    %xmm0,%xmm0
mulsd    %xmm1,%xmm1
addsd    %xmm0,%xmm1 { In terms of register usage, the best 
combination of instructions here would be "addsd %xmm1,%xmm0" then 
"sqrtsd %xmm0,%xmm0", since %xmm1 is released for other purposes one 
instruction sooner }
sqrtsd    %xmm1,%xmm0
ret
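
As a sanity check on the register usage above, here is a small Python 
model of the sequence. It is only a sketch under stated assumptions: 
each XMM register is represented as a (low, high) pair of floats, the 
whole record is assumed to arrive in %xmm0 with "re" in the low half 
and "im" in the high half, and all helper names are made up for 
illustration (they are not part of FPC or any library).

```python
import math

# Each XMM register is modelled as a (low, high) pair of Python floats.
# All helper names below are hypothetical and exist only for this sketch.

def copy_high_to_low(src, dst):
    """Split step: high Double of src goes to the low half of dst."""
    return (src[1], dst[1])

def mulsd(src, dst):
    """Scalar multiply: only the low half of dst changes."""
    return (dst[0] * src[0], dst[1])

def addsd(src, dst):
    """Scalar add: only the low half of dst changes."""
    return (dst[0] + src[0], dst[1])

def sqrtsd(src, dst):
    """Scalar square root of src's low half, written to dst's low half."""
    return (math.sqrt(src[0]), dst[1])

def cmod_scalar(re, im):
    xmm0 = (re, im)                      # assumption: record passed in %xmm0
    xmm1 = (float("nan"), float("nan"))  # %xmm1 is undefined on entry
    xmm1 = copy_high_to_low(xmm0, xmm1)  # im now in the low half of %xmm1
    xmm0 = mulsd(xmm0, xmm0)             # re * re
    xmm1 = mulsd(xmm1, xmm1)             # im * im
    xmm1 = addsd(xmm0, xmm1)             # re*re + im*im
    xmm0 = sqrtsd(xmm1, xmm0)            # result in the low half of %xmm0
    return xmm0[0]

print(cmod_scalar(3.0, 4.0))  # prints 5.0
```

Seeding %xmm1 with NaN makes it obvious if any step accidentally reads 
the undefined register before it has been filled in.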

Otherwise you'd have to load the data by reference (%rcx under 
win64, and %rdi under other x86_64 platforms) - for example:

movsd    (%rcx),%xmm0
movsd    8(%rcx),%xmm1

I would be interested to see the patch when it's ready.

Under SSE2 (no horizontal add), I think the best set of 
instructions (assuming the entirety of the parameter is passed through 
%xmm0) is:

mulpd    %xmm0,%xmm0
movhlps    %xmm0,%xmm1
addsd    %xmm1,%xmm0
sqrtsd    %xmm0,%xmm0
ret
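
The packed variant can be checked the same way. The Python sketch below 
models each XMM register as a (low, high) pair of floats and walks 
through the four instructions in order. It assumes the record arrives 
in %xmm0 with "re" in the low half and "im" in the high half; the 
helper names are invented for this illustration only.

```python
import math

# Each XMM register is modelled as a (low, high) pair of Python floats.
# All helper names below are hypothetical and exist only for this sketch.

def mulpd(src, dst):
    """Packed multiply: both halves are multiplied lane by lane."""
    return (dst[0] * src[0], dst[1] * src[1])

def copy_high_to_low(src, dst):
    """High Double of src goes to the low half of dst."""
    return (src[1], dst[1])

def addsd(src, dst):
    """Scalar add: only the low half of dst changes."""
    return (dst[0] + src[0], dst[1])

def sqrtsd(src, dst):
    """Scalar square root of src's low half, written to dst's low half."""
    return (math.sqrt(src[0]), dst[1])

def cmod_packed(re, im):
    xmm0 = (re, im)                      # assumption: record passed in %xmm0
    xmm1 = (float("nan"), float("nan"))  # %xmm1 is undefined on entry
    xmm0 = mulpd(xmm0, xmm0)             # (re*re, im*im) in one instruction
    xmm1 = copy_high_to_low(xmm0, xmm1)  # im*im into the low half of %xmm1
    xmm0 = addsd(xmm1, xmm0)             # re*re + im*im
    xmm0 = sqrtsd(xmm0, xmm0)            # result in the low half of %xmm0
    return xmm0[0]

print(cmod_packed(3.0, 4.0))  # prints 5.0
```

Compared with the scalar version, the single packed multiply replaces 
two scalar ones, and only the low halves matter from the add onwards.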

The main motivation in my eyes is the fact that it removes one of the 
multiplication instructions - mind you, on a modern processor, a pair of 
"mulsd" instructions working on independent data will be executed 
simultaneously, in which case the only time a cycle-counting improvement 
becomes visible is if the core is hyperthreaded and another thread is 
using the ALUs.  Of course, a sufficiently-skilled assembler programmer 
will be able to beat the compiler in many cases, but it's still a target 
to strive for.

Gareth aka. Kit

On 22/10/2019 22:03, Florian Klämpfl wrote:
> Am 22.10.19 um 05:01 schrieb J. Gareth Moreton:
>
>>
>> Bigger challenges would be optimising the modulus of a complex number:
>>
>>    function cmod (z : complex): real; vectorcall;
>>      { module : r = |z| }
>>      begin
>>         with z do
>>           cmod := sqrt((re * re) + (im * im));
>>      end;
>>
>> A perfect compiler with permission to use SSE3 (for haddpd) should 
>> generate the following (note that no stack frame is required):
>>
>> mulpd    %xmm0, %xmm0 { Calculates "re * re" and "im * im" 
>> simultaneously }
>> haddpd    %xmm0, %xmm0 { Adds the above multiplications together 
>> (horizontal add) }
>> sqrtsd    %xmm0, %xmm0
>> ret
>>
>> Currently, with vectorcall, the routine compiles into this:
>>
>> leaq    -24(%rsp),%rsp
>> movdqa    %xmm0,(%rsp)
>> movq    %rsp,%rax
>> movsd    (%rax),%xmm1
>> mulsd    %xmm1,%xmm1
>> movsd    8(%rax),%xmm0
>> mulsd    %xmm0,%xmm0
>> addsd    %xmm1,%xmm0
>> sqrtsd    %xmm0,%xmm0
>> leaq    24(%rsp),%rsp
>> ret
>>
>> And without vectorcall (or with an unaligned record type):
>>
>> leaq    -24(%rsp),%rsp
>> movq    %rcx,%rax
>> movq    (%rax),%rdx
>> movq    %rdx,(%rsp)
>> movq    8(%rax),%rax
>> movq    %rax,8(%rsp)
>> movq    %rsp,%rax
>> movsd    (%rax),%xmm1
>> mulsd    %xmm1,%xmm1
>> movsd    8(%rax),%xmm0
>> mulsd    %xmm0,%xmm0
>> addsd    %xmm1,%xmm0
>> sqrtsd    %xmm0,%xmm0
>> leaq    24(%rsp),%rsp
>> ret
>>
>
> With a few additions (the git patch is less than 500 lines) in the 
> compiler I get (it is not ready for committing yet):
>
> .section .text.n_p$program_$$_cmod$complex$$real,"ax"
>     .balign 16,0x90
> .globl    P$PROGRAM_$$_CMOD$COMPLEX$$REAL
>     .type    P$PROGRAM_$$_CMOD$COMPLEX$$REAL,@function
> P$PROGRAM_$$_CMOD$COMPLEX$$REAL:
> .Lc2:
> # Var $result located in register xmm0
> # Var z located in register xmm0
> # [test.pp]
> # [20] begin
> # [22] cmod := sqrt((re * re) + (im * im));
>     mulsd    %xmm0,%xmm0
>     mulsd    %xmm1,%xmm1
>     addsd    %xmm0,%xmm1
>     sqrtsd    %xmm1,%xmm0
> # Var $result located in register xmm0
> .Lc3:
> # [23] end;
>     ret
> .Lc1:
> .Le0:
>     .size    P$PROGRAM_$$_CMOD$COMPLEX$$REAL, .Le0 - P$PROGRAM_$$_CMOD$COMPLEX$$REAL
>
> It mainly keeps records in mm registers. I am not sure about the right 
> approach yet. But to allocate one register to each field of suitable 
> records seems to be a reasonable approach.
