[fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!
J. Gareth Moreton
gareth at moreton-family.com
Wed Oct 23 01:13:57 CEST 2019
That's definitely a marked improvement. Under the System V ABI and
vectorcall, both fields of a complex type would be passed through xmm0.
Splitting it up into two separate registers would require something like:
movhlps %xmm0,%xmm1 { Copy the high-order Double of %xmm0 into the
low-order position of %xmm1 - the upper 64 bits of %xmm1 are left
unchanged, which is fine since we're not concerned with them }
After which your compiled code will work correctly (since it looks like
%xmm1 was undefined before):
mulsd %xmm0,%xmm0
mulsd %xmm1,%xmm1
addsd %xmm0,%xmm1 { In terms of register usage, the most optimal
combination of instructions here would be "addsd %xmm1,%xmm0" then
"sqrtsd %xmm0,%xmm0", since %xmm1 is released for other purposes one
instruction sooner }
sqrtsd %xmm1,%xmm0
ret
Otherwise you'd have to load in the data from reference (%rcx under
win64, and %rdi under other x86_64 platforms) - for example:
movsd (%rcx),%xmm0
movsd 8(%rcx),%xmm1
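As a rough sanity check on the two displacements used above (0 and 8), here is a minimal sketch that prints the size of a two-Double record and the offset of its second field. TComplexSketch and ImOffset are hypothetical names made up for this illustration, standing in for the complex type under discussion; this is not code from the compiler or the patch.

```pascal
program FieldOffsets;
{$mode objfpc}

type
  { Hypothetical stand-in for the complex type discussed in this thread. }
  TComplexSketch = record
    re, im: Double;
  end;

{ Returns the distance in bytes between the re and im fields. }
function ImOffset: PtrUInt;
var
  z: TComplexSketch;
begin
  z.re := 0.0;
  z.im := 0.0;
  Result := PtrUInt(@z.im) - PtrUInt(@z.re);
end;

begin
  WriteLn('SizeOf(TComplexSketch) = ', SizeOf(TComplexSketch));
  WriteLn('offset of im           = ', ImOffset);
end.
```

On x86_64 I would expect this to report a size of 16 and an offset of 8, which is why the second load reads from 8(%rcx).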
I would be interested to see the patch when it's ready.
Under SSE2 (no horizontal add), I think the most optimal set of
instructions (assuming the entirety of the parameter is passed through
%xmm0) is:
mulpd %xmm0,%xmm0
movhlps %xmm0,%xmm1
addsd %xmm1,%xmm0
sqrtsd %xmm0,%xmm0
ret
The main motivation in my eyes is the fact that it removes one of the
multiplication instructions - mind you, on a modern processor, a pair of
"mulsd" instructions working on independent data can be executed
simultaneously, in which case the only time a cycle-counting improvement
becomes visible is if the core is hyperthreaded and another thread is
competing for the same execution units. Of course, a sufficiently skilled
assembler programmer will be able to beat the compiler in many cases, but
it's still a target to strive for.
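To make the data flow of the SSE2 sequence above easier to follow, here is a small sketch that models each step on a pair of Double lanes in plain Pascal. It only illustrates the intended lane movements and does not show what the compiler emits. TXmm, MulPD, HighToLow, AddSD, SqrtSD and CModPacked are all hypothetical names made up for this example, and operands follow the AT&T order (source first, destination second).

```pascal
program CModLanes;
{$mode objfpc}

type
  { Hypothetical model of one XMM register: two Double lanes. }
  TXmm = record
    lo, hi: Double;
  end;

{ Models mulpd: both lanes are multiplied independently. }
function MulPD(src, dest: TXmm): TXmm;
begin
  Result.lo := dest.lo * src.lo;
  Result.hi := dest.hi * src.hi;
end;

{ Models the lane copy: src.hi lands in the low lane, dest.hi is kept. }
function HighToLow(src, dest: TXmm): TXmm;
begin
  Result.lo := src.hi;
  Result.hi := dest.hi;
end;

{ Models addsd: only the low lanes are added, dest.hi is kept. }
function AddSD(src, dest: TXmm): TXmm;
begin
  Result.lo := dest.lo + src.lo;
  Result.hi := dest.hi;
end;

{ Models sqrtsd: square root of src.lo, dest.hi is kept. }
function SqrtSD(src, dest: TXmm): TXmm;
begin
  Result.lo := Sqrt(src.lo);
  Result.hi := dest.hi;
end;

{ Walks through the four steps with re in the low lane and im in the high lane. }
function CModPacked(re, im: Double): Double;
var
  x0, x1: TXmm;
begin
  x0.lo := re;
  x0.hi := im;
  x1.lo := 0.0;  { stands in for the undefined contents of %xmm1 }
  x1.hi := 0.0;
  x0 := MulPD(x0, x0);      { lo = re*re, hi = im*im }
  x1 := HighToLow(x0, x1);  { x1.lo = im*im }
  x0 := AddSD(x1, x0);      { x0.lo = re*re + im*im }
  x0 := SqrtSD(x0, x0);     { x0.lo = sqrt(re*re + im*im) }
  Result := x0.lo;
end;

begin
  WriteLn(CModPacked(3.0, 4.0):0:6);
end.
```

For the 3-4-5 triangle this prints 5.000000. The point of the model is that the upper lane of %xmm1 never feeds into the result, so its initial contents do not matter.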
Gareth aka. Kit
On 22/10/2019 22:03, Florian Klämpfl wrote:
> Am 22.10.19 um 05:01 schrieb J. Gareth Moreton:
>
>>
>> Bigger challenges would be optimising the modulus of a complex number:
>>
>> function cmod (z : complex): real; vectorcall;
>> { module : r = |z| }
>> begin
>> with z do
>> cmod := sqrt((re * re) + (im * im));
>> end;
>>
>> A perfect compiler with permission to use SSE3 (for haddpd) should
>> generate the following (note that no stack frame is required):
>>
>> mulpd %xmm0, %xmm0 { Calculates "re * re" and "im * im"
>> simultaneously }
>> haddpd %xmm0, %xmm0 { Adds the above multiplications together
>> (horizontal add) }
>> sqrtsd %xmm0, %xmm0
>> ret
>>
>> Currently, with vectorcall, the routine compiles into this:
>>
>> leaq -24(%rsp),%rsp
>> movdqa %xmm0,(%rsp)
>> movq %rsp,%rax
>> movsd (%rax),%xmm1
>> mulsd %xmm1,%xmm1
>> movsd 8(%rax),%xmm0
>> mulsd %xmm0,%xmm0
>> addsd %xmm1,%xmm0
>> sqrtsd %xmm0,%xmm0
>> leaq 24(%rsp),%rsp
>> ret
>>
>> And without vectorcall (or an unaligned record type):
>>
>> leaq -24(%rsp),%rsp
>> movq %rcx,%rax
>> movq (%rax),%rdx
>> movq %rdx,(%rsp)
>> movq 8(%rax),%rax
>> movq %rax,8(%rsp)
>> movq %rsp,%rax
>> movsd (%rax),%xmm1
>> mulsd %xmm1,%xmm1
>> movsd 8(%rax),%xmm0
>> mulsd %xmm0,%xmm0
>> addsd %xmm1,%xmm0
>> sqrtsd %xmm0,%xmm0
>> leaq 24(%rsp),%rsp
>> ret
>>
>
> With a few additions (the git patch is less than 500 lines) in the
> compiler I get (it is not ready for committing yet):
>
> .section .text.n_p$program_$$_cmod$complex$$real,"ax"
> .balign 16,0x90
> .globl P$PROGRAM_$$_CMOD$COMPLEX$$REAL
> .type P$PROGRAM_$$_CMOD$COMPLEX$$REAL,@function
> P$PROGRAM_$$_CMOD$COMPLEX$$REAL:
> .Lc2:
> # Var $result located in register xmm0
> # Var z located in register xmm0
> # [test.pp]
> # [20] begin
> # [22] cmod := sqrt((re * re) + (im * im));
> mulsd %xmm0,%xmm0
> mulsd %xmm1,%xmm1
> addsd %xmm0,%xmm1
> sqrtsd %xmm1,%xmm0
> # Var $result located in register xmm0
> .Lc3:
> # [23] end;
> ret
> .Lc1:
> .Le0:
> .size P$PROGRAM_$$_CMOD$COMPLEX$$REAL, .Le0 - P$PROGRAM_$$_CMOD$COMPLEX$$REAL
>
> It mainly keeps records in mm registers. I am not sure about the right
> approach yet. But to allocate one register to each field of suitable
> records seems to be a reasonable approach.