[fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!
J. Gareth Moreton
gareth at moreton-family.com
Tue Oct 22 05:01:46 CEST 2019
This is a long read, so strap in!
Well, I finally got it to work - the required type definition was as follows:
  {$push}
  {$codealign RECORDMIN=16}
  {$PACKRECORDS C}

  { This record forces "complex" to be aligned to a 16-byte boundary }
  type
    align_dummy = record
      filler: array[0..1] of real;
    end;
  {$pop}

  type
    complex = record
      case Byte of
        0: (
          alignment: align_dummy;
        );
        1: (
          re: real;
          im: real;
        );
    end;
It is so, so easy to get wrong, because if align_dummy's field is 1, 2, 4
or 8 bytes in size, it is classified as an integer under Windows, which
overrides the Double type in the union and causes the entire record to
still be passed by reference. Additionally, the dummy field has to be
of type Single or Double (or Real); if it is an integral type (e.g.
"array[0..15] of Byte"), it is once again classified as an integer and
overrides the Double type as per the System V ABI's parameter
classification rules (in other words, the entire record would get passed
by reference under both x86_64-win64 and x86_64-linux). Long story
short, this is an absolute minefield!!
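To make the trap concrete, here is a hypothetical variant (type names
invented for illustration) that looks plausible but silently defeats
the whole exercise:

  { Hypothetical counter-example - do NOT use this.  The filler is the  }
  { right size (16 bytes) but integral, so both ABIs classify it as     }
  { INTEGER; that overrides the Double classification from re/im and    }
  { the record ends up being passed by reference after all.             }
  type
    bad_align_dummy = record
      filler: array[0..15] of Byte;
    end;

    bad_complex = record
      case Byte of
        0: (alignment: bad_align_dummy);
        1: (re, im: real);
    end;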
I still seriously think that having an alignment attribute or something
similar would make life so much easier for third-party developers who may
not know the exact quirks of how x86_64 classifies its parameters. To me,
this trick feels incredibly hacky and very hard to get right.
Compiled code isn't perfect though - for example, when moving parameters
to and from the relevant XMM registers, the "movdqa" instruction is used
instead of "movapd", which incurs a performance penalty because the
internal CPU state has to switch between the double-precision and integer
domains (this is why, for example, there are separate VINSERTF128 and
VINSERTI128 instructions, even though they superficially do the same
thing). Additionally, inlined vectorcall routines still seem to fall
back on using movq to transfer 8 bytes at a time between a function
result and wherever it is to be stored, but this is because everything
is decomposed at the node level and the compiler currently lacks any
decent vectorisation algorithms.
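To illustrate the domain issue, here are the two encodings side by side
(a hand-written sketch; whether and how much the penalty bites depends on
the particular microarchitecture):

  movdqa %xmm0,(%rsp)  { integer-domain store: a following FP read of    }
                       { (%rsp) may pay a bypass (domain-crossing) delay }
  movapd %xmm0,(%rsp)  { identical architectural effect, but stays in    }
                       { the double-precision domain - no such delay     }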
Nevertheless, I think I'm ready to prepare a patch for uComplex for
evaluation, and it's given me some things to play with to see if the
compiler can be made to work better with packed data. I figure the
uComplex unit is a good place to start because its complex type is
internally just a pair of Doubles and a lot of its operations, such as
addition, are component-wise.
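For instance, component-wise addition could look something like this (a
sketch only - the actual uComplex declarations may differ):

  operator + (z1, z2: complex) r: complex; vectorcall;
  begin
    { Each component is independent, so an ideal compiler could fuse }
    { both additions into a single "addpd %xmm1,%xmm0" followed by a }
    { ret, with no stack frame at all.                               }
    r.re := z1.re + z2.re;
    r.im := z1.im + z2.im;
  end;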
A bigger challenge would be optimising the modulus of a complex number:
  function cmod(z: complex): real; vectorcall;
  { modulus: r = |z| }
  begin
    with z do
      cmod := sqrt((re * re) + (im * im));
  end;
A perfect compiler with permission to use SSE3 (for haddpd) should
generate the following (note that no stack frame is required):
  mulpd  %xmm0,%xmm0  { calculates "re * re" and "im * im" simultaneously }
  haddpd %xmm0,%xmm0  { adds the two products together (horizontal add) }
  sqrtsd %xmm0,%xmm0
  ret
Currently, with vectorcall, the routine compiles into this:
  leaq   -24(%rsp),%rsp
  movdqa %xmm0,(%rsp)
  movq   %rsp,%rax
  movsd  (%rax),%xmm1
  mulsd  %xmm1,%xmm1
  movsd  8(%rax),%xmm0
  mulsd  %xmm0,%xmm0
  addsd  %xmm1,%xmm0
  sqrtsd %xmm0,%xmm0
  leaq   24(%rsp),%rsp
  ret
And without vectorcall (or an unaligned record type):
  leaq   -24(%rsp),%rsp
  movq   %rcx,%rax
  movq   (%rax),%rdx
  movq   %rdx,(%rsp)
  movq   8(%rax),%rax
  movq   %rax,8(%rsp)
  movq   %rsp,%rax
  movsd  (%rax),%xmm1
  mulsd  %xmm1,%xmm1
  movsd  8(%rax),%xmm0
  mulsd  %xmm0,%xmm0
  addsd  %xmm1,%xmm0
  sqrtsd %xmm0,%xmm0
  leaq   24(%rsp),%rsp
  ret
Maybe I'm in the minority here, and I'm definitely getting ahead of
myself, but seeing ways of improving the compiled assembly language
excites me! Even without vectorcall, I want to see if I can get my deep
optimiser into a workable form, because sequences like "movq %rsp,%rax"
followed by reads through %rax are completely unnecessary. Also, consider
things like this:
  ...
  movdqa %xmm0,(%rsp)
  movq   %rsp,%rax
  movsd  (%rax),%xmm1
  ...
Just... why?! Just do "movsd %xmm0,%xmm1"!! The peephole optimiser may
struggle to spot this as it stands because of the inefficient mixing of
integer and floating-point XMM instructions - and of course, it might be
the case that the original contents of %xmm0 are needed later. This is
where my deep optimiser, or some other form of data-flow analysis, would
come into play. Playing the logical flow through in my head, I can see it
optimising the triplet as follows:
1. Notice that %rax = %rsp and rewrite the movsd instruction accordingly
to minimise a pipeline stall (the later "movsd 8(%rax),%xmm0" instruction
would be rewritten too):

  ...
  movdqa %xmm0,(%rsp)
  movq   %rsp,%rax
  movsd  (%rsp),%xmm1
  ...
2. Notice that %rax is now never used, so "movq %rsp,%rax" can be safely
removed:

  ...
  movdqa %xmm0,(%rsp)
  movsd  (%rsp),%xmm1
  ...
3. Notice that what's being read from the stack is equal to %xmm0 at this
point, so read from %xmm0 directly to prevent a pipeline stall:

  ...
  movdqa %xmm0,(%rsp)
  movsd  %xmm0,%xmm1
  ...
It might not be able to remove the movdqa instruction because a later
instruction reads from 8(%rsp), but vectorisation improvements will help
mitigate this.
Okay, enough theorising, but I think my contagious enthusiasm is back!
Gareth aka. Kit