[fpc-devel] Difficulty in specifying record alignment... and more compiler optimisation shenanigans!
J. Gareth Moreton
gareth at moreton-family.com
Tue Oct 22 05:01:46 CEST 2019
This is a long read, so strap in!
Well, I finally got it to work - the required type definition was as follows:
  {$push}
  {$codealign RECORDMIN=16}
  {$PACKRECORDS C}

  { This record forces "complex" to be aligned to a 16-byte boundary }
  type
    align_dummy = record
      filler: array[0..1] of real;
    end;
  {$pop}

  type
    complex = record
      case Byte of
        0: (
          alignment: align_dummy;
        );
        1: (
          re: real;
          im: real;
        );
    end;
It is so, so easy to get wrong, because if align_dummy's field is 1, 2, 4
or 8 bytes in size, it is classified as an integer under Windows, which
overrides the Double type in the union and causes the entire record to
still be passed by reference. Additionally, the dummy field has to be
of type Single or Double (or Real); if it is an integral type (e.g.
"array[0..15] of Byte"), it is once again classified as an integer and
overrides the Double type as per the System V ABI's parameter
classification rules (in other words, the entire record would get passed
by reference under both x86_64-win64 and x86_64-linux). Long story
short, this is an absolute minefield!!
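To make the trap concrete, here is a hypothetical variant (type names
invented for illustration) that looks plausible but silently defeats
the whole exercise:

  { Hypothetical counter-example - do NOT use this.  The filler is the  }
  { right size (16 bytes) but integral, so both ABIs classify it as     }
  { INTEGER; that overrides the Double classification from re/im and    }
  { the record ends up being passed by reference after all.             }
  type
    bad_align_dummy = record
      filler: array[0..15] of Byte;
    end;

    bad_complex = record
      case Byte of
        0: (alignment: bad_align_dummy);
        1: (re, im: real);
    end;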
I still seriously think that having an alignment attribute or something
similar would make life so much easier for third-party developers who may
not know the exact quirks of how x86_64 classifies its parameters. To me,
this trick feels incredibly hacky and very hard to get right.
Compiled code isn't perfect though - for example, when moving parameters
to and from the relevant XMM registers, the "movdqa" instruction is used
instead of "movapd", which incurs a performance penalty because the
internal CPU state has to switch between the double-precision and integer
domains (this is why, for example, there are separate VINSERTF128 and
VINSERTI128 instructions, even though they superficially do the same
thing). Additionally, inlined vectorcall routines still seem to fall
back on using movq to transfer 8 bytes at a time between a function
result and wherever it is to be stored, but this is because everything
is decomposed at the node level and the compiler currently lacks any
decent vectorisation algorithms.
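To illustrate the domain issue, here are the two encodings side by side
(a hand-written sketch; whether and how much the penalty bites depends on
the particular microarchitecture):

  movdqa %xmm0,(%rsp)  { integer-domain store: a following FP read of    }
                       { (%rsp) may pay a bypass (domain-crossing) delay }
  movapd %xmm0,(%rsp)  { identical architectural effect, but stays in    }
                       { the double-precision domain - no such delay     }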
Nevertheless, I think I'm ready to prepare a patch for uComplex for
evaluation, and it's given me some things to play with to see if the
compiler can be made to work better with packed data. I figure the
uComplex unit is a good place to start because its complex type is
internally just a pair of Doubles and a lot of its operations, such as
addition, are component-wise.
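For instance, component-wise addition could look something like this (a
sketch only - the actual uComplex declarations may differ):

  operator + (z1, z2: complex) r: complex; vectorcall;
  begin
    { Each component is independent, so an ideal compiler could fuse }
    { both additions into a single "addpd %xmm1,%xmm0" followed by a }
    { ret, with no stack frame at all.                               }
    r.re := z1.re + z2.re;
    r.im := z1.im + z2.im;
  end;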
A bigger challenge would be optimising the modulus of a complex number:
  function cmod(z: complex): real; vectorcall;
  { modulus: r = |z| }
  begin
    with z do
      cmod := sqrt((re * re) + (im * im));
  end;
A perfect compiler with permission to use SSE3 (for haddpd) should
generate the following (note that no stack frame is required):
  mulpd  %xmm0,%xmm0  { calculates "re * re" and "im * im" simultaneously }
  haddpd %xmm0,%xmm0  { adds the two products together (horizontal add) }
  sqrtsd %xmm0,%xmm0
  ret
Currently, with vectorcall, the routine compiles into this:
  leaq   -24(%rsp),%rsp
  movdqa %xmm0,(%rsp)
  movq   %rsp,%rax
  movsd  (%rax),%xmm1
  mulsd  %xmm1,%xmm1
  movsd  8(%rax),%xmm0
  mulsd  %xmm0,%xmm0
  addsd  %xmm1,%xmm0
  sqrtsd %xmm0,%xmm0
  leaq   24(%rsp),%rsp
  ret
And without vectorcall (or an unaligned record type):
  leaq   -24(%rsp),%rsp
  movq   %rcx,%rax
  movq   (%rax),%rdx
  movq   %rdx,(%rsp)
  movq   8(%rax),%rax
  movq   %rax,8(%rsp)
  movq   %rsp,%rax
  movsd  (%rax),%xmm1
  mulsd  %xmm1,%xmm1
  movsd  8(%rax),%xmm0
  mulsd  %xmm0,%xmm0
  addsd  %xmm1,%xmm0
  sqrtsd %xmm0,%xmm0
  leaq   24(%rsp),%rsp
  ret
Maybe I'm in the minority here, and I'm definitely getting ahead of
myself, but seeing ways of improving the compiled assembly language
excites me! Even without vectorcall, I want to see if I can get my deep
optimiser into a workable form, because sequences like "movq %rsp,%rax"
followed by reads through %rax are completely unnecessary. Also, consider
things like this:
  ...
  movdqa %xmm0,(%rsp)
  movq   %rsp,%rax
  movsd  (%rax),%xmm1
  ...
Just... why?! Just do "movsd %xmm0,%xmm1"!! The peephole optimiser may
struggle to spot this as it stands because of the inefficient mixing of
integer and floating-point XMM instructions - and of course, it might be
the case that the original contents of %xmm0 are needed later. This is
where my deep optimiser, or some other form of data-flow analysis, would
come into play. Playing the logical flow through in my head, I can see it
optimising the triplet as follows:
1. Notice that %rax = %rsp and rewrite the movsd instruction accordingly
to minimise a pipeline stall (the later "movsd 8(%rax),%xmm0" instruction
would be rewritten too):

  ...
  movdqa %xmm0,(%rsp)
  movq   %rsp,%rax
  movsd  (%rsp),%xmm1
  ...
2. Notice that %rax is now never used, so "movq %rsp,%rax" can be safely
removed:

  ...
  movdqa %xmm0,(%rsp)
  movsd  (%rsp),%xmm1
  ...
3. Notice that what's being read from the stack is equal to %xmm0 at this
point, so read from %xmm0 directly to prevent a pipeline stall:

  ...
  movdqa %xmm0,(%rsp)
  movsd  %xmm0,%xmm1
  ...
It might not be able to remove the movdqa instruction because a later
instruction reads from 8(%rsp), but vectorisation improvements will help
mitigate this.
Okay, enough theorising, but I think my contagious enthusiasm is back!
Gareth aka. Kit