[fpc-devel] Difficulty in specifying record alignment

Mon Oct 21 21:00:49 CEST 2019

On 21.10.19 at 00:57, J. Gareth Moreton wrote:
> Hi everyone,
> 
> I'm trying to make some optimisation improvements to UComplex so the 
> compiler can take advantage of SSE2 or AVX features without needing to 
> write specialised code (other than using the "vectorcall" directive 
> under Win64).  I am having some difficulty though.
> 
> The record type "complex" is defined as follows:
> 
> type complex = record
>                       re : real;
>                       im : real;
> end;
> 
> (Real is equivalent to Double on x86_64)
> 
> This also corresponds with how a complex number is defined for Extended 
> Pascal.  Currently, when compiled under x86_64-win64, the fields are 
> placed on 8-byte boundaries, but because the type as a whole is also on 
> an 8-byte boundary (not 16-byte), the compiler cannot take advantage of 
> the XMM registers when passing such a construct as a parameter or return 
> value, and hence has to pass it by reference.  For high-speed scientific 
> programming, this quickly adds up to a notable penalty.  For example, 
> the compiled assembly language for adding together two complex numbers 
> on x86_64-win64 ("Z := Z + X;"):
> 
>      movsd    U_$P$COMPLEX_$$_Z(%rip),%xmm0
>      addsd    U_$P$COMPLEX_$$_X(%rip),%xmm0
>      movsd    %xmm0,40(%rsp)
>      movsd    U_$P$COMPLEX_$$_Z+8(%rip),%xmm0
>      addsd    U_$P$COMPLEX_$$_X+8(%rip),%xmm0
>      movsd    %xmm0,48(%rsp)
>      movq    40(%rsp),%rax
>      movq    %rax,U_$P$COMPLEX_$$_Z(%rip)
>      movq    48(%rsp),%rax
>      movq    %rax,U_$P$COMPLEX_$$_Z+8(%rip)
> 
> Even if the reads and writes to memory cannot be removed, treating the 
> complex data type as an aligned array of doubles should be able to yield 
> far more efficient code (might require some compiler quirks so it 
> detects the component-wise addition in the inlined + operator for the 
> complex type):
> 
>      movapd   U_$P$COMPLEX_$$_Z(%rip),%xmm0
>      addpd    U_$P$COMPLEX_$$_X(%rip),%xmm0
>      movapd   %xmm0,U_$P$COMPLEX_$$_Z(%rip)
> 
> The problem here is that there's no practical way to force the entire 
> record's alignment onto a 16-byte boundary (a requirement for 
> "vectorcall") without also snapping each individual field to such a 
> boundary.  Strictly speaking, I don't think the 16-byte boundary is a 
> requirement for the System V ABI (the Unix calling convention for 64-bit 
> Intel processors), 

The stack is 16 byte aligned, aligning data is up to the compiler.

> and there are unaligned move instructions to 
> accommodate this (which have traditionally been slightly slower than 
> the aligned counterparts), but currently the Free Pascal Compiler 
> demands the alignment, mainly because of shared compiler code between 
> Windows and non-Windows builds.

Each target can have its own alignment requirements.

> 
> The only way to enforce a construct where the record is on a 16-byte 
> boundary but the two 8-byte fields are packed is to use an array 
> element, e.g.:
> 
>    {$push}
>    {$codealign RECORDMIN=16}
> type complex = record
>                       part: array[0..1] of real;
> end;
>    {$pop}
> 
> Mapping "re" to "part[0]" and "im" to "part[1]" using a union is 
> impossible because "im" will be put on the next 16-byte boundary and be 
> its own separate entity.  Other constructs such as nested unions are 
> possible, but this will break backward compatibility with code that uses 
> the uComplex unit.
> 
> A while ago I requested a means to specify an alignment on a per-type 
> basis so it is easier for third-party programmers to take advantage of 
> the extra efficiency brought about by vectorcall and the System V ABI: 
> https://bugs.freepascal.org/view.php?id=32780 - this effectively boils 
> down to being able to define something akin to the following:
> 
> type complex = record
>                       re : real;
>                       im : real;
> end {$ifdef CPUX86_64} align 16 {$endif CPUX86_64};
> 
> It was assigned to Maciej last year, but hasn't seen any progress since.
> 
> If not that alignment feature, is there any other way to cleanly enforce 
> a 16-byte boundary for such a packed type without having to completely 
> redesign it to the point that it breaks compatibility?

What's the problem with

{$push}
{$codealign RECORDMIN=16}
type complex = record
                       re : real;
                       im : real;
end;
{$pop}

?
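
Whether this declaration keeps the two fields packed or moves "im" to the next 16-byte boundary can be checked directly on a given target. The following is only a minimal sketch, assuming a Free Pascal compiler that accepts the directive as written above. The program name "alignprobe" and the variable "z" are made up for illustration, and no particular output is claimed, since the result is exactly what is in question here.

```pascal
program alignprobe;  { hypothetical name, for illustration only }

{$push}
{$codealign RECORDMIN=16}
type
  complex = record
    re : real;
    im : real;
  end;
{$pop}

var
  z : complex;  { hypothetical global used to probe the layout }

begin
  { Total size: 16 if the fields stay packed, larger if they are padded. }
  writeln('SizeOf(complex) = ', SizeOf(complex));

  { Byte offset of each field from the start of the record. }
  writeln('offset of re    = ', PtrUInt(@z.re) - PtrUInt(@z));
  writeln('offset of im    = ', PtrUInt(@z.im) - PtrUInt(@z));

  { 0 here means this particular variable sits on a 16-byte boundary. }
  writeln('address mod 16  = ', PtrUInt(@z) mod 16);
end.
```

An offset of 8 for "im" together with a size of 16 would mean the fields remain packed. An offset of 16 would match the behaviour described in the original mail. The last line only reports on this one global variable, so it says nothing about parameters or locals.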