[fpc-devel] Finer control of byte alignment

Sat Dec 30 22:20:49 CET 2017

Hi everyone,

I can't remember whether this has already been raised on the fpc-devel mailing list, but as discussed under 
Feature Ideas ( http://wiki.lazarus.freepascal.org/Feature_Ideas#Per-type_Byte_Alignment ) and in bug report 
32780 ( https://bugs.freepascal.org/view.php?id=32780 ), I would like to propose the ability to control the 
byte alignment of variables more finely, on a per-type basis.  While alignment can be controlled to an extent 
with compiler directives, doing so is somewhat messy and might cause conflicts in some situations, especially 
where third-party modules are concerned.  Such a feature would be extremely useful for the Intel SSE and AVX 
extensions, where accessing memory that isn't aligned to a 16-byte boundary incurs a performance penalty (or 
an outright fault with aligned-only instructions such as MOVAPS), and, as hinted in the bug report, it would 
allow C-style inline intrinsics to be fully supported (at least on Linux; Free Pascal will need "vectorcall" 
support on Windows as well).
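
For comparison, here is a minimal sketch of the directive-based approach as it stands today. It assumes the 
existing {$CODEALIGN VARMIN=...} directive is honoured for global variables on the target in question; the 
program and the names TVec4, A, B and C are made up purely for illustration. Note that the directive affects 
every variable declared after it, not just the vector types, which is the messiness referred to above.

```pascal
program AlignDirectiveDemo;

{$mode objfpc}
{ Assumption: raises the minimum alignment of ALL following global variables to 16 bytes. }
{$codealign varmin=16}

type
  TVec4 = array[0..3] of Single; { hypothetical stand-in for a 4-float vector }

var
  A: TVec4;
  B: Byte;  { also affected by the directive, wanted or not }
  C: TVec4;

begin
  B := 0;
  { Print each address modulo 16; a remainder of 0 means 16-byte aligned. }
  WriteLn('A mod 16 = ', PtrUInt(@A) mod 16);
  WriteLn('B mod 16 = ', PtrUInt(@B) mod 16);
  WriteLn('C mod 16 = ', PtrUInt(@C) mod 16);
end.
```

With a per-type "align" the same guarantee would travel with the type itself instead of depending on which 
directives happen to be active in the unit that declares the variable.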

Apparently Delphi supports such a feature, but it is not well documented (can someone confirm this?).  For 
syntax, I would recommend what Delphi apparently uses, which is to append "align #" after the type 
definition, with # being a power of 2.

type
  AlignedSingle = Single align 16; { Single is aligned to at least a 4-byte boundary because of its size, 
but an AlignedSingle will be on a 16-byte boundary }

  AlignedDouble = Double align 16; { Double is aligned to at least an 8-byte boundary because of its size, 
but an AlignedDouble will be on a 16-byte boundary }

  M128 = packed record
    case Integer of
    0: (Scalar: Single);
    1: (X, Y, Z, W: Single);
    2: (E: array[0..3] of Single);
  end align 16;

  TVector4f = M128; { Is also aligned to a 16-byte boundary because M128 is }

  TVectorArray = array of M128 align 32; { Stricter alignment than M128, so should be fine, although this 
should probably align Var[0] to the 32-byte boundary rather than Var itself. One would use such a definition 
when loading the data into YMM registers, whose aligned loads and stores (e.g. VMOVAPS) require a 32-byte 
boundary, but where there may be an odd number of vectors }
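
Without such a feature, getting element 0 of a heap-allocated array onto a 32-byte boundary has to be done 
by hand. The following is only a sketch of the usual over-allocate-and-round-up trick; AlignUp, TVec4, 
VecCount and BufferAlign are hypothetical names, not anything in the RTL.

```pascal
program AlignedBufferDemo;

{$mode objfpc}

type
  TVec4 = array[0..3] of Single; { hypothetical stand-in for M128 }
  PVec4 = ^TVec4;

{ Rounds P up to the next multiple of Alignment, which must be a power of 2. }
function AlignUp(P: Pointer; Alignment: PtrUInt): Pointer;
begin
  Result := Pointer((PtrUInt(P) + (Alignment - 1)) and not (Alignment - 1));
end;

const
  VecCount = 3;     { an odd number of vectors, as in the comment above }
  BufferAlign = 32; { boundary wanted for element 0 }

var
  Raw: Pointer;
  First: PVec4;

begin
  { Over-allocate by BufferAlign - 1 bytes so that an aligned start always fits. }
  GetMem(Raw, VecCount * SizeOf(TVec4) + BufferAlign - 1);
  try
    First := PVec4(AlignUp(Raw, BufferAlign));
    First^[0] := 1.0;
    { Prints 0: the first element now starts on a 32-byte boundary. }
    WriteLn(PtrUInt(First) mod BufferAlign);
  finally
    FreeMem(Raw); { free the original pointer, not the adjusted one }
  end;
end.
```

Having to keep both the raw and the adjusted pointer around is exactly the kind of bookkeeping that 
"array of M128 align 32" would hide.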

There are some nuances to consider, namely where typecasting is concerned (while typecasting from an aligned 
to an unaligned type is fine, what about the reverse? And let's not get started with pointers to such 
types!), and since this talks about Intel x86 and x86-64 in particular, what happens if it's used on another 
platform?  Are there other platforms that support memory alignment, and if not, should a warning or error be 
raised for the appearance of 'align', or should it be ignored?
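
To illustrate the typecasting concern: going from an unaligned to an aligned type is only safe if the 
address happens to sit on the boundary, which today can only be checked at run time. A minimal sketch 
follows; IsAligned is a hypothetical helper, not an existing RTL function.

```pascal
program AlignCheckDemo;

{$mode objfpc}

{ Returns True if P is a multiple of Alignment, which must be a power of 2. }
function IsAligned(P: Pointer; Alignment: PtrUInt): Boolean;
begin
  Result := (PtrUInt(P) and (Alignment - 1)) = 0;
end;

var
  Buffer: array[0..63] of Byte;

begin
  { Whether the start of Buffer is 16-byte aligned depends on where the compiler placed it. }
  if IsAligned(@Buffer[0], 16) then
    WriteLn('Buffer[0]: safe to treat as a 16-byte aligned vector')
  else
    WriteLn('Buffer[0]: unaligned, fall back to unaligned loads such as MOVUPS');

  { Two addresses 4 bytes apart can never both be 16-byte aligned, so this prints FALSE. }
  WriteLn(IsAligned(@Buffer[0], 16) and IsAligned(@Buffer[4], 16));
end.
```

A cast from Single to AlignedSingle would either need a check like this inserted by the compiler, or simply 
be rejected (or warned about) at compile time.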

Just as an example of an intrinsic (on 64-bit Linux) so one can take full advantage of SSE and AVX (because 
there will always be cases where the compiler won't create the best machine code no matter how good it is):

function _sse_addps(Input1, Input2: M128): M128; assembler; nostackframe; inline;
asm
  ADDPS XMM0, XMM1 { Intel syntax - AT&T would be "addps %xmm1,%xmm0" }
end;

Mind you, when it comes to inlining these intrinsics, it depends on how smart the compiler is at assigning 
variables to the XMM registers and switching them around in the intrinsic subroutines.  For Windows, 
"vectorcall" is required (see https://bugs.freepascal.org/view.php?id=32781 and 
http://wiki.lazarus.freepascal.org/Feature_Ideas#.22vectorcall.3B.22_modifier_for_Win32_and_Win64 for more 
information) because the standard Microsoft calling convention does not take full advantage of the XMM and 
YMM registers.  For 32-bit Linux I'm not sure what the best approach would be, except perhaps to adopt 
Microsoft's "vectorcall", simply because there doesn't seem to be another appropriate 32-bit standard.

Gareth aka. Kit