[fpc-devel] Question on updating FPC packages

Sun Oct 27 11:50:38 CET 2019

Ideally, you should specify 'vectorcall' when interfacing with 
third-party libraries, when the compiler can vectorise the code, or 
when vectorising it yourself in assembly language.  For example, if I 
wanted to write the cmod function in x86_64 assembler (Intel notation):

function cmod(z: Complex): Double; vectorcall; assembler; nostackframe;
asm
   MULPD XMM0, XMM0
   HADDPD XMM0, XMM0
   SQRTSD XMM0, XMM0
end;

Without vectorcall (or with an unaligned type), where each field is 
passed in a separate register, the code would instead be:

function cmod(z: Complex): Double; assembler; nostackframe;
asm
   MULSD XMM0, XMM0
   MULSD XMM1, XMM1
   ADDSD XMM0, XMM1
   SQRTSD XMM0, XMM0
end;
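
For reference, both assembler versions compute the same thing as the plain Pascal below, which is roughly what the compiler would have to vectorise on its own.  This is only a sketch: TComplexSketch and CmodRef are hypothetical names standing in for the Complex type and cmod above, and the field names re and im are assumed.

```pascal
program CmodSketch;
{$mode objfpc}

type
  { Hypothetical stand-in for the Complex type discussed above;
    the field names re and im are an assumption }
  TComplexSketch = record
    re, im: Double;
  end;

{ Plain-Pascal equivalent of both assembler versions: sqrt(re^2 + im^2) }
function CmodRef(const z: TComplexSketch): Double;
begin
  Result := Sqrt(z.re * z.re + z.im * z.im);
end;

var
  z: TComplexSketch;
begin
  z.re := 3.0;
  z.im := 4.0;
  WriteLn(CmodRef(z):0:1);  { 3-4-5 triangle, prints 5.0 }
end.
```

The squaring step is the MULPD (or the pair of MULSD), the addition is the HADDPD (or ADDSD), and Sqrt is the SQRTSD.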

Admittedly the advantages are more obvious when using arrays of 
Singles.  I guess a good example would be a 4-component dot product (I 
know there's a dot product instruction in SSE4, but I'm ignoring it for 
now):

type
   TVector4 = record
     x, y, z, w: Single;
   end align 16; { hey, I can dream! }

function DotProduct(V: TVector4): Single; vectorcall; assembler; nostackframe;
asm
   MULPS XMM0, XMM0
   HADDPS XMM0, XMM0
   HADDPS XMM0, XMM0
   { Only the first component of XMM0 is considered for the result }
end;

And without vectorcall (or with an unaligned type):

function DotProduct(V: TVector4): Single; assembler; nostackframe;
asm
   MULSS XMM0, XMM0
   MULSS XMM1, XMM1
   MULSS XMM2, XMM2
   MULSS XMM3, XMM3
   ADDSS XMM0, XMM1
   ADDSS XMM0, XMM2
   ADDSS XMM0, XMM3
end;
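
To make the difference between the two assembler versions concrete, here is a plain Pascal sketch of the same single-argument DotProduct (the vector dotted with itself).  The names TVector4Sketch, SelfDotPairwise and SelfDotSequential are hypothetical; the first function adds in the pairwise order that the two HADDPS steps produce, the second in the left-to-right order of the ADDSS sequence.

```pascal
program DotSketch;
{$mode objfpc}

type
  { Same fields as TVector4 above, without the wished-for "align 16" }
  TVector4Sketch = record
    x, y, z, w: Single;
  end;

{ Pairwise order, mirroring MULPS followed by two HADDPS steps }
function SelfDotPairwise(const V: TVector4Sketch): Single;
begin
  Result := (V.x * V.x + V.y * V.y) + (V.z * V.z + V.w * V.w);
end;

{ Left-to-right order, mirroring the MULSS/ADDSS sequence }
function SelfDotSequential(const V: TVector4Sketch): Single;
begin
  Result := ((V.x * V.x + V.y * V.y) + V.z * V.z) + V.w * V.w;
end;

var
  V: TVector4Sketch;
begin
  V.x := 1.0;
  V.y := 2.0;
  V.z := 3.0;
  V.w := 4.0;
  WriteLn(SelfDotPairwise(V):0:1);    { 1 + 4 + 9 + 16, prints 30.0 }
  WriteLn(SelfDotSequential(V):0:1);  { prints 30.0 }
end.
```

The two orders agree exactly for these inputs, but because floating-point addition is not associative they can differ in the last bit for other values.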

It's hard to say which function is more efficient here, given the 
latency of HADDPS and the multiple execution ports available (usually 
you can do at least two independent vector multiplications 
simultaneously), but the overhead of moving each field to a separate 
register will definitely add up.  At the very least though, for the 
first dot product example, if the compiler were able to produce such 
assembler from Pascal source, it would be much more efficient to inline 
because it only uses a single register throughout.  I'm not sure how 
the compiler would know to inline a function once it has reached the 
assembler stage though, even if the registers are still virtual.

To get back to the subject at hand... the advantages of vectorcall.  
Microsoft Visual C++ does have a compiler option where it automatically 
sets the calling convention to "vectorcall" rather than the default 
Microsoft calling convention (which is based off "fastcall"), since in 
most cases with integers, pointers and individual floating-point 
parameters, vectorcall doesn't behave any differently.  FPC would only 
be able to take full advantage of vectorcall and aligned types under 
Linux if the compiler were better at vectorising code.

As a side-note, I would like to propose adding the "fastcall" calling 
convention for i386-win32 and x86_64-win64 (and maybe other i386 and 
x86_64 platforms).  Under Win32, fastcall uses ECX and EDX for its first 
two parameters and EAX for the result.  It's a worse form of Pascal's 
default 'register' convention, but it was designed in the days when 
C++ functions pushed all their parameters to the stack.  Under Win64 it 
would be equivalent to 'ms_abi_default' and force the default Microsoft 
calling convention regardless of whether there was a setting to default 
to vectorcall.  (I consider the default calling convention to be based 
off fastcall because it uses RCX and RDX for its first two parameters, 
then adds R8 and R9 for the next two, and the XMM registers for 
floating-point arguments.)  More than anything it would just help to 
interface with third-party libraries again.

Gareth aka. Kit

On 27/10/2019 08:02, Florian Klämpfl wrote:

> On 27.10.19 at 07:32, J. Gareth Moreton wrote:
>> I guess you're right.  It just seems weird because the System V ABI 
>> was designed from the start to use the XMM registers fully, so long as 
>> the data is aligned.  In effect, it had vectorcall wrapped into its 
>> design from the start. Granted, vectorcall has some advantages and 
>> can deal with relatively complex aggregates that the System V ABI 
>> cannot handle (for example, a record type that contains a normal 
>> vector and information relating to bump mapping).
>>
>> I just hoped that making updates to uComplex, while ensuring existing 
>> Pascal code still compiles, would help take advantage of modern ABI 
>> designs.
>
> Is there currently any example which shows that vectorcall has any 
> advantage with FPC? Else I would propose first to make FPC able to 
> take advantage of it and then talk about if we really add vectorcall. 
> Currently I fear, FPC gets only into trouble when using vectorcall as 
> it tries first to push everything into one xmm register and then 
> splits this again in the callee.
