[fpc-devel] Question on updating FPC packages
J. Gareth Moreton
gareth at moreton-family.com
Sun Oct 27 11:50:38 CET 2019
Ideally, you should specify 'vectorcall' in one of three situations:
when interfacing with third-party libraries, when the code can be
vectorised by the compiler, or when writing the vectorised code yourself
in assembly language. For example, if I wanted to write the cmod
function in x86_64 assembler (Intel syntax):
function cmod(z: Complex): Double; vectorcall; assembler; nostackframe;
asm
MULPD XMM0, XMM0
HADDPD XMM0, XMM0
SQRTSD XMM0, XMM0
end;
Without vectorcall (or with an unaligned type), each field would be
passed in a separate register, and the code would instead be:
function cmod(z: Complex): Double; assembler; nostackframe;
asm
MULSD XMM0, XMM0
MULSD XMM1, XMM1
ADDSD XMM0, XMM1
SQRTSD XMM0, XMM0
end;
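For reference, both assembler versions compute the modulus
sqrt(re^2 + im^2). Below is a minimal plain-Pascal sketch that could be
used to check them. TComplexD and CModRef are names made up for this
example; it deliberately does not use uComplex's own Complex type or
cmod, so that it compiles on its own.

```pascal
program CModReference;

{$mode objfpc}

type
  { Hypothetical stand-in for a complex number: two Doubles. }
  TComplexD = record
    re, im: Double;
  end;

{ Plain-Pascal reference: sqrt(re*re + im*im), the value both the
  packed (MULPD/HADDPD/SQRTSD) and scalar (MULSD/ADDSD/SQRTSD)
  sequences are meant to produce. }
function CModRef(const z: TComplexD): Double;
begin
  Result := Sqrt(z.re * z.re + z.im * z.im);
end;

var
  z: TComplexD;
begin
  z.re := 3.0;
  z.im := 4.0;
  WriteLn(CModRef(z):0:1); { prints 5.0 }
end.
```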
Admittedly the advantages are more obvious when using arrays of
Singles. I guess a good example would be a 4-component dot product (I
know there's a dot product instruction in SSE4, but I'm ignoring it for
now):
type
TVector4 = record
x, y, z, w: Single;
end align 16; { hey, I can dream! }
function DotProduct(V: TVector4): Single; vectorcall; assembler;
nostackframe;
asm
MULPS XMM0, XMM0
HADDPS XMM0, XMM0
HADDPS XMM0, XMM0
{ Only the first component of XMM0 is considered for the result }
end;
And without vectorcall (or with an unaligned type):
function DotProduct(V: TVector4): Single; assembler;
nostackframe;
asm
MULSS XMM0, XMM0
MULSS XMM1, XMM1
MULSS XMM2, XMM2
MULSS XMM3, XMM3
ADDSS XMM0, XMM1
ADDSS XMM0, XMM2
ADDSS XMM0, XMM3
end;
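To make the packed version easier to follow, here is a rough
plain-Pascal model of what MULPS followed by two HADDPS instructions
does to the four lanes. Note that the function above takes a single
vector, so it really computes V.V (the squared length). TLanes, MulPS
and HAddPS are hypothetical helpers written for this sketch only; they
model the instructions and are not part of any FPC unit.

```pascal
program HaddDot;

{$mode objfpc}

type
  { Models one XMM register holding four Singles. }
  TLanes = array[0..3] of Single;

{ Models MULPS dst, src: lane-wise multiplication. }
function MulPS(a, b: TLanes): TLanes;
var
  i: Integer;
begin
  for i := 0 to 3 do
    Result[i] := a[i] * b[i];
end;

{ Models HADDPS dst, src: adjacent pairs of dst fill the two low
  lanes, adjacent pairs of src fill the two high lanes. }
function HAddPS(dst, src: TLanes): TLanes;
begin
  Result[0] := dst[0] + dst[1];
  Result[1] := dst[2] + dst[3];
  Result[2] := src[0] + src[1];
  Result[3] := src[2] + src[3];
end;

var
  v, p, h1, h2: TLanes;
begin
  v[0] := 1; v[1] := 2; v[2] := 3; v[3] := 4;
  p  := MulPS(v, v);     { [1, 4, 9, 16] }
  h1 := HAddPS(p, p);    { [5, 25, 5, 25] }
  h2 := HAddPS(h1, h1);  { [30, 30, 30, 30] }
  WriteLn(h2[0]:0:1);    { prints 30.0 }
end.
```

After the second horizontal add every lane holds the full sum, which is
why only the first component needs to be read for the result.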
It's hard to say which function is more efficient here, due to the
latency of HADDPS and the multiple execution ports available (usually
you can do at least two independent vector multiplications
simultaneously), but the overhead of moving each field to a separate
register will definitely add up. At the very least though, for the
first dot product example, if the compiler were able to produce such
assembler from Pascal source, it would be much more efficient to inline
because it only uses a single register throughout. I'm not sure how the
compiler would know to inline a function once it has reached the
assembler stage though, even if the registers are still virtual.
To get back to the subject at hand... the advantages of vectorcall.
Microsoft Visual C++ does have a compiler option where it automatically
sets the calling convention to "vectorcall" rather than the default
Microsoft calling convention (which is based on "fastcall"), since in
most cases with integers, pointers and individual floating-point
parameters, vectorcall doesn't behave any differently. FPC would only
be able to take full advantage of vectorcall and aligned types under
Linux if the compiler became better at vectorising code.
As a side-note, I would like to propose adding the "fastcall" calling
convention for i386-win32 and x86_64-win64 (and maybe other i386 and
x86_64 platforms). Under Win32, fastcall uses ECX and EDX for its first
two parameters and EAX for the result. It's a worse form of Pascal's
default 'register' convention, but it was designed in the days when
C++ functions pushed all their parameters to the stack. Under Win64 it
would be equivalent to 'ms_abi_default' and force the default Microsoft
calling convention regardless of whether there was a setting to default
to vectorcall. (I consider the default calling convention to be based
on fastcall because it uses RCX and RDX for its first two parameters,
then adds R8 and R9 for the next two, and the XMM registers for
floating-point arguments.) More than anything, it would just help to
interface with third-party libraries again.
Gareth aka. Kit
On 27/10/2019 08:02, Florian Klämpfl wrote:
> Am 27.10.19 um 07:32 schrieb J. Gareth Moreton:
>> I guess you're right. It just seems weird because the System V ABI
>> was designed from the start to use the XMM registers fully, so long as
>> the data is aligned. In effect, it had vectorcall wrapped into its
>> design from the start. Granted, vectorcall has some advantages and
>> can deal with relatively complex aggregates that the System V ABI
>> cannot handle (for example, a record type that contains a normal
>> vector and information relating to bump mapping).
>>
>> I just hoped that making updates to uComplex, while ensuring existing
>> Pascal code still compiles, would help take advantage of modern ABI
>> designs.
>
> Is there currently any example which shows that vectorcall has any
> advantage with FPC? Else I would propose first to make FPC able to
> take advantage of it and then talk about if we really add vectorcall.
> Currently I fear, FPC gets only into trouble when using vectorcall as
> it tries first to push everything into one xmm register and then
> splits this again in the callee.