[fpc-devel] FillWord, FillDWord and FillQWord are very poorly optimised on Win64 (not sure about x86-64 on Linux)
J. Gareth Moreton
gareth at moreton-family.com
Wed Nov 1 18:23:28 CET 2017
Thanks for the feedback everyone. I wasn't sure about internal functions because FillWord, for example, is
surrounded by "{$ifndef FPC_SYSTEM_HAS_FILLWORD}", which isn't defined under Win64, whereas
FPC_SYSTEM_HAS_FILLCHAR is defined and the implementation of FillChar is nowhere to be found when you try to
search for it in Lazarus.
For the speed-optimised assembler routines, I have the following (which does borrow ideas from FillChar):
procedure SpeedOptimisedFillWord(var x; count: SizeInt; Value: Word); assembler; nostackframe;
asm
{ RCX = Pointer to x
RDX = Count
R8W = Value }
PUSH RDI
MOVZX RAX, R8W
MOV R9, $0001000100010001
MOV RDI, RCX
IMUL RAX, R9
{ Do some memory alignment first (it should be at least aligned to a 16-bit boundary already).
Note that up to three words are written here without checking RDX, so this assumes count >= 3 }
AND CL, $6
JZ @Aligned8
TEST CL, $2
JZ @Aligned4
MOV [RDI], R8W
DEC RDX
ADD RDI, $2
TEST CL, $4
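JNZ @Aligned8 { Not zero means bit 2 of the original address was set: the block started 6 bytes past an
8-byte boundary, so the word write above has already aligned it. If ZF is set, it started 2 bytes past
the boundary and still needs the DWord write below }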
@Aligned4:
MOV [RDI], EAX
SUB RDX, $2
ADD RDI, $4
@Aligned8:
MOV R10B,DL
SHR RDX, 2
AND R10B,$3
MOV RCX, RDX
CMP RDX, $80000
JB @NoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
SHR RDX, 2
AND RCX, $3
{ Write 32 bytes at a time using a non-temporal hint }
@BlockLoop:
ADD RDI, $20
MOVNTI [RDI-$20], RAX
MOVNTI [RDI-$18], RAX
DEC RDX
MOVNTI [RDI-$10], RAX
MOVNTI [RDI-$8], RAX
JNZ @BlockLoop
MFENCE
@NoBlocks:
SHR R10B, 1
REP STOSQ
JNC @NoLooseWord
MOV [RDI], R8W
LEA RDI, [RDI+2]
@NoLooseWord:
JZ @NoLooseDWord
MOV [RDI], EAX
@NoLooseDWord:
POP RDI
end;
procedure SpeedOptimisedFillDWord(var x; count: SizeInt; Value: DWord); assembler; nostackframe;
asm
{ RCX = Pointer to x
RDX = Count
R8D = Value }
PUSH RDI
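MOV EAX, R8D { A 32-bit move zero-extends into RAX; the upper half of R8 is not guaranteed to be zero on entry }
MOV RDI, RCX
MOV R8, RAX
SHL RAX, 32
OR RAX, R8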
{ Do some memory alignment first (it should be at least aligned to a 32-bit boundary already).
Note that one DWord may be written here without checking RDX, so this assumes count >= 1 }
AND CL, $4
JZ @Aligned8
MOV [RDI], R8D
DEC RDX
ADD RDI, $4
@Aligned8:
SHR RDX, 1
SETC R10B
MOV RCX, RDX
CMP RDX, $80000
JB @NoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
SHR RDX, 2
AND RCX, $3
{ Write 32 bytes at a time using a non-temporal hint }
@BlockLoop:
ADD RDI, $20
MOVNTI [RDI-$20], RAX
MOVNTI [RDI-$18], RAX
DEC RDX
MOVNTI [RDI-$10], RAX
MOVNTI [RDI-$8], RAX
JNZ @BlockLoop
MFENCE
@NoBlocks:
TEST R10B, R10B
REP STOSQ
JZ @NoLooseDWord
MOV [RDI], EAX
@NoLooseDWord:
POP RDI
end;
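Both routines above start by replicating Value across all 64 bits of RAX so that each STOSQ or MOVNTI stores a full pattern. A minimal Pascal sketch of the same arithmetic follows; Broadcast16 and Broadcast32 are hypothetical helper names used only for illustration and are not part of the RTL.

```pascal
program BroadcastDemo;
{$mode objfpc}

{ Hypothetical helper: mirrors MOVZX RAX, R8W / IMUL RAX, $0001000100010001.
  Multiplying by a constant with a 1 in each 16-bit lane copies the word
  into all four lanes; the product never exceeds 64 bits. }
function Broadcast16(Value: Word): QWord;
begin
  Result := QWord(Value) * QWord($0001000100010001);
end;

{ Hypothetical helper: mirrors the shift-and-OR used for DWord values.
  Value must be zero-extended first, which the QWord cast does here. }
function Broadcast32(Value: DWord): QWord;
begin
  Result := (QWord(Value) shl 32) or QWord(Value);
end;

begin
  WriteLn(HexStr(Broadcast16($ABCD), 16));
  WriteLn(HexStr(Broadcast32($DEADBEEF), 16));
end.
```

The multiply is convenient for words because one instruction fills four lanes; for DWords a single shift and OR does the same job.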
procedure SpeedOptimisedFillQWord(var x; count: SizeInt; Value: QWord); assembler; nostackframe;
asm
{ RCX = Pointer to x
RDX = Count
R8 = Value }
PUSH RDI
CMP RDX, $80000
MOV RDI, RCX
MOV RCX, RDX
JB @NoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
AND RCX, $3
SHR RDX, 2
JZ @NoBlocks
{ Write 32 bytes at a time using a non-temporal hint }
@BlockLoop:
ADD RDI, $20
MOVNTI [RDI-$20], R8
MOVNTI [RDI-$18], R8
DEC RDX
MOVNTI [RDI-$10], R8
MOVNTI [RDI-$8], R8
JNZ @BlockLoop
MFENCE
@NoBlocks:
MOV RAX, R8
REP STOSQ
POP RDI
end;
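Routines with an alignment prologue and a separate tail are easiest to trust once they have been checked against a plain loop for small counts and every starting alignment. The following is a rough sketch of such a check, not something taken from the RTL test suite: TFillWordProc, ReferenceFillWord and CheckFillWord are hypothetical names, and the buffer sizes are arbitrary. It is shown running against the reference loop; pass @SpeedOptimisedFillWord instead to exercise the assembler version.

```pascal
program FillCheck;
{$mode objfpc}

type
  { Hypothetical procedural type matching the FillWord signature }
  TFillWordProc = procedure(var x; count: SizeInt; Value: Word);

const
  Guard    = $5A5A;  { pattern that must survive outside the filled range }
  Pattern  = $1234;  { value passed to the fill routine }
  MaxCount = 64;
  Pad      = 8;      { guard words in front of the filled range }

{ Plain reference loop: does nothing for count <= 0 }
procedure ReferenceFillWord(var x; count: SizeInt; Value: Word);
var
  p: PWord;
  i: SizeInt;
begin
  p := @x;
  for i := 0 to count - 1 do
    p[i] := Value;
end;

{ Returns True if Fill writes exactly Count words for every tested case }
function CheckFillWord(Fill: TFillWordProc): Boolean;
var
  Buf: array[0..MaxCount + 2 * Pad + 3] of Word;
  Offset, Count, i: SizeInt;
begin
  Result := True;
  { Word offsets 0..3 cover every 2-byte phase within an 8-byte window }
  for Offset := 0 to 3 do
    for Count := 0 to MaxCount do
    begin
      for i := Low(Buf) to High(Buf) do
        Buf[i] := Guard;
      Fill(Buf[Pad + Offset], Count, Pattern);
      for i := Low(Buf) to High(Buf) do
        if (i >= Pad + Offset) and (i < Pad + Offset + Count) then
        begin
          if Buf[i] <> Pattern then
            Exit(False);   { a word inside the range was missed }
        end
        else if Buf[i] <> Guard then
          Exit(False);     { a word outside the range was overwritten }
    end;
end;

begin
  if CheckFillWord(@ReferenceFillWord) then
    WriteLn('FillWord check: OK')
  else
    WriteLn('FillWord check: FAILED');
end.
```

MaxCount would need to be far larger to reach the non-temporal path, which only starts at $80000 quadwords; this sketch only covers the alignment and tail handling.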
Regarding the CFI annotations, these functions are actually simpler under Linux x64: RDI is volatile there,
so it doesn't need to be pushed and popped, and those were the only operations that modified the stack
pointer. Since the routines above don't call any other procedures, we can use "nostackframe" safely.
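To illustrate the point about the Linux calling convention, here is a minimal sketch of only the STOSQ path of a QWord fill under the System V AMD64 ABI, where the parameters arrive in RDI, RSI and RDX. FillQWordSysV is a hypothetical name, the non-temporal block loop is deliberately left out, and the count guard is an addition that the Win64 routines above do not have. On other targets the sketch simply falls back to the RTL's FillQWord so that it still compiles.

```pascal
program FillQWordSysVDemo;
{$mode objfpc}

{$if defined(CPUX86_64) and not defined(WIN64)}
{$asmmode intel}
{ Hypothetical System V variant: no PUSH/POP is needed because RDI is
  volatile and already holds the destination pointer. }
procedure FillQWordSysV(var x; count: SizeInt; Value: QWord); assembler; nostackframe;
asm
  { RDI = Pointer to x
    RSI = Count
    RDX = Value }
  MOV   RAX, RDX
  XOR   ECX, ECX    { default to zero iterations }
  TEST  RSI, RSI
  CMOVG RCX, RSI    { use Count only when it is positive }
  REP   STOSQ
end;
{$else}
{ Fallback for other targets: defer to the RTL routine }
procedure FillQWordSysV(var x; count: SizeInt; Value: QWord);
begin
  FillQWord(x, count, Value);
end;
{$endif}

var
  Data: array[0..9] of QWord;
  i: Integer;
begin
  FillQWordSysV(Data, Length(Data), QWord($0123456789ABCDEF));
  for i := Low(Data) to High(Data) do
    if Data[i] <> QWord($0123456789ABCDEF) then
    begin
      WriteLn('mismatch at index ', i);
      Halt(1);
    end;
  WriteLn('ok');
end.
```

Because nothing touches the stack pointer or a nonvolatile register here, no unwind annotations are involved.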
I am tempted to experiment a little further, because one thing that's guaranteed to be present under x64 is
SSE2, so it may be possible to increase the speed even more, although at the same time there may be a
performance penalty if the rest of the application uses AVX or floating-point SSE.
J. Gareth "Kit" Moreton
On Wed 01/11/17 11:03, Sergei Gorelkin via fpc-devel (fpc-devel at lists.freepascal.org) sent:
> 01.11.2017 10:46, Sven Barth via fpc-devel wrote:
> > On 01.11.2017 05:58, "J. Gareth Moreton" (gareth at moreton-family.com) wrote:
> >
> > > Would it be worth opening up a bug report for this, with the attached assembler
> > > routines as suggestions? I haven't worked out how to implement internal functions
> > > into the compiler yet, and I'd rather clear it with you guys first before I make
> > > such an addition. I had a thought that the simple routines above could be used
> > > when compiling for small code size, while larger, more advanced ones are used
> > > when compiling for speed.
> >
> > Improvements like these are always welcome. Two points, however:
> >
> > The Fill* routines are not part of the compiler, but of the RTL (the Pascal routines
> > are in rtl/inc/generic.inc, the assembly ones reside in rtl/CPU/CPU.inc) and they
> > aren't handled differently depending on the current optimization flags, so a
> > one-size-fits-all is needed (look at e.g. the i386 ones).
> >
> > I also think that you might need to handle memory that isn't correctly aligned for
> > the assembler instructions (I didn't look at your routines in detail so I don't know
> > whether they'd need to be adjusted for that). A check of the i386 routines will
> > probably help here as well.
>
> Another important thing to note is that all modifications to the stack pointer and
> nonvolatile registers on x86_64 need SEH annotations on win64 and CFI annotations on
> linux/bsd. The former is available only in AT&T syntax, the latter is not supported.
>
> This requirement, together with different parameter locations, makes writing assembler
> routines for x86_64 much more complicated than for i386. For this reason, existing
> assembler routines in the RTL avoid using nonvolatile registers as much as possible.
>
> Regards,
> Sergei