[fpc-devel] FillWord, FillDWord and FillQWord are very poorly optimised on Win64 (not sure about x86-64 on Linux)
J. Gareth Moreton
gareth at moreton-family.com
Wed Nov 1 05:58:18 CET 2017
So I've been doing some playing around recently, and noticed that while FillChar has some very fast internal
code for initialising a block of memory, making use of non-temporal hints and memory fences, the versions
for the larger types (FillWord, FillDWord and FillQWord) fall back to slow Pascal code. To showcase this, I
ran a test on my 6-year-old laptop that compared a small and fairly basic assembler routine against the
internal functions (times are averaged over 100 iterations):
FillWord - initialise 16,777,216 words to 0
- Internal: 8177.209 µs
- Assembler: 4234.131 µs
FillWord - initialise 1,048,576 words to $AAAA
- Internal: 153.221 µs
- Assembler: 86.496 µs
FillWord - initialise 1,229 words to $5555
- Internal: 0.267 µs
- Assembler: 0.135 µs
FillDWord - initialise 16,777,216 DWords to 0
- Internal: 15552.032 µs
- Assembler: 10945.809 µs
FillDWord - initialise 1,048,576 DWords to $AAAAAAAA
- Internal: 902.060 µs
- Assembler: 470.788 µs
FillDWord - initialise 1,229 DWords to $55555555
- Internal: 0.357 µs
- Assembler: 0.275 µs
FillQWord - initialise 16,777,216 QWords to 0
- Internal: 33397.248 µs
- Assembler: 17488.901 µs
FillQWord - initialise 1,048,576 QWords to $AAAAAAAAAAAAAAAA
- Internal: 2130.116 µs
- Assembler: 1258.130 µs
FillQWord - initialise 1,229 QWords to $5555555555555555
- Internal: 0.739 µs
- Assembler: 0.402 µs
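For anyone wanting to reproduce figures of this kind, a minimal timing sketch might look like the following. It is not the harness used for the numbers above; the program name, constants and variable names are illustrative assumptions, and GetTickCount64 only has millisecond resolution, so the small 1,229-element cases would need a higher-resolution timer and many more iterations.

```pascal
program FillBench;
{$MODE OBJFPC}
uses
  SysUtils;

const
  Iterations = 100;      { same averaging count as quoted above }
  WordCount  = 1048576;  { one of the buffer sizes from the results }

var
  Buffer: array of Word;
  StartTick, Elapsed: QWord;
  AvgMicros: Double;
  i: Integer;
begin
  SetLength(Buffer, WordCount);

  { Time the RTL's FillWord; swap in another routine with the same
    signature to compare it against the internal one. }
  StartTick := GetTickCount64;
  for i := 1 to Iterations do
    FillWord(Buffer[0], WordCount, $AAAA);
  Elapsed := GetTickCount64 - StartTick;   { total elapsed milliseconds }

  AvgMicros := (Elapsed * 1000.0) / Iterations;
  WriteLn('FillWord average: ', AvgMicros:0:3, ' us');
end.
```

Replacing the FillWord call with SizeOptimisedFillWord (same parameter list) would give the comparison figure for the same buffer.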
The assembler functions were as follows:
{$ASMMODE INTEL}
procedure SizeOptimisedFillWord(var x; count: SizeInt; Value: Word); assembler; nostackframe;
asm
  { RCX = Pointer to x
    RDX = Count
    R8W = Value }
  PUSH RDI           { RDI is callee-saved on Win64 }
  MOV  AX, R8W       { STOSW stores AX }
  MOV  RDI, RCX      { destination pointer }
  MOV  RCX, RDX      { repeat count }
  REP STOSW
  POP  RDI
end;

procedure SizeOptimisedFillDWord(var x; count: SizeInt; Value: DWord); assembler; nostackframe;
asm
  { RCX = Pointer to x
    RDX = Count
    R8D = Value }
  PUSH RDI           { RDI is callee-saved on Win64 }
  MOV  EAX, R8D      { STOSD stores EAX }
  MOV  RDI, RCX      { destination pointer }
  MOV  RCX, RDX      { repeat count }
  REP STOSD
  POP  RDI
end;

procedure SizeOptimisedFillQWord(var x; count: SizeInt; Value: QWord); assembler; nostackframe;
asm
  { RCX = Pointer to x
    RDX = Count
    R8 = Value }
  PUSH RDI           { RDI is callee-saved on Win64 }
  MOV  RAX, R8       { STOSQ stores RAX }
  MOV  RDI, RCX      { destination pointer }
  MOV  RCX, RDX      { repeat count }
  REP STOSQ
  POP  RDI
end;
I also made versions that use memory fences and other checks, such as memory alignment, in order to gain
speed. I've converted them to use the System V ABI of Linux as well, but those are currently completely
untested, as I don't yet have the facilities to compile on Linux. They are also even smaller in code size,
because you don't need to push and pop RDI and the destination (var x) is already stored in RDI, which
collapses each routine to just 3 instructions (not including the REP prefix).
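To illustrate the System V point, here is an untested sketch of what the DWord variant might look like on x86-64 Linux. The procedure name is hypothetical, and the register assignments assume the standard System V order for the first three integer arguments (RDI, RSI, RDX).

```pascal
{$ASMMODE INTEL}
{ Hypothetical System V counterpart of SizeOptimisedFillDWord; untested. }
procedure SizeOptimisedFillDWordSysV(var x; count: SizeInt; Value: DWord); assembler; nostackframe;
asm
  { RDI = Pointer to x (already where STOSD expects it)
    RSI = Count
    EDX = Value }
  MOV  EAX, EDX      { STOSD stores EAX }
  MOV  RCX, RSI      { repeat count }
  REP STOSD
end;
```

RDI is a scratch register under System V, so nothing has to be saved, and the Word and QWord variants would differ only in the register width (AX/DX, RAX/RDX) and the STOSW/STOSQ mnemonic.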
Would it be worth opening a bug report for this, with the attached assembler routines as suggestions? I
haven't worked out how to implement internal functions in the compiler yet, and I'd rather clear it with you
guys first before I make such an addition. My thought was that the simple routines above could be used
when compiling for small code size, while larger, more advanced ones are used when compiling for speed.
Yours faithfully,
J. Gareth "Kit" Moreton