[fpc-devel] FillWord, FillDWord and FillQWord are very poorly optimised on Win64 (not sure about x86-64 on Linux)

J. Gareth Moreton gareth at moreton-family.com
Wed Nov 1 05:58:18 CET 2017


So I've been doing some playing around recently, and noticed that while FillChar has some very fast internal 
code for initialising a block of memory, making use of non-temporal hints and memory fences, the versions 
for the larger types fall back to slow Pascal code.  To showcase this, I ran a test on my 6-year-old laptop 
that compared a small and slightly basic assembler routine against the internal functions (times are 
averaged over 100 iterations):

FillWord - initialise 16,777,216 words to 0

- Internal: 8177.209 µs
- Assembler: 4234.131 µs

FillWord - initialise 1,048,576 words to $AAAA

- Internal: 153.221 µs
- Assembler: 86.496 µs

FillWord - initialise 1,229 words to $5555

- Internal: 0.267 µs
- Assembler: 0.135 µs

FillDWord - initialise 16,777,216 DWords to 0

- Internal: 15552.032 µs
- Assembler: 10945.809 µs

FillDWord - initialise 1,048,576 DWords to $AAAAAAAA

- Internal: 902.060 µs
- Assembler: 470.788 µs

FillDWord - initialise 1,229 DWords to $55555555

- Internal: 0.357 µs
- Assembler: 0.275 µs

FillQWord - initialise 16,777,216 QWords to 0

- Internal: 33397.248 µs
- Assembler: 17488.901 µs

FillQWord - initialise 1,048,576 QWords to $AAAAAAAAAAAAAAAA

- Internal: 2130.116 µs
- Assembler: 1258.130 µs

FillQWord - initialise 1,229 QWords to $5555555555555555

- Internal: 0.739 µs
- Assembler: 0.402 µs
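
For anyone who wants to reproduce figures like these, here is a minimal sketch of an averaging harness. It is not the harness used for the numbers above: the program name, the constants and the choice of GetTickCount64 from SysUtils are my own assumptions, and that timer only has millisecond resolution, so the small 1,229-element cases would need a finer clock or many more iterations.

```pascal
program FillWordTimingSketch;

{$MODE OBJFPC}

uses
  SysUtils;

const
  Iterations = 100;      { same averaging count as quoted in the post }
  WordCount  = 1048576;  { one of the block sizes from the results }

var
  Buf: array of Word;
  i: Integer;
  StartTick, Elapsed: QWord;
begin
  SetLength(Buf, WordCount);

  { Touch the whole buffer once so page faults are not part of the timing }
  FillWord(Buf[0], WordCount, 0);

  StartTick := GetTickCount64;
  for i := 1 to Iterations do
    FillWord(Buf[0], WordCount, $AAAA);
  Elapsed := GetTickCount64 - StartTick;  { total elapsed milliseconds }

  { Convert milliseconds to microseconds and average over the iterations }
  WriteLn('Average: ', (Elapsed * 1000.0) / Iterations : 0 : 3, ' us');
end.
```

Swapping the FillWord call inside the loop for one of the assembler routines gives the comparison figure for the same block size.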


The assembler functions were as follows:
    {$ASMMODE INTEL}

    procedure SizeOptimisedFillWord(var x; count: SizeInt; Value: Word); assembler; nostackframe;
    asm
      { RCX = Pointer to x
        RDX = Count
        R8W = Value }
      PUSH RDI
      MOV  AX,  R8W
      MOV  RDI, RCX
      MOV  RCX, RDX
      REP  STOSW
      POP  RDI
    end;

    procedure SizeOptimisedFillDWord(var x; count: SizeInt; Value: DWord); assembler; nostackframe;
    asm
      { RCX = Pointer to x
        RDX = Count
        R8D = Value }
      PUSH RDI
      MOV  EAX, R8D
      MOV  RDI, RCX
      MOV  RCX, RDX
      REP  STOSD
      POP  RDI
    end;

    procedure SizeOptimisedFillQWord(var x; count: SizeInt; Value: QWord); assembler; nostackframe;
    asm
      { RCX = Pointer to x
        RDX = Count
        R8  = Value }
      PUSH RDI
      MOV  RAX, R8
      MOV  RDI, RCX
      MOV  RCX, RDX
      REP  STOSQ
      POP  RDI
    end;


I also made versions that use memory fences and other checks, such as memory alignment, in order to gain speed. 
I've converted them to use the System V ABI of Linux as well, but those are currently completely untested as I 
don't yet have the facilities to compile on Linux. The System V versions are also even smaller in code size: 
you don't need to push and pop RDI, and the destination (var x) is already stored in RDI, which collapses each 
routine to just 3 instructions (not including the REP prefix).
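
As a rough illustration of that three-instruction form, here is a sketch of what a System V version of the Word routine might look like. The name SysVFillWord and the surrounding test program are hypothetical, the register assignment (RDI, RSI, DX) is the standard System V order for the first three integer arguments, and like the converted routines mentioned above it has not been verified on a Linux machine.

```pascal
program SysVFillWordSketch;

{$MODE OBJFPC}
{$ASMMODE INTEL}

{$IF not defined(CPUX86_64) or defined(WIN64)}
  {$FATAL This sketch assumes x86-64 with the System V ABI (e.g. Linux)}
{$ENDIF}

{ Hypothetical System V counterpart of SizeOptimisedFillWord (untested) }
procedure SysVFillWord(var x; count: SizeInt; Value: Word); assembler; nostackframe;
asm
  { RDI = Pointer to x (already where STOSW expects it)
    RSI = Count
    DX  = Value }
  MOV  AX,  DX
  MOV  RCX, RSI
  REP  STOSW
end;

var
  Buf: array[0..7] of Word;
  i: Integer;
begin
  { Zero the buffer, then fill only the first six words }
  FillChar(Buf, SizeOf(Buf), 0);
  SysVFillWord(Buf, 6, $5555);

  for i := 0 to 7 do
    WriteLn(i, ': $', HexStr(Buf[i], 4));
end.
```

The DWord and QWord variants would follow the same pattern with EAX/EDX and STOSD, or RAX/RDX and STOSQ.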

Would it be worth opening a bug report for this, with the attached assembler routines as suggestions? I 
haven't worked out how to implement internal functions in the compiler yet, and I'd rather clear it with you 
guys first before I make such an addition.  I had a thought that the simple routines above could be used 
when compiling for small code size, while larger, more advanced ones are used when compiling for speed.

Yours faithfully,

J. Gareth "Kit" Moreton