[fpc-devel] FillWord, FillDWord and FillQWord are very poorly optimised on Win64 (not sure about x86-64 on Linux)

Wed Nov 1 18:23:28 CET 2017

Thanks for the feedback, everyone.  I wasn't sure whether these were internal functions: FillWord, for example,
is surrounded by "{$ifndef FPC_SYSTEM_HAS_FILLWORD}", which isn't defined under Win64, whereas
FPC_SYSTEM_HAS_FILLCHAR is defined, yet the implementation of FillChar is nowhere to be found when you
search for it in Lazarus.
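
For anyone unfamiliar with the guard, the pattern can be sketched as below. This is a self-contained illustration, not the RTL source: DEMO_SYSTEM_HAS_FILLWORD and DemoFillWord are made-up stand-ins for FPC_SYSTEM_HAS_FILLWORD and FillWord, and both "include files" are collapsed into one program.

```pascal
program GuardSketch;
{$mode objfpc}

{ Hypothetical stand-in for a CPU-specific include file: when it provides
  its own routine, it also defines the guard symbol. Remove the space after
  the opening brace on the next line to simulate such a platform. }
{ $define DEMO_SYSTEM_HAS_FILLWORD}

{$ifdef DEMO_SYSTEM_HAS_FILLWORD}
procedure DemoFillWord(var x; count: SizeInt; Value: Word);
begin
  { A platform-specific implementation would live here;
    it is deliberately left empty in this sketch. }
end;
{$endif}

{ Stand-in for the generic include: only compiled when no CPU-specific
  version has claimed the name. }
{$ifndef DEMO_SYSTEM_HAS_FILLWORD}
procedure DemoFillWord(var x; count: SizeInt; Value: Word);
var
  p: PWord;
  i: SizeInt;
begin
  p := @x;
  for i := 0 to count - 1 do
    p[i] := Value;
end;
{$endif}

var
  buf: array[0..3] of Word;
begin
  DemoFillWord(buf, Length(buf), $ABCD);
  WriteLn(HexStr(buf[0], 4), ' ', HexStr(buf[3], 4));
end.
```

With the symbol undefined, as FPC_SYSTEM_HAS_FILLWORD is on Win64, the generic Pascal loop is the one that gets compiled.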

For the speed-optimised assembler routines, I have the following (which does borrow ideas from FillChar):

    procedure SpeedOptimisedFillWord(var x; count: SizeInt; Value: Word); assembler; nostackframe;
    asm
      { RCX = Pointer to x
        RDX = Count
        R8W = Value }
      PUSH RDI
      MOVZX RAX, R8W
      MOV  R9,  $0001000100010001
      MOV  RDI, RCX
      IMUL RAX, R9

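      { Fewer than 4 words: skip the alignment writes, which could otherwise run past the end of the
        block (unaligned stores are slower but still valid on x86-64) }
      CMP  RDX, $4
      JB   @Aligned8

      { Do some memory alignment first (it should be at least aligned to a 16-bit boundary already) }
      AND  CL,  $6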
      JZ   @Aligned8
      TEST CL,  $2
      JZ   @Aligned4
      MOV  [RDI], R8W
      DEC  RDX
      ADD  RDI, $2
      TEST CL,  $4
      JNZ  @Aligned8 { Jump on NOT zero: if TEST CL, $4 sets ZF here, the block originally started
                       2 bytes past an 8-byte boundary, so a DWord write is still needed }
    @Aligned4:
      MOV  [RDI], EAX
      SUB  RDX, $2
      ADD  RDI, $4

    @Aligned8:
      MOV  R10B,DL
      SHR  RDX, 2
      AND  R10B,$3
      MOV  RCX, RDX
      CMP  RDX, $80000
      JB   @NoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
      SHR  RDX, 2
      AND  RCX, $3

    { Write 32 bytes at a time using a non-temporal hint }
    @BlockLoop:
      ADD  RDI, $20
      MOVNTI [RDI-$20], RAX
      MOVNTI [RDI-$18], RAX
      DEC  RDX
      MOVNTI [RDI-$10], RAX
      MOVNTI [RDI-$8], RAX
      JNZ  @BlockLoop
      MFENCE

    @NoBlocks:
      SHR  R10B, 1
      REP  STOSQ
      JNC  @NoLooseWord
      MOV  [RDI], R8W
      LEA  RDI, [RDI+2]

    @NoLooseWord:
      JZ   @NoLooseDWord
      MOV  [RDI], EAX

    @NoLooseDWord:
      POP  RDI
    end;

    procedure SpeedOptimisedFillDWord(var x; count: SizeInt; Value: DWord); assembler; nostackframe;
    asm
      { RCX = Pointer to x
        RDX = Count
        R8D = Value }
      PUSH RDI
      MOV  RAX, R8
      MOV  RDI, RCX
      SHL  RAX, 32
      OR   RAX, R8

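      { Nothing to fill: skip the alignment write, which would otherwise store one DWord regardless }
      TEST RDX, RDX
      JZ   @Aligned8

      { Do some memory alignment first (it should be at least aligned to a 32-bit boundary already) }
      AND  CL,  $4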
      JZ   @Aligned8
      MOV  [RDI], R8D
      DEC  RDX
      ADD  RDI, $4

    @Aligned8:
      SHR  RDX, 1
      SETC R10B
      MOV  RCX, RDX
      CMP  RDX, $80000
      JB   @NoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
      SHR  RDX, 2
      AND  RCX, $3
    { Write 32 bytes at a time using a non-temporal hint }
    @BlockLoop:
      ADD  RDI, $20
      MOVNTI [RDI-$20], RAX
      MOVNTI [RDI-$18], RAX
      DEC  RDX
      MOVNTI [RDI-$10], RAX
      MOVNTI [RDI-$8], RAX
      JNZ  @BlockLoop
      MFENCE
    @NoBlocks:
      TEST R10B, R10B
      REP  STOSQ

      JZ   @NoLooseDWord
      MOV  [RDI], EAX

    @NoLooseDWord:
      POP  RDI
    end;

    procedure SpeedOptimisedFillQWord(var x; count: SizeInt; Value: QWord); assembler; nostackframe;
    asm
      { RCX = Pointer to x
        RDX = Count
        R8  = Value }
      PUSH RDI
      CMP  RDX, $80000
      MOV  RDI, RCX
      MOV  RCX, RDX
      JB   @NoBlocks { Too small for the non-temporal hint to be worthwhile, so just use STOSQ }
      AND  RCX, $3
      SHR  RDX, 2
      JZ   @NoBlocks
    { Write 32 bytes at a time using a non-temporal hint }
    @BlockLoop:
      ADD  RDI, $20
      MOVNTI [RDI-$20], R8
      MOVNTI [RDI-$18], R8
      DEC  RDX
      MOVNTI [RDI-$10], R8
      MOVNTI [RDI-$8], R8
      JNZ  @BlockLoop
      MFENCE
    @NoBlocks:
      MOV  RAX, R8
      REP  STOSQ
      POP  RDI
    end;
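
To make the structure of the word routine easier to follow and to check, here is a rough plain-Pascal model of the same head/body/tail idea, together with a small harness. It is only a sketch: ModelFillWord and CheckFill are hypothetical names, the model aligns one word at a time instead of using the word-then-DWord step, and it leaves out the non-temporal block loop entirely.

```pascal
program FillWordModel;
{$mode objfpc}
{$assertions on}

{ Plain-Pascal model of the head/body/tail structure; not a replacement
  for the assembler version. }
procedure ModelFillWord(var x; count: SizeInt; Value: Word);
var
  p: PByte;
  pattern: QWord;
begin
  if count <= 0 then
    Exit;
  p := @x;
  { Replicate the 16-bit value into all four lanes of a 64-bit word,
    which is what IMUL RAX, R9 does with R9 = $0001000100010001. }
  pattern := QWord(Value) * QWord($0001000100010001);

  { Head: single words until the address is 8-byte aligned. }
  while (count > 0) and ((PtrUInt(p) and 7) <> 0) do
  begin
    PWord(p)^ := Value;
    Inc(p, 2);
    Dec(count);
  end;

  { Body: four words per 64-bit store. }
  while count >= 4 do
  begin
    PQWord(p)^ := pattern;
    Inc(p, 8);
    Dec(count, 4);
  end;

  { Tail: up to three leftover words. }
  while count > 0 do
  begin
    PWord(p)^ := Value;
    Inc(p, 2);
    Dec(count);
  end;
end;

{ Returns True when filling `count` words starting at word index `offset`
  of a guarded buffer changes exactly the requested range and nothing else. }
function CheckFill(offset, count: SizeInt): Boolean;
const
  Guard = $5A5A;
  Value = $BEEF;
var
  buf: array[0..63] of Word;
  i: SizeInt;
begin
  for i := Low(buf) to High(buf) do
    buf[i] := Guard;
  ModelFillWord(buf[offset], count, Value);
  Result := True;
  for i := Low(buf) to High(buf) do
    if (i >= offset) and (i < offset + count) then
      Result := Result and (buf[i] = Value)
    else
      Result := Result and (buf[i] = Guard);
end;

var
  offset, count: SizeInt;
begin
  { Every start offset from 0 to 7 words, every short length. }
  for offset := 0 to 7 do
    for count := 0 to 40 do
      Assert(CheckFill(offset, count), 'mismatch');
  WriteLn('all fills matched');
end.
```

The same harness can be pointed at an assembler routine by swapping the call inside CheckFill; very small counts at unaligned offsets are the cases most worth checking.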

Regarding the CFI annotations, these functions actually fare even better under Linux x64: RDI is volatile
there and doesn't need to be pushed and popped, and those were the only operations that modified the stack
pointer. Since the routines above don't call any other procedures, we can use "nostackframe" safely.
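
As a rough illustration of that difference, a stripped-down QWord fill might look like the sketch below. SketchFillQWord is a hypothetical name; the code assumes FPC's WIN64 define and Intel assembler mode, keeps only the REP STOSQ path, and, like the routines above, carries no SEH annotation for the Win64 PUSH.

```pascal
program AbiSketch;
{$mode objfpc}
{$asmmode intel}

{ Hypothetical minimal routine showing only the ABI difference: where the
  parameters arrive and whether RDI has to be preserved. }
procedure SketchFillQWord(var x; count: SizeInt; Value: QWord); assembler; nostackframe;
asm
{$ifdef WIN64}
  { Win64: RCX = pointer, RDX = count, R8 = value; RDI is nonvolatile }
  PUSH RDI
  MOV  RDI, RCX
  MOV  RCX, RDX
  MOV  RAX, R8
{$else}
  { System V (Linux/BSD): RDI = pointer, RSI = count, RDX = value;
    RDI is volatile and already holds the destination }
  MOV  RCX, RSI
  MOV  RAX, RDX
{$endif}
  TEST RCX, RCX
  JLE  @Done       { nothing to do for count <= 0 }
  REP  STOSQ
@Done:
{$ifdef WIN64}
  POP  RDI
{$endif}
end;

var
  buf: array[0..4] of QWord;
begin
  buf[4] := 0;
  SketchFillQWord(buf, 4, QWord($0123456789ABCDEF));
  { The first four elements hold the value; the fifth stays untouched. }
  WriteLn(HexStr(buf[0], 16), ' ', HexStr(buf[3], 16), ' ', buf[4]);
end.
```

On the System V side nothing touches the stack pointer at all, so there is nothing left that would need an annotation.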

I am tempted to experiment a little further, because one thing that's guaranteed to be present under x64 is 
SSE2, so it may be possible to increase the speed even more, although at the same time there may be a 
performance penalty if the rest of the application uses AVX or floating-point SSE.

J. Gareth "Kit" Moreton

On Wed 01/11/17 11:03, Sergei Gorelkin via fpc-devel (fpc-devel at lists.freepascal.org) sent:
> 01.11.2017 10:46, Sven Barth via fpc-devel wrote:
> > On 01.11.2017 05:58, "J. Gareth Moreton" (eth at moreton-family.com) wrote:
> >
> >     Would it be worth opening up a bug report for this, with the attached assembler routines as
> >     suggestions? I haven't worked out how to implement internal functions into the compiler yet,
> >     and I'd rather clear it with you guys first before I make such an addition.  I had a thought
> >     that the simple routines above could be used for when compiling for small code size, while
> >     larger, more advanced ones are used for when compiling for speed.
> >
> > Improvements like these are always welcome. Two points however:
> > The Fill* routines are not part of the compiler, but of the RTL (the Pascal routines are in
> > rtl/inc/generic.inc, the assembly ones reside in rtl/CPU/CPU.inc) and they aren't handled
> > differently depending on the current optimization flags, so a one-size-fits-all is needed (look at
> > e.g. the i386 ones).
> > I also think that you might need to handle memory that isn't correctly aligned for the assembler
> > instructions (I didn't look at your routines in detail so I don't know whether they'd need to be
> > adjusted for that). A check of the i386 routines will probably help here as well.
>
> Another important thing to note is that all modifications to the stack pointer and nonvolatile
> registers on x86_64 need SEH annotations on win64 and CFI annotations on linux/bsd. The former is
> available only in AT&T syntax, the latter is not supported.
>
> This requirement, together with different parameter locations, makes writing assembler routines
> for x86_64 much more complicated than for i386. For this reason, existing assembler routines in
> the RTL avoid using nonvolatile registers as much as possible.
>
> Regards,
> Sergei