[fpc-devel] Thoughts: Make FillChar etc. an intrinsic for specialised performance potential

Tue Apr 19 12:38:49 CEST 2022

If you want to zero small records more efficiently, it might be better to use Default(T) and to look at optimizing the code the compiler generates for it. At the moment it seems to produce an empty temp variable and assign that, instead of simply zeroing the record variable that Default() is being assigned to.
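
As a minimal sketch of what I mean, using the TTestRecord type from the quoted message below. The program name and the helper functions AllFieldsZero and ClearedByDefault are made up for illustration, and {$mode objfpc} is assumed so that Integer is 32 bits wide:

```pascal
program DefaultDemo;
{$mode objfpc}

type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;

// Hypothetical helper: True if every declared field of Rec is zero.
function AllFieldsZero(const Rec: TTestRecord): Boolean;
begin
  Result := (Rec.Field1 = 0) and (Rec.Field2 = 0) and
            (Rec.Field3 = 0) and (Rec.Field4 = 0);
end;

// Hypothetical helper: sets non-zero values, then resets via Default().
function ClearedByDefault: Boolean;
var
  Rec: TTestRecord;
begin
  Rec.Field1 := 1;
  Rec.Field2 := 2;
  Rec.Field3 := 3;
  Rec.Field4 := 4;
  Rec := Default(TTestRecord);  // type-safe reset, no SizeOf needed
  Result := AllFieldsZero(Rec);
end;

begin
  WriteLn('Cleared by Default(): ', ClearedByDefault);
end.
```

The assignment form carries the type with it, so there is no size argument to get wrong, which is what would make it a natural target for better code generation.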

> On 16/04/2022 01:26 J. Gareth Moreton via fpc-devel <fpc-devel at lists.freepascal.org> wrote:
> 
>  
> Hi everyone,
> 
> This is something that sprang to mind when thinking about code speed and 
> the like, and one thing that cropped up is the initialisation of large 
> variables such as arrays or records.  A common means of doing this is, say:
> 
> FillChar(MyVar, SizeOf(MyVar), 0);
> 
> To keep things as general-purpose as possible, this usually results in a 
> function call that decides the best course of action, and for very large 
> blocks of data whose size may not be deterministic (e.g. a file buffer), 
> this is the best approach - the overhead is relatively small and it 
> quickly uses fast block-move instructions.
> 
> However, for small-to-mid-sized variables of known size, this can lead 
> to some inefficiencies, first by not taking into account that the size 
> of the variable is known, but also because the initialisation value is 
> zero, more often than not, and the variable is probably aligned on the 
> stack (so the checks to make sure a pointer is aligned are unnecessary).
> 
> I did a proof of concept on x86_64-win64 with the following record:
> 
> type
>    TTestRecord = record
>      Field1: Byte;
>      Field2, Field3, Field4: Integer;
>    end;
> 
> SizeOf(TTestRecord) is 16 and all the fields are on 4-byte boundaries.  
> Nothing particularly special.
> 
> I then declared a variable of this type and filled the fields with 
> random values, and then ran two different methods to clear their 
> memory.  To get a good speed average, I ran each method 1,000,000,000 
> times in a for-loop.  The first method was:
> 
> FillChar(TestRecord, SizeOf(TestRecord), 0);
> 
> The second method was inline assembly language (which I've called 'the 
> intrinsic'):
> 
> asm
>    PXOR   XMM0, XMM0
>    MOVDQU [RIP+TestRecord], XMM0
> end;
> 
> It's not perfect because the presence of inline assembly prevents the 
> use of register variables (although TestRecord is always on the stack 
> regardless), but the performance hit is barely noticeable in this case, 
> and if the assembly language were inserted by the compiler, the register 
> variable problem wouldn't arise.
> 
> These are my results:
> 
>   FillChar time: 2.398 ns
> 
> Field1 = 0
> Field2 = 0
> Field3 = 0
> Field4 = 0
> 
> Intrinsic time: 1.336 ns
> 
> Field1 = 0
> Field2 = 0
> Field3 = 0
> Field4 = 0
> 
> Sure, it's on the order of nanoseconds, but the intrinsic is almost 
> twice as fast.
> 
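For anyone wanting to reproduce the FillChar half of this measurement, a timing loop along the following lines should do. This is only a sketch under assumptions: the program name and the TimeFillChar helper are hypothetical, GetTickCount64 from SysUtils has millisecond resolution, and the figure includes loop overhead. The inline assembly variant is left out because it is specific to x86_64.

```pascal
program FillCharTiming;
{$mode objfpc}

uses
  SysUtils;

type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;

var
  TestRecord: TTestRecord;

// Hypothetical helper: runs FillChar Iterations times and returns the
// mean cost per call in nanoseconds (loop overhead included).
function TimeFillChar(Iterations: LongInt): Double;
var
  I: LongInt;
  StartTick, Elapsed: QWord;
begin
  StartTick := GetTickCount64;
  for I := 1 to Iterations do
    FillChar(TestRecord, SizeOf(TestRecord), 0);
  Elapsed := GetTickCount64 - StartTick;   // elapsed milliseconds
  Result := Elapsed * 1.0e6 / Iterations;  // ms -> ns, per iteration
end;

begin
  // Same iteration count as in the quoted test.
  WriteLn('FillChar time: ', TimeFillChar(1000000000):0:3, ' ns');
end.
```

The numbers will differ between machines and compiler options, so only the ratio between the two methods on the same setup is meaningful.
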
> In terms of size - FillChar call = 20 bytes:
> 
> 488d0d22080200           lea 0x20822(%rip),%rcx        # 0x100022010
> 4531c0                   xor    %r8d,%r8d
> ba10000000               mov    $0x10,%edx
> e8150a0000               callq  0x100002210 <SYSTEM_$$_FILLCHAR$formal$INT64$BYTE>
> 
> The intrinsic = 12 bytes:
> 
> 660fefc0                 pxor %xmm0,%xmm0
> f30f7f05bd050200         movdqu %xmm0,0x205bd(%rip)        # 0x100022010
> 
> For a 32-byte record instead, an extra 8-byte MOVDQU instruction would 
> be required, so the two would be equal in size, but with the bonus that the 
> intrinsic doesn't have a function call and will probably help 
> optimisation in the rest of the procedure by freeing up the registers 
> used to pass parameters (%rcx, %rdx and %r8 in this case; although the 
> intrinsic will require an MM register in this x86_64 example, they tend 
> to not be used as often).  Also, the peephole optimizer can remove 
> redundant PXOR XMM0, XMM0 instructions, which will help as well if there are 
> multiple FillChar calls.
> 
> I'm not proposing a total rewrite, and I would say that in the default 
> case, it should just fall back to the in-built System functions, but the 
> relevant compiler nodes could be overridden on specific platforms to 
> generate smaller, more optimised code when the sizes and values are 
> known at compile time.
> 
> Now, in this example, it is still faster to simply set the fields 
> manually one-by-one (clocks in at around 1.2 ns), possibly due to the 
> unaligned write (MOVDQU) and internal SSE state switching adding some 
> overhead, but there's nothing to stop the compiler from inserting code 
> in place of the FillChar call to do just that if it thinks it's the 
> fastest method.  Then again, one has to be a little bit careful because 
> FillChar and the intrinsic will also set the filler bytes between Field1 
> and Field2 to 0, whereas manually assigning 0 to the fields won't (so 
> they aren't strictly equivalent and might only be allowed if there are 
> no filler bytes or when compiling under -O4, but the latter may still be 
> dangerous when typecasting is concerned), and extra care would have to 
> be taken when unions are concerned (sorry, 'union' is a C term - 
> what's the official Pascal term again?).
> 
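The filler-byte difference can be made visible with a small test. This is a sketch only: the program name and the helpers HasNonZeroByte, FieldwiseLeavesPadding and FillCharLeavesPadding are hypothetical, and it assumes {$mode objfpc} with default record alignment (three filler bytes after Field1) and a compiler that does not merge the field stores into wider writes.

```pascal
program PaddingDemo;
{$mode objfpc}

type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;

// Hypothetical helper: True if any byte of Rec, filler included, is non-zero.
function HasNonZeroByte(const Rec: TTestRecord): Boolean;
var
  I: Integer;
begin
  Result := False;
  for I := 0 to SizeOf(TTestRecord) - 1 do
    if PByte(@Rec)[I] <> 0 then
      Exit(True);
end;

// Poisons the record with $FF, then clears it field by field.
function FieldwiseLeavesPadding: Boolean;
var
  Rec: TTestRecord;
begin
  FillChar(Rec, SizeOf(Rec), $FF);  // every byte, filler included
  Rec.Field1 := 0;
  Rec.Field2 := 0;
  Rec.Field3 := 0;
  Rec.Field4 := 0;
  Result := HasNonZeroByte(Rec);    // filler bytes were never written
end;

// Poisons the record with $FF, then clears it with FillChar.
function FillCharLeavesPadding: Boolean;
var
  Rec: TTestRecord;
begin
  FillChar(Rec, SizeOf(Rec), $FF);
  FillChar(Rec, SizeOf(Rec), 0);    // clears all SizeOf(Rec) bytes
  Result := HasNonZeroByte(Rec);
end;

begin
  WriteLn('Field-wise clear leaves non-zero bytes: ', FieldwiseLeavesPadding);
  WriteLn('FillChar leaves non-zero bytes:         ', FillCharLeavesPadding);
end.
```

Code that compares or hashes such records as raw memory would notice the difference, which is why substituting field-wise stores for FillChar is not automatically safe.
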
> Actual Pascal calls to FillChar would not change in any way and so 
> theoretically it won't break existing code.  The only drawback is that 
> the intrinsic and the internal System functions would have to be named 
> the same so constructs such as "FuncPtr := @FillChar;" as well as 
> calling FillChar from assembler routines still work, and the compiler 
> would have to know how to differentiate between the two.
> 
> Just on the surface, what are your thoughts?
> 
> Gareth aka. Kit