[fpc-devel] Thoughts: Make FillChar etc. an intrinsic for specialised performance potential

Sun Apr 17 10:59:47 CEST 2022

Florian Klämpfl via fpc-devel <fpc-devel at lists.freepascal.org> wrote on
Sat, 16 Apr 2022, 21:00:

>
>
> > On 16.04.2022 at 01:26, J. Gareth Moreton via fpc-devel <
> fpc-devel at lists.freepascal.org> wrote:
> >
> > Hi everyone,
> >
> > This is something that sprang to mind when thinking about code speed and
> the like, and one thing that cropped up is the initialisation of large
> variables such as arrays or records.  A common means of doing this is, say:
> >
> > FillChar(MyVar, SizeOf(MyVar), 0);
> >
> > To keep things as general-purpose as possible, this usually results in a
> function call that decides the best course of action, and for very large
> blocks of data whose size may not be deterministic (e.g. a file buffer),
> this is the best approach - the overhead is relatively small and it quickly
> uses fast block-move instructions.
> >
> > However, for small-to-mid-sized variables of known size, this can lead
> to some inefficiencies, first by not taking into account that the size of
> the variable is known, but also because the initialisation value is zero,
> more often than not, and the variable is probably aligned on the stack (so
> the checks to make sure a pointer is aligned are unnecessary).
> >
> > I did a proof of concept on x86_64-win64 with the following record:
> >
> > type
> >   TTestRecord = record
> >     Field1: Byte;
> >     Field2, Field3, Field4: Integer;
> >   end;
> >
> > SizeOf(TTestRecord) is 16 and all the fields are on 4-byte boundaries.
> Nothing particularly special.
> >
> > I then declared a variable of this type and filled the fields with
> random values, and then ran two different methods to clear their memory.
> To get a good speed average, I ran each method 1,000,000,000 times in a
> for-loop.  The first method was:
> >
> > FillChar(TestRecord, SizeOf(TestRecord), 0);
> >
> > The second method was inline assembly language (which I've called 'the
> intrinsic'):
> >
> > asm
> >   PXOR   XMM0, XMM0
> >   MOVDQU [RIP+TestRecord], XMM0
> > end;
> >
> > It's not perfect because the presence of inline assembly prevents the
> use of register variables (although TestRecord is always on the stack
> regardless), but the performance hit is barely noticeable in this case, and
> if the assembly language were inserted by the compiler, the register
> variable problem won't arise.
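
For anyone wanting to reproduce the FillChar half of that measurement, a minimal sketch of such a timing loop could look like the one below. The program name, the use of GetTickCount64 from SysUtils and the reporting format are my assumptions, not the original test program, and the numbers will depend on CPU, target and optimisation settings.

```pascal
program FillCharBench;
{$mode objfpc}  // assumed mode: Integer is 32-bit here, so SizeOf(TTestRecord) = 16

uses
  SysUtils;  // for GetTickCount64

type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;

const
  Iterations = 1000000000;  // same loop count as quoted above

var
  TestRecord: TTestRecord;
  I: Integer;
  StartTick, ElapsedMs: QWord;
begin
  // Give the fields non-zero contents before clearing them.
  TestRecord.Field1 := 1;
  TestRecord.Field2 := 2;
  TestRecord.Field3 := 3;
  TestRecord.Field4 := 4;

  StartTick := GetTickCount64;
  for I := 1 to Iterations do
    FillChar(TestRecord, SizeOf(TestRecord), 0);
  ElapsedMs := GetTickCount64 - StartTick;

  // Milliseconds in total -> nanoseconds per iteration.
  WriteLn('FillChar time: ', (ElapsedMs * 1.0e6) / Iterations:0:3, ' ns');
  WriteLn('Field1 = ', TestRecord.Field1);
  WriteLn('Field2 = ', TestRecord.Field2);
  WriteLn('Field3 = ', TestRecord.Field3);
  WriteLn('Field4 = ', TestRecord.Field4);
end.
```

The loop overhead is included in the figure, so it is only meaningful as a comparison against a second loop of the same shape.
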
> >
> > These are my results:
> >
> >  FillChar time: 2.398 ns
> >
> > Field1 = 0
> > Field2 = 0
> > Field3 = 0
> > Field4 = 0
> >
> > Intrinsic time: 1.336 ns
> >
> > Field1 = 0
> > Field2 = 0
> > Field3 = 0
> > Field4 = 0
> >
> > Sure, it's on the order of nanoseconds, but the intrinsic is almost
> twice as fast.
> >
> > In terms of size - FillChar call = 20 bytes:
> >
> > 488d0d22080200           lea 0x20822(%rip),%rcx        # 0x100022010
> > 4531c0                   xor    %r8d,%r8d
> > ba10000000               mov    $0x10,%edx
> > e8150a0000               callq  0x100002210
> <SYSTEM_$$_FILLCHAR$formal$INT64$BYTE>
> >
> > The intrinsic = 12 bytes:
> >
> > 660fefc0                 pxor %xmm0,%xmm0
> > f30f7f05bd050200         movdqu %xmm0,0x205bd(%rip)        # 0x100022010
> >
> > For a 32-byte record instead, an extra 8-byte MOVDQU instruction would
> be required, so the two would be equal in size, but with the bonus that the
> intrinsic doesn't have a function call and will probably help optimisation
> in the rest of the procedure by freeing up the registers used to pass
> parameters (%rcx, %rdx and %r8 in this case; although the intrinsic will
> require an MM register in this x86_64 example, they tend to not be used as
> often).  Also, the peephole optimizer can remove redundant PXOR XMM0, XMM0
> instructions, which will help as well if there are multiple FillChar calls.
> >
> > I'm not proposing a total rewrite, and I would say that in the default
> case, it should just fall back to the in-built System functions, but the
> relevant compiler nodes could be overridden on specific platforms to
> generate smaller, more optimised code when the sizes and values are known
> at compile time.
> >
> > Now, in this example, it is still faster to simply set the fields
> manually one-by-one (clocks in at around 1.2 ns), possibly due to the
> unaligned write (MOVDQU) and internal SSE state switching adding some
> overhead, but there's nothing to stop the compiler from inserting code in
> place of the FillChar call to do just that if it thinks it's the fastest
> method.  Then again, one has to be a little bit careful because FillChar
> and the intrinsic will also set the filler bytes between Field1 and Field2
> to 0, whereas manually assigning 0 to the fields won't (so they aren't
> strictly equivalent and might only be allowed if there are no filler bytes
> or when compiling under -O4, but the latter may still be dangerous when
> typecasting is concerned), and extra care would have to be taken when
> unions are concerned (sorry, 'union' that's a C term - what's the official
> Pascal term again?).
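
The filler-byte difference described above can be made visible with a small test. This is only a sketch: the program name, the $FF poison value and the byte-wise "absolute" overlay are my choices for illustration, and the exact layout depends on the target and the record alignment settings in effect.

```pascal
program PaddingDemo;
{$mode objfpc}  // assumed mode: Integer is 32-bit, matching the record above

type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;

var
  Rec: TTestRecord;
  // Byte-wise view of the same memory, so the filler bytes can be inspected.
  Raw: array[0..SizeOf(TTestRecord) - 1] of Byte absolute Rec;

procedure Dump(const Caption: string);
var
  I: Integer;
begin
  Write(Caption, ':');
  for I := Low(Raw) to High(Raw) do
    Write(' ', HexStr(Raw[I], 2));
  WriteLn;
end;

begin
  // Poison every byte, then clear the record field by field.
  FillChar(Rec, SizeOf(Rec), $FF);
  Rec.Field1 := 0;
  Rec.Field2 := 0;
  Rec.Field3 := 0;
  Rec.Field4 := 0;
  Dump('field-by-field');

  // Poison again, then clear the whole block at once.
  FillChar(Rec, SizeOf(Rec), $FF);
  FillChar(Rec, SizeOf(Rec), 0);
  Dump('FillChar      ');
end.
```

With the 16-byte layout quoted above, I would expect the bytes between Field1 and Field2 to keep their poison value in the first dump and to be zero in the second, which is exactly why the two approaches are not interchangeable for code that compares or hashes the raw memory.
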
> >
> > Actual Pascal calls to FillChar would not change in any way and so
> theoretically it won't break existing code.  The only drawback is that the
> intrinsic and the internal System functions would have to be named the same
> so constructs such as "FuncPtr := @FillChar;" as well as calling FillChar
> from assembler routines still work, and the compiler would have to know
> how to differentiate between the two.
> >
> > Just on the surface, what are your thoughts?
>
> Inlining FillChar is for sure useful (same for move). The FillChar in the
> system unit could stay, the compiler could just replace a call to
> System.FillChar by some compiler generated assembler doing the FillChar.
>

But we should have a general mechanism for that, not something that just
handles FillChar.

Regards,
Sven

>