[fpc-devel] Thoughts: Make FillChar etc. an intrinsic for specialised performance potential

Sat Apr 16 01:26:52 CEST 2022

Hi everyone,

This is something that sprang to mind while I was thinking about code 
speed and the like: the initialisation of large variables such as 
arrays or records.  A common means of doing this is, say:

FillChar(MyVar, SizeOf(MyVar), 0);

To keep things as general-purpose as possible, this usually results in a 
function call that decides the best course of action, and for very large 
blocks of data whose size may not be deterministic (e.g. a file buffer), 
this is the best approach - the overhead is relatively small and it 
quickly uses fast block-move instructions.

However, for small-to-mid-sized variables of known size, this can lead 
to some inefficiencies, first by not taking into account that the size 
of the variable is known, but also because the initialisation value is 
zero, more often than not, and the variable is probably aligned on the 
stack (so the checks to make sure a pointer is aligned are unnecessary).

I did a proof of concept on x86_64-win64 with the following record:

type
   TTestRecord = record
     Field1: Byte;
     Field2, Field3, Field4: Integer;
   end;

SizeOf(TTestRecord) is 16 and all the fields are on 4-byte boundaries.  
Nothing particularly special.
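
For reference, the layout can be checked with a minimal sketch like the 
one below.  It assumes a compiler mode in which Integer is 32 bits wide 
(objfpc is used here; in the default fpc mode Integer is 16 bits and the 
numbers would differ) and default record alignment.  The program name is 
only illustrative.

```pascal
program RecordLayout;
{$mode objfpc}  // assumption: a mode where Integer is 32 bits wide

type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;

var
  R: TTestRecord;

begin
  WriteLn('SizeOf(TTestRecord) = ', SizeOf(TTestRecord));
  // Each offset is the field's address minus the address of the record.
  WriteLn('Field1 offset = ', PtrUInt(@R.Field1) - PtrUInt(@R));
  WriteLn('Field2 offset = ', PtrUInt(@R.Field2) - PtrUInt(@R));
  WriteLn('Field3 offset = ', PtrUInt(@R.Field3) - PtrUInt(@R));
  WriteLn('Field4 offset = ', PtrUInt(@R.Field4) - PtrUInt(@R));
end.
```

Any gap between the end of Field1 and the offset of Field2 is padding 
inserted by the compiler.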

I then declared a variable of this type and filled the fields with 
random values, and then ran two different methods to clear their 
memory.  To get a good speed average, I ran each method 1,000,000,000 
times in a for-loop.  The first method was:

FillChar(TestRecord, SizeOf(TestRecord), 0);

The second method was inline assembly language (which I've called 'the 
intrinsic'):

asm
   PXOR   XMM0, XMM0
   MOVDQU [RIP+TestRecord], XMM0
end;

It's not perfect because the presence of inline assembly prevents the 
use of register variables (although TestRecord is always on the stack 
regardless), but the performance hit is barely noticeable in this case, 
and if the assembly language were inserted by the compiler, the register 
variable problem would not arise.
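
A timing loop of this kind can be sketched portably as follows.  This is 
not the harness used for the figures in this message: the iteration 
count is a smaller, hypothetical value, GetTickCount64 is only 
millisecond-accurate, and the second loop uses plain field assignment 
rather than the inline assembly so that it builds on any target.

```pascal
program FillBench;
{$mode objfpc}  // assumption: a mode where Integer is 32 bits wide

uses
  SysUtils;  // provides GetTickCount64

type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;

const
  Iterations = 100000000;  // hypothetical count, chosen to keep the run short

var
  TestRecord: TTestRecord;
  I: LongInt;
  StartMs, ElapsedMs: QWord;

begin
  // Method 1: the general-purpose System routine.
  StartMs := GetTickCount64;
  for I := 1 to Iterations do
    FillChar(TestRecord, SizeOf(TestRecord), 0);
  ElapsedMs := GetTickCount64 - StartMs;
  // Convert total milliseconds to nanoseconds per iteration.
  WriteLn('FillChar time: ', (ElapsedMs * 1.0e6 / Iterations):0:3, ' ns');

  // Method 2: clearing each field by hand.
  StartMs := GetTickCount64;
  for I := 1 to Iterations do
  begin
    TestRecord.Field1 := 0;
    TestRecord.Field2 := 0;
    TestRecord.Field3 := 0;
    TestRecord.Field4 := 0;
  end;
  ElapsedMs := GetTickCount64 - StartMs;
  WriteLn('Manual time:   ', (ElapsedMs * 1.0e6 / Iterations):0:3, ' ns');
end.
```

The absolute numbers will depend on the CPU, the optimisation level and 
the compiler version, so only the ratio between the two is meaningful.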

These are my results:

  FillChar time: 2.398 ns

Field1 = 0
Field2 = 0
Field3 = 0
Field4 = 0

Intrinsic time: 1.336 ns

Field1 = 0
Field2 = 0
Field3 = 0
Field4 = 0

Sure, it's on the order of nanoseconds, but the intrinsic is roughly 
1.8 times as fast (2.398 ns versus 1.336 ns).

In terms of size - FillChar call = 20 bytes:

488d0d22080200           lea 0x20822(%rip),%rcx        # 0x100022010
4531c0                   xor    %r8d,%r8d
ba10000000               mov    $0x10,%edx
e8150a0000               callq  0x100002210 <SYSTEM_$$_FILLCHAR$formal$INT64$BYTE>

The intrinsic = 12 bytes:

660fefc0                 pxor %xmm0,%xmm0
f30f7f05bd050200         movdqu %xmm0,0x205bd(%rip)        # 0x100022010

For a 32-byte record instead, an extra 8-byte MOVDQU instruction would 
be required, so the two would be equal in size (20 bytes each), but with 
the bonus that the intrinsic doesn't have a function call and will 
probably help optimisation in the rest of the procedure by freeing up 
the registers used to pass parameters (%rcx, %rdx and %r8 in this case; 
although the intrinsic will require an XMM register in this x86_64 
example, they tend to not be used as often).  Also, the peephole 
optimizer can remove redundant PXOR XMM0, XMM0 instructions, which will 
help as well if there are multiple FillChar calls.

I'm not proposing a total rewrite, and I would say that in the default 
case, it should just fall back to the in-built System functions, but the 
relevant compiler nodes could be overridden on specific platforms to 
generate smaller, more optimised code when the sizes and values are 
known at compile time.
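
To illustrate the kind of code the compiler could emit when the size and 
value are known, here is a hand-written Pascal equivalent for the 
16-byte record: two 64-bit stores instead of a call.  The names 
ZeroTestRecord, TQWordPair and PQWordPair are hypothetical, and the 
sketch assumes SizeOf(TTestRecord) = 16 and a target that tolerates 
64-bit stores at 4-byte-aligned addresses (true on x86_64, not 
necessarily elsewhere).

```pascal
program WideStoreSketch;
{$mode objfpc}  // assumption: a mode where Integer is 32 bits wide

type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;

  // Hypothetical helper types: view the 16-byte record as two 64-bit words.
  TQWordPair = array[0..1] of QWord;
  PQWordPair = ^TQWordPair;

// Hypothetical helper: clears the record, padding included, with two
// 64-bit stores. Only valid while SizeOf(TTestRecord) = 16.
procedure ZeroTestRecord(var Rec: TTestRecord); inline;
begin
  PQWordPair(@Rec)^[0] := 0;
  PQWordPair(@Rec)^[1] := 0;
end;

var
  R: TTestRecord;

begin
  // Put non-zero values in every field first.
  R.Field1 := 255;
  R.Field2 := -1;
  R.Field3 := -1;
  R.Field4 := -1;

  ZeroTestRecord(R);

  WriteLn('Field1 = ', R.Field1);
  WriteLn('Field2 = ', R.Field2);
  WriteLn('Field3 = ', R.Field3);
  WriteLn('Field4 = ', R.Field4);
end.
```

A compiler-generated version would not need the typecasts and could 
choose the store width per platform, which is the point of handling it 
in the compiler nodes rather than in user code.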

Now, in this example, it is still faster to simply set the fields 
manually one-by-one (clocks in at around 1.2 ns), possibly due to the 
unaligned write (MOVDQU) and internal SSE state switching adding some 
overhead.  Still, there's nothing to stop the compiler from inserting 
code in place of the FillChar call to do just that if it thinks it's the 
fastest method.  Then again, one has to be a little bit careful because 
FillChar and the intrinsic will also set the filler bytes between Field1 
and Field2 to 0, whereas manually assigning 0 to the fields won't.  The 
two aren't strictly equivalent, so field-by-field assignment might only 
be allowed if there are no filler bytes or when compiling under -O4, and 
the latter may still be dangerous where typecasting is concerned.  Extra 
care would also have to be taken where unions are concerned (sorry, 
'union' is a C term - what's the official Pascal term again?).
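
The filler-byte difference can be made visible with a small sketch such 
as the following.  The names DumpBytes, TRawBytes and PRawBytes are 
hypothetical; the record is pre-filled with $FF so that any byte not 
touched by the field assignments stands out in the dump.

```pascal
program PaddingDemo;
{$mode objfpc}  // assumption: a mode where Integer is 32 bits wide

type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;

  // Hypothetical helper types: view the record as raw bytes.
  TRawBytes = array[0..SizeOf(TTestRecord) - 1] of Byte;
  PRawBytes = ^TRawBytes;

// Hypothetical helper: prints every byte of the record in hexadecimal.
procedure DumpBytes(const Title: string; var Rec: TTestRecord);
var
  I: Integer;
begin
  Write(Title, ':');
  for I := 0 to SizeOf(Rec) - 1 do
    Write(' ', HexStr(PRawBytes(@Rec)^[I], 2));
  WriteLn;
end;

var
  R: TTestRecord;

begin
  // Start from a known non-zero pattern, then clear field by field.
  FillChar(R, SizeOf(R), $FF);
  R.Field1 := 0;
  R.Field2 := 0;
  R.Field3 := 0;
  R.Field4 := 0;
  DumpBytes('Fields assigned', R);

  // Same starting pattern, cleared with FillChar instead.
  FillChar(R, SizeOf(R), $FF);
  FillChar(R, SizeOf(R), 0);
  DumpBytes('FillChar       ', R);
end.
```

With the layout described above, the bytes between Field1 and Field2 
would be expected to keep the $FF pattern in the first dump and to be 
zero in the second, which is exactly why the two approaches cannot be 
swapped blindly when the raw memory is later compared or typecast.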

Actual Pascal calls to FillChar would not change in any way and so 
theoretically it won't break existing code.  The only drawback is that 
the intrinsic and the internal System functions would have to be named 
the same so constructs such as "FuncPtr := @FillChar;" as well as 
calling FillChar from assembler routines still work, and the compiler 
would have to know how to differentiate between the two.

Just on the surface, what are your thoughts?

Gareth aka. Kit
