[fpc-devel] Thoughts: Make FillChar etc. an intrinsic for specialised performance potential
J. Gareth Moreton
gareth at moreton-family.com
Sat Apr 16 01:26:52 CEST 2022
Hi everyone,
This is something that sprang to mind when thinking about code speed and
the like, and one thing that cropped up is the initialisation of large
variables such as arrays or records. A common means of doing this is, say:
FillChar(MyVar, SizeOf(MyVar), 0);
To keep things as general-purpose as possible, this usually results in a
function call that decides the best course of action, and for very large
blocks of data whose size may not be deterministic (e.g. a file buffer),
this is the best approach - the overhead is relatively small and it
quickly uses fast block-move instructions.
However, for small-to-mid-sized variables of known size, this can lead
to some inefficiencies, first by not taking into account that the size
of the variable is known, but also because the initialisation value is
zero, more often than not, and the variable is probably aligned on the
stack (so the checks to make sure a pointer is aligned are unnecessary).
I did a proof of concept on x86_64-win64 with the following record:
type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;
SizeOf(TTestRecord) is 16 and all the fields are on 4-byte boundaries.
Nothing particularly special.
I then declared a variable of this type and filled the fields with
random values, and then ran two different methods to clear their
memory. To get a good speed average, I ran each method 1,000,000,000
times in a for-loop. The first method was:
FillChar(TestRecord, SizeOf(TestRecord), 0);
The second method was inline assembly language (which I've called 'the
intrinsic'):
asm
  PXOR XMM0, XMM0
  MOVDQU [RIP+TestRecord], XMM0
end;
It's not perfect because the presence of inline assembly prevents the
use of register variables (although TestRecord always lives in memory
regardless), but the performance hit is barely noticeable in this case,
and if the assembly language were inserted by the compiler, the register
variable problem wouldn't arise.
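For anyone wanting to repeat the measurement, here is a minimal sketch of the kind of timing loop described above. It is not the original test program: the iteration count, the TimeFillChar helper and the use of GetTickCount64 from SysUtils are assumptions made for illustration, and the millisecond timer only gives a meaningful per-call average over a large number of iterations.

```pascal
program FillBench;
{$mode objfpc}

uses
  SysUtils;

type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;

const
  // Assumed iteration count; the test described above used 1,000,000,000.
  Iterations = 100000000;

var
  TestRecord: TTestRecord;

{ Hypothetical helper: calls FillChar Count times on the global record and
  returns the average time per call in nanoseconds. }
function TimeFillChar(Count: LongInt): Double;
var
  I: LongInt;
  Start: QWord;
begin
  Start := GetTickCount64;
  for I := 1 to Count do
    FillChar(TestRecord, SizeOf(TestRecord), 0);
  // Elapsed milliseconds -> nanoseconds, averaged over all calls.
  Result := (GetTickCount64 - Start) * 1.0e6 / Count;
end;

begin
  // Fill the fields with arbitrary non-zero values first.
  TestRecord.Field1 := 123;
  TestRecord.Field2 := Random(MaxInt) + 1;
  TestRecord.Field3 := Random(MaxInt) + 1;
  TestRecord.Field4 := Random(MaxInt) + 1;

  WriteLn('FillChar time: ', TimeFillChar(Iterations):0:3, ' ns');
  WriteLn('Field1 = ', TestRecord.Field1);
  WriteLn('Field2 = ', TestRecord.Field2);
  WriteLn('Field3 = ', TestRecord.Field3);
  WriteLn('Field4 = ', TestRecord.Field4);
end.
```

The same loop shape can be reused for the other method by swapping the FillChar call for the alternative under test; absolute numbers will vary by CPU and compiler settings.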
These are my results:
FillChar time: 2.398 ns
Field1 = 0
Field2 = 0
Field3 = 0
Field4 = 0
Intrinsic time: 1.336 ns
Field1 = 0
Field2 = 0
Field3 = 0
Field4 = 0
Sure, it's on the order of nanoseconds, but the intrinsic is roughly
1.8 times as fast.
In terms of size - FillChar call = 20 bytes:
488d0d22080200 lea 0x20822(%rip),%rcx # 0x100022010
4531c0 xor %r8d,%r8d
ba10000000 mov $0x10,%edx
e8150a0000 callq 0x100002210 <SYSTEM_$$_FILLCHAR$formal$INT64$BYTE>
The intrinsic = 12 bytes:
660fefc0 pxor %xmm0,%xmm0
f30f7f05bd050200 movdqu %xmm0,0x205bd(%rip) # 0x100022010
For a 32-byte record instead, an extra 8-byte MOVDQU instruction would
be required, so the two would be equal in size, but with the bonus that
the intrinsic doesn't have a function call and will probably help
optimisation in the rest of the procedure by freeing up the registers
used to pass parameters (%rcx, %rdx and %r8 in this case; although the
intrinsic will require an XMM register in this x86_64 example, they tend
to not be used as often). Also, the peephole optimizer can remove
redundant PXOR XMM0, XMM0 instructions, which will help as well if there
are multiple FillChar calls.
I'm not proposing a total rewrite, and I would say that in the default
case, it should just fall back to the in-built System functions, but the
relevant compiler nodes could be overridden on specific platforms to
generate smaller, more optimised code when the sizes and values are
known at compile time.
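To make the idea concrete without any assembly, here is a rough Pascal-level sketch of what such an expansion amounts to for a 16-byte block with a zero fill value: two 8-byte stores instead of a call. The TTwoQWords overlay type and the ZeroBlock16 and IsAllZero helpers are hypothetical names for illustration only; a real implementation would live in the compiler's code generator, not in user code. It assumes a target where SizeOf(TTestRecord) is 16 and unaligned 8-byte stores are acceptable.

```pascal
program FillExpansion;
{$mode objfpc}

type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;

  // Hypothetical overlay type with the same 16-byte size as TTestRecord.
  TTwoQWords = array[0..1] of QWord;

{ Zeroes all 16 bytes of the record, filler bytes included, using two
  8-byte stores. This mimics what an inlined FillChar with a known size
  and a zero value could boil down to. }
procedure ZeroBlock16(var Rec: TTestRecord); inline;
begin
  TTwoQWords(Rec)[0] := 0;
  TTwoQWords(Rec)[1] := 0;
end;

{ Returns True when every byte of Rec, padding included, is zero. }
function IsAllZero(const Rec: TTestRecord): Boolean;
var
  Reference: TTestRecord;
begin
  FillChar(Reference, SizeOf(Reference), 0);
  Result := CompareByte(Rec, Reference, SizeOf(Rec)) = 0;
end;

var
  R: TTestRecord;

begin
  // Start from a block of non-zero bytes so the effect is visible.
  FillChar(R, SizeOf(R), $FF);
  ZeroBlock16(R);
  WriteLn('All bytes zero: ', IsAllZero(R));
end.
```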
Now, in this example, it is still faster to simply set the fields
manually one-by-one (clocks in at around 1.2 ns), possibly due to the
unaligned write (MOVDQU) and internal SSE state switching adding some
overhead, but there's nothing to stop the compiler from inserting code
in place of the FillChar call to do just that if it thinks it's the
fastest method. Then again, one has to be a little bit careful because
FillChar and the intrinsic will also set the filler bytes between Field1
and Field2 to 0, whereas manually assigning 0 to the fields won't (so
they aren't strictly equivalent and might only be allowed if there are
no filler bytes or when compiling under -O4, but the latter may still be
dangerous when typecasting is concerned), and extra care would have to
be taken when unions are concerned (sorry, 'union' is a C term -
what's the official Pascal term again?).
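The filler-byte difference is easy to observe. The following sketch, with hypothetical helper names ClearByFields and SameBytes, pre-fills two records with $FF bytes, clears one with FillChar and the other field by field, and then compares them both by field and byte by byte. It assumes the default record layout, where filler bytes sit between Field1 and Field2.

```pascal
program PaddingDemo;
{$mode objfpc}

type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;

{ Hypothetical helper: clears the record by assigning each field. }
procedure ClearByFields(var Rec: TTestRecord);
begin
  Rec.Field1 := 0;
  Rec.Field2 := 0;
  Rec.Field3 := 0;
  Rec.Field4 := 0;
end;

{ Hypothetical helper: True if all fields of A and B match. }
function SameFields(const A, B: TTestRecord): Boolean;
begin
  Result := (A.Field1 = B.Field1) and (A.Field2 = B.Field2) and
            (A.Field3 = B.Field3) and (A.Field4 = B.Field4);
end;

{ Hypothetical helper: True if A and B match byte for byte, filler included. }
function SameBytes(const A, B: TTestRecord): Boolean;
begin
  Result := CompareByte(A, B, SizeOf(TTestRecord)) = 0;
end;

var
  A, B: TTestRecord;

begin
  // Give both records the same non-zero starting contents.
  FillChar(A, SizeOf(A), $FF);
  FillChar(B, SizeOf(B), $FF);

  FillChar(A, SizeOf(A), 0);  // clears fields and filler bytes
  ClearByFields(B);           // clears only the named fields

  WriteLn('Fields equal: ', SameFields(A, B));
  WriteLn('Bytes equal:  ', SameBytes(A, B));
end.
```

With filler bytes present, the field comparison should succeed while the byte comparison should not, which is exactly the case where code relying on CompareByte or on typecasting the record would notice the substitution.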
Actual Pascal calls to FillChar would not change in any way and so
theoretically it won't break existing code. The only drawback is that
the intrinsic and the internal System functions would have to be named
the same so constructs such as "FuncPtr := @FillChar;" as well as
calling FillChar from assembler routines still work, and the compiler
would have to know how to differentiate between the two.
Just on the surface, what are your thoughts?
Gareth a.k.a. Kit