<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body smarttemplateinserted="true">
<div id="smartTemplate4-template">Hi,<br>
</div>
<div><br>
</div>
<div>The compiler could always inline it.</div>
<div><br>
</div>
<div>For small sizes, emit those mov instructions; for large sizes,
emit rep stosb on x86. rep stosb is very fast nowadays: on my Intel
laptop it is faster than FillChar, except for mid sizes such as 128 bytes.<br>
<br>
<br>
Bye,<br>
Benito </div>
<div class="moz-cite-prefix">On 16.04.22 01:26, J. Gareth Moreton
via fpc-devel wrote:<br>
</div>
<blockquote type="cite"
cite="mid:bb86d231-821e-5098-e038-2992d44217b6@moreton-family.com">Hi
everyone,
<br>
<br>
This is something that sprang to mind when thinking about code
speed and the like, and one thing that cropped up is the
initialisation of large variables such as arrays or records. A
common means of doing this is, say:
<br>
<br>
FillChar(MyVar, SizeOf(MyVar), 0);
<br>
<br>
To keep things as general-purpose as possible, this usually
results in a function call that decides the best course of action,
and for very large blocks of data whose size may not be
deterministic (e.g. a file buffer), this is the best approach -
the overhead is relatively small and it quickly uses fast
block-move instructions.
<br>
<br>
However, for small-to-mid-sized variables of known size, this can
lead to some inefficiencies, first by not taking into account that
the size of the variable is known, but also because the
initialisation value is zero, more often than not, and the
variable is probably aligned on the stack (so the checks to make
sure a pointer is aligned are unnecessary).
<br>
<br>
I did a proof of concept on x86_64-win64 with the following
record:
<br>
<br>
type
<br>
TTestRecord = record
<br>
Field1: Byte;
<br>
Field2, Field3, Field4: Integer;
<br>
end;
<br>
<br>
SizeOf(TTestRecord) is 16 and all the fields are on 4-byte
boundaries. Nothing particularly special.
<br>
<br>
I then declared a variable of this type and filled the fields with
random values, and then ran two different methods to clear their
memory. To get a good speed average, I ran each method
1,000,000,000 times in a for-loop. The first method was:
<br>
<br>
FillChar(TestRecord, SizeOf(TestRecord), 0);
<br>
<br>
The second method was inline assembly language (which I've called
'the intrinsic'):
<br>
<br>
asm
<br>
PXOR XMM0, XMM0
<br>
MOVDQU [RIP+TestRecord], XMM0
<br>
end;
<br>
<br>
It's not perfect because the presence of inline assembly prevents
the use of register variables (although TestRecord is always on
the stack regardless), but the performance hit is barely
noticeable in this case, and if the assembly language were
inserted by the compiler, the register variable problem would not
arise.
<br>
<br>
These are my results:
<br>
<br>
FillChar time: 2.398 ns
<br>
<br>
Field1 = 0
<br>
Field2 = 0
<br>
Field3 = 0
<br>
Field4 = 0
<br>
<br>
Intrinsic time: 1.336 ns
<br>
<br>
Field1 = 0
<br>
Field2 = 0
<br>
Field3 = 0
<br>
Field4 = 0
<br>
<br>
Sure, it's on the order of nanoseconds, but the intrinsic is
almost twice as fast.
<br>
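<br>
A timing loop of the kind described above might look roughly like the
following. This is only a sketch, not the program that produced the
figures quoted here: the program name, the Iterations constant and the
use of GetTickCount64 from SysUtils are assumptions, and the results
will depend on the optimisation settings used.
<pre>
program FillCharBench;
{$mode objfpc}
uses
  SysUtils;

type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;

const
  Iterations = 1000000000; { assumed loop count, as in the description }

var
  TestRecord: TTestRecord;
  I: LongInt;
  StartTick, Elapsed: QWord;
begin
  { Method 1: the general-purpose System routine }
  StartTick := GetTickCount64;
  for I := 1 to Iterations do
    FillChar(TestRecord, SizeOf(TestRecord), 0);
  Elapsed := GetTickCount64 - StartTick;
  { Elapsed is in milliseconds; convert to nanoseconds per iteration }
  WriteLn('FillChar time: ', (Elapsed * 1000000.0) / Iterations:0:3, ' ns');

  { Method 2: assign the fields one by one (filler bytes are not touched) }
  StartTick := GetTickCount64;
  for I := 1 to Iterations do
  begin
    TestRecord.Field1 := 0;
    TestRecord.Field2 := 0;
    TestRecord.Field3 := 0;
    TestRecord.Field4 := 0;
  end;
  Elapsed := GetTickCount64 - StartTick;
  WriteLn('Field-by-field time: ', (Elapsed * 1000000.0) / Iterations:0:3, ' ns');
end.
</pre>
The millisecond resolution of GetTickCount64 is coarse, which is why
the loop count needs to be large enough for each run to take seconds.
<br>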
<br>
In terms of size - FillChar call = 20 bytes:
<br>
<br>
488d0d22080200 lea 0x20822(%rip),%rcx #
0x100022010
<br>
4531c0 xor %r8d,%r8d
<br>
ba10000000 mov $0x10,%edx
<br>
e8150a0000 callq 0x100002210
&lt;SYSTEM_$$_FILLCHAR$formal$INT64$BYTE&gt;
<br>
<br>
The intrinsic = 12 bytes:
<br>
<br>
660fefc0 pxor %xmm0,%xmm0
<br>
f30f7f05bd050200 movdqu %xmm0,0x205bd(%rip) #
0x100022010
<br>
<br>
For a 32-byte record instead, an extra 8-byte MOVDQU instruction
would be required, so the two would be equal in size, but with the
bonus that the intrinsic doesn't have a function call and will
probably help optimisation in the rest of the procedure by freeing
up the registers used to pass parameters (%rcx, %rdx and %r8 in
this case; although the intrinsic will require an XMM register in
this x86_64 example, they tend to not be used as often). Also,
the peephole optimizer can remove redundant PXOR XMM0, XMM0
instructions, which will help as well if there are multiple FillChar calls.
<br>
<br>
I'm not proposing a total rewrite, and I would say that in the
default case, it should just fall back to the in-built System
functions, but the relevant compiler nodes could be overridden on
specific platforms to generate smaller, more optimised code when
the sizes and values are known at compile time.
<br>
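<br>
To illustrate the idea at source level (the real change would live in
the compiler's code generation, which is not shown here), a size-based
dispatch with a fallback to FillChar could be sketched as below.
ZeroSmall is a hypothetical helper invented for this example, the set
of special-cased sizes is arbitrary, and whether the compiler actually
inlines it and folds the case statement is not guaranteed.
<pre>
program ZeroDispatch;
{$mode objfpc}

type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;

{ Hypothetical helper: direct 8-byte stores for a few known sizes,
  the general-purpose System routine for everything else.
  Assumes a target such as x86_64 where unaligned stores are allowed. }
procedure ZeroSmall(var Dest; Size: SizeInt); inline;
var
  P: PQWord;
begin
  P := @Dest;
  case Size of
    8:
      P[0] := 0;
    16:
      begin
        P[0] := 0;
        P[1] := 0;
      end;
  else
    FillChar(Dest, Size, 0);
  end;
end;

var
  Rec: TTestRecord;
begin
  Rec.Field1 := 1;
  Rec.Field2 := 2;
  Rec.Field3 := 3;
  Rec.Field4 := 4;
  ZeroSmall(Rec, SizeOf(Rec));
  { All four fields should now read back as zero }
  WriteLn(Rec.Field1, ' ', Rec.Field2, ' ', Rec.Field3, ' ', Rec.Field4);
end.
</pre>
On targets with strict alignment requirements the direct stores would
need an alignment guarantee, which is one reason to keep the fallback
as the default.
<br>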
<br>
Now, in this example, it is still faster to simply set the fields
manually one-by-one (clocks in at around 1.2 ns), possibly due to
the unaligned write (MOVDQU) and internal SSE state switching
adding some overhead, but there's nothing to stop the compiler
from inserting code in place of the FillChar call to do just that
if it thinks it's the fastest method. Then again, one has to be a
little bit careful because FillChar and the intrinsic will also
set the filler bytes between Field1 and Field2 to 0, whereas
manually assigning 0 to the fields won't (so they aren't strictly
equivalent and might only be allowed if there are no filler bytes
or when compiling under -O4, but the latter may still be dangerous
when typecasting is concerned), and extra care would have to be
taken when unions are concerned (sorry, 'union' that's a C term -
what's the official Pascal term again?).
<br>
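<br>
The difference in how filler bytes are treated can be made visible
with a small test program along these lines. It is a sketch that
assumes default record alignment, so that Field1 is followed by three
filler bytes; the Raw overlay and the Dump procedure are names made up
for this example.
<pre>
program PaddingDemo;
{$mode objfpc}

type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;

var
  Rec: TTestRecord;
  { Byte-wise view of the same memory as Rec }
  Raw: array[0..SizeOf(TTestRecord) - 1] of Byte absolute Rec;

procedure Dump(const Title: string);
var
  I: Integer;
begin
  Write(Title, ':');
  for I := Low(Raw) to High(Raw) do
    Write(' ', HexStr(Raw[I], 2));
  WriteLn;
end;

begin
  { Poison every byte, then clear only the declared fields }
  FillChar(Rec, SizeOf(Rec), $FF);
  Rec.Field1 := 0;
  Rec.Field2 := 0;
  Rec.Field3 := 0;
  Rec.Field4 := 0;
  Dump('Field-by-field');

  { Poison again, then clear the whole block including filler bytes }
  FillChar(Rec, SizeOf(Rec), $FF);
  FillChar(Rec, SizeOf(Rec), 0);
  Dump('FillChar');
end.
</pre>
With the layout assumed here, the first dump would be expected to
still show the poison value in the bytes between Field1 and Field2,
while the second dump shows zeros throughout.
<br>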
<br>
Actual Pascal calls to FillChar would not change in any way and so
theoretically it won't break existing code. The only drawback is
that the intrinsic and the internal System functions would have to
be named the same so constructs such as "FuncPtr := @FillChar;" as
well as calling FillChar from assembler routines still work, and
the compiler would have to know how to differentiate between the
two.
<br>
<br>
Just on the surface, what are your thoughts?
<br>
<br>
Gareth aka. Kit
<br>
<br>
<br>
</blockquote>
</body>
</html>