[fpc-devel] x86_64.inc CompareByte
Markus Beth
markus.beth at zkrd.de
Mon Oct 23 00:25:28 CEST 2017
I used two different benchmarks: one for (very) short buffers [1] and one
for rather large buffers [2].
[1]:
var
  key, key2: string;
  res: LongWord;
  i: SizeInt;
begin
  key := 'A';
  key2 := 'A';
  for i := 0 to 1000000000 do begin
    res := CompareByte(key[1], key2[1], Length(key));
  end;
end.
[2]:
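var
  key, key2: string;
  res: LongWord;
  i: SizeInt;
begin
  SetLength(key, 10240 * 1024);
  SetLength(key2, 10240 * 1024);
  for i := 0 to 10000 do begin
    { CompareByte_RTL is not declared in this listing; presumably it is a
      local copy of the routine under test. }
    res := CompareByte_RTL(key[1], key2[1], Length(key));
  end;
end.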
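The listing for [2] does not initialise the two strings before comparing them, and SetLength does not, as far as I know, guarantee the contents of a newly sized string. Below is a minimal sketch of a variant that fills both buffers identically, assuming the intent is to scan the full 10 MiB on every call. The program name and the constants BufLen and Rounds are illustrative and not from the original post, and the sketch calls the stock CompareByte rather than a local copy.

```pascal
program comparebyte_large;
{$mode objfpc}{$H+}

const
  BufLen = 10240 * 1024;  { 10 MiB, the buffer size used in [2] }
  Rounds = 10000;         { upper loop bound used in [2] }

var
  key, key2: AnsiString;
  res: SizeInt;
  i: SizeInt;
begin
  SetLength(key, BufLen);
  SetLength(key2, BufLen);
  { Give both buffers the same known content so that every call has to
    scan all BufLen bytes before it can report equality. }
  FillChar(key[1], BufLen, Ord('A'));
  FillChar(key2[1], BufLen, Ord('A'));
  res := 0;
  for i := 0 to Rounds do
    res := res or CompareByte(key[1], key2[1], Length(key));
  { Use the result so that it is visibly consumed. }
  WriteLn('res = ', res);
end.
```

With identical buffers every call returns 0, so the program should print "res = 0". Filling the buffers happens once before the loop and should be negligible next to the roughly 100 GB of comparisons.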
The measurements were taken on an Intel Core i5 CPU M520 at 2.40 GHz, which
has the Westmere microarchitecture. The programs were run on an otherwise
idle Linux (openSUSE Tumbleweed) system via
perf stat -e cycles -e instructions -r 3 taskset -c 1 ./comparebyte
after the following setup:
cpupower frequency-set -g performance
echo 0 > /sys/devices/system/cpu/cpufreq/boost   # Westmere
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo   # Ivy Bridge
The output for [1] using the current RTL CompareByte is:
11.336.449.124 cycles ( +- 0,05% )
28.077.280.776 instructions # 2,48 insn per cycle ( +- 0,00% )
4,736782553 seconds time elapsed ( +- 0,05% )
The output for [1] using the x86_64_comparebyte3.patch CompareByte is:
10.293.397.316 cycles ( +- 0,01% )
26.070.305.490 instructions # 2,53 insn per cycle ( +- 0,00% )
4,301081734 seconds time elapsed ( +- 0,01% )
The output for [2] using the current RTL CompareByte is:
325.526.707.243 cycles ( +- 0,31% )
736.237.912.850 instructions # 2,26 insn per cycle ( +- 0,00% )
136,013215979 seconds time elapsed ( +- 0,31% )
The output for [2] using the x86_64_comparebyte3.patch CompareByte is:
224.621.009.410 cycles ( +- 0,95% )
525.832.575.056 instructions # 2,34 insn per cycle ( +- 0,00% )
93,851685247 seconds time elapsed ( +- 0,95% )
I hope to have the corresponding numbers for an Ivy Bridge CPU
tomorrow.
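To make the perf figures above easier to compare, the cycle counts can be normalised per call for [1] and per byte for [2]. The following is a small sketch of that arithmetic. The program and constant names are illustrative, the cycle counts are copied from the perf output above, and the call counts assume the loops run 1000000001 and 10001 times as written (0 to N inclusive). Loop and call overhead is included in all figures.

```pascal
program cyclesperbyte;
{$mode objfpc}

const
  { Benchmark [1]: loop 0..1000000000, one CompareByte call per pass. }
  Calls1       = 1000000001.0;
  Cycles1RTL   = 11336449124.0;
  Cycles1Patch = 10293397316.0;
  { Benchmark [2]: loop 0..10000, each call over 10240*1024 bytes. }
  Calls2       = 10001.0;
  BytesPerCall = 10240.0 * 1024.0;
  Cycles2RTL   = 325526707243.0;
  Cycles2Patch = 224621009410.0;

var
  totalBytes: Double;
begin
  WriteLn('[1] RTL:     ', Cycles1RTL / Calls1:0:2, ' cycles/call');
  WriteLn('[1] patched: ', Cycles1Patch / Calls1:0:2, ' cycles/call');

  totalBytes := Calls2 * BytesPerCall;
  WriteLn('[2] RTL:     ', Cycles2RTL / totalBytes:0:2, ' cycles/byte');
  WriteLn('[2] patched: ', Cycles2Patch / totalBytes:0:2, ' cycles/byte');
  WriteLn('[2] speedup: ', Cycles2RTL / Cycles2Patch:0:2, 'x');
end.
```

This works out to roughly 11.3 versus 10.3 cycles per call for [1], and roughly 3.1 versus 2.1 cycles per byte for [2], a speedup of about 1.45x. If each loop iteration handles one byte, the [2] figures are in line with the 3 versus 2 cycles per iteration mentioned earlier in the thread.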
On 22.10.2017 20:55, Florian Klämpfl wrote:
> Am 21.10.2017 um 01:24 schrieb Markus Beth:
>> Find attached the already announced version of CompareByte.
>>
>
> What benchmark did you use? In my tests it is slightly slower than that one of fpc 3.0.x?
>
> I used the following test program:
>
> var
> buf1,buf2 : array[0..127] of byte;
> pos,len,i,j : longint;
>
> begin
> for i:=1 to 100 do
> begin
> len:=random(100);
> for j:=0 to len-1 do
> begin
> buf1[j]:=random(256);
> buf2[j]:=random(256);
> end;
>
> for j:=0 to random(10) do
> buf2[j]:=buf1[j];
>
> for j:=1 to 1000000 do
> CompareByte(buf1,buf2,len);
> end;
> end.
>
>>
>>
>> On 16.10.2017 23:08, Markus Beth wrote:
>>> On 16.10.2017 22:41, Florian Klämpfl wrote:
>>>>> P.S.: I am currently working on another version of CompareByte that might have a slightly higher
>>>>> latency for very small len but a higher throughput (2 cycles per iteration vs. 3 cycles on an Intel
>>>>> Arrandale CPU (Westmere microarchitecture)). But this would need some more testing and
>>>>> benchmarking.
>>>>> I can come up with it here again if this would be of any interest.
>>>>
>>>> Small lengths in terms of matching string or overall lengths?
>>>
>>> It is small length in terms of matching string as there is some setup work before the loop.
>>>
>>>> BTW: I would really like to see a PCMPSTR based implementation :)
>>> PCMPSTR is (at the moment) out of my scope. I thought PCMPSTR is part of SSE4.2. How would you
>>> deal with Intel core microarchitecture CPUs that don't have it?