[fpc-devel] x86_64.inc CompareByte
Markus Beth
markus.beth at zkrd.de
Mon Oct 23 22:58:16 CEST 2017
Here are the numbers for an Ivy Bridge CPU:
The output for [1] using the current RTL CompareByte is:
9.001.275.281 cycles:u ( +- 0,00% )
28.000.560.462 instructions:u # 3,11 insn per cycle ( +- 0,00% )
2,654735815 seconds time elapsed ( +- 0,00% )
The output for [1] using the x86_64_comparebyte3.patch CompareByte is:
9.002.038.628 cycles:u ( +- 0,01% )
26.000.559.441 instructions:u # 2,89 insn per cycle ( +- 0,00% )
2,655002891 seconds time elapsed ( +- 0,01% )
The output for [2] using the current RTL CompareByte is:
227.941.173.371 cycles:u ( +- 0,00% )
734.077.388.160 instructions:u # 3,22 insn per cycle ( +- 0,00% )
67,215188648 seconds time elapsed ( +- 0,00% )
The output for [2] using the x86_64_comparebyte3.patch CompareByte is:
210.694.292.040 cycles:u ( +- 0,00% )
524.341.215.569 instructions:u # 2,49 insn per cycle ( +- 0,00% )
62,129294243 seconds time elapsed ( +- 0,00% )
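
As a rough cross-check, the counters for [2] can be normalised to the number of bytes compared. The sketch below assumes that both buffers compare equal over their full length (they are never explicitly initialised, so this is an assumption) and that the loop overhead is negligible. Report, Calls and BytesPerCall are names made up for this illustration.

```pascal
program perbyte;

const
  // Benchmark [2]: "for i := 0 to 10000" makes 10001 calls,
  // each over a buffer of 10240 * 1024 bytes.
  Calls        = 10001;
  BytesPerCall = 10240 * 1024;

// Print cycles and instructions per compared byte for one perf run.
procedure Report(const Name: string; Cycles, Instructions: Int64);
var
  TotalBytes: Int64;
begin
  TotalBytes := Int64(Calls) * BytesPerCall;
  WriteLn(Name, ': ',
          (Cycles / TotalBytes):0:2, ' cycles/byte, ',
          (Instructions / TotalBytes):0:2, ' insn/byte');
end;

begin
  // Counter values copied from the perf output in this thread.
  Report('Ivy Bridge, current RTL', 227941173371, 734077388160);
  Report('Ivy Bridge, patched',     210694292040, 524341215569);
  Report('Westmere, current RTL',   325526707243, 736237912850);
  Report('Westmere, patched',       224621009410, 525832575056);
end.
```

Under those assumptions this works out to roughly 2.2 versus 2.0 cycles per byte on Ivy Bridge and roughly 3.1 versus 2.1 on Westmere, with about 7 versus 5 instructions per byte on both.
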
With Florian's benchmark I also observe that the patched version is
slightly slower than the original. But I have no idea why this is so.
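
To see which part of CompareByte Florian's benchmark exercises, the following sketch repeats his buffer setup and reports where the first differing byte is. It is only an illustration: first and total are names introduced here, and the exact values depend on the RTL's random generator.

```pascal
program mismatchpos;

var
  buf1, buf2: array[0..127] of byte;
  len, i, j, first: longint;
  total: Int64;

begin
  total := 0;
  for i := 1 to 100 do
  begin
    // Same buffer setup as in Florian's test program.
    len := random(100);
    for j := 0 to len - 1 do
    begin
      buf1[j] := random(256);
      buf2[j] := random(256);
    end;
    for j := 0 to random(10) do
      buf2[j] := buf1[j];

    // Index of the first differing byte, or len if none differs.
    first := 0;
    while (first < len) and (buf1[first] = buf2[first]) do
      Inc(first);
    total := total + first;
    WriteLn('len=', len:3, '  first mismatch at index ', first);
  end;
  WriteLn('average index of first mismatch: ', (total / 100):0:1);
end.
```

Because at most 10 leading bytes are copied and the rest is random, the first mismatch is expected within roughly the first dozen bytes. The test therefore mostly measures the cost of short comparisons rather than loop throughput. That does not by itself explain a slowdown.
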
On 23.10.2017 00:25, Markus Beth wrote:
> I used 2 different benchmarks. One for (very) short buffers [1] and one
> for rather large buffers [2].
>
> [1]:
> var
> key, key2: string;
> res: LongWord;
> i: SizeInt;
>
> begin
> key := 'A';
> key2 := 'A';
> for i:= 0 to 1000000000 do begin
> res := CompareByte(key[1], key2[1], Length(key));
> end;
> end.
>
> [2]:
> var
> key, key2: string;
> res: LongWord;
> i: SizeInt;
>
> begin
> SetLength(key,10240 * 1024);
> SetLength(key2,10240 * 1024);
> for i:= 0 to 10000 do begin
> res := CompareByte_RTL(key[1], key2[1], Length(key));
> end;
> end.
>
>
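
Note that [2] never writes to the two strings, so what is compared depends on whatever the memory manager returns. A variant that fills both buffers first might look as follows. This is a sketch, not the program that produced the numbers in this thread, and it calls the RTL's CompareByte rather than a renamed copy.

```pascal
program comparebyte2;

var
  key, key2: AnsiString;
  res: SizeInt;
  i: SizeInt;

begin
  SetLength(key, 10240 * 1024);
  SetLength(key2, 10240 * 1024);
  // Give both buffers the same known contents so that every call
  // has to scan all bytes before it can return 0.
  FillChar(key[1], Length(key), Ord('A'));
  FillChar(key2[1], Length(key2), Ord('A'));

  res := 0;
  for i := 0 to 10000 do
    res := res or CompareByte(key[1], key2[1], Length(key));

  // Printing the accumulated result keeps it observable; 0 means
  // all calls reported equal buffers.
  WriteLn(res);
end.
```

With identical contents every call walks the whole 10 MiB buffer, which is the case a throughput measurement is meant to cover.
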
> The measurements were taken on an Intel Core i5 M520 CPU at 2.40 GHz, which
> has the Westmere microarchitecture. The programs are run on an otherwise
> idle Linux (openSUSE Tumbleweed) system via
>
> perf stat -e cycles -e instructions -r 3 taskset -c 1 ./comparebyte
>
> after the following setup:
> cpupower frequency-set -g performance
> echo 0 > /sys/devices/system/cpu/cpufreq/boost (westmere)
> echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo (ivy bridge)
>
>
> The output for [1] using the current RTL CompareByte is:
> 11.336.449.124 cycles ( +- 0,05% )
> 28.077.280.776 instructions # 2,48 insn per cycle ( +- 0,00% )
> 4,736782553 seconds time elapsed ( +- 0,05% )
>
> The output for [1] using the x86_64_comparebyte3.patch CompareByte is:
> 10.293.397.316 cycles ( +- 0,01% )
> 26.070.305.490 instructions # 2,53 insn per cycle ( +- 0,00% )
> 4,301081734 seconds time elapsed ( +- 0,01% )
>
> The output for [2] using the current RTL CompareByte is:
> 325.526.707.243 cycles ( +- 0,31% )
> 736.237.912.850 instructions # 2,26 insn per cycle ( +- 0,00% )
> 136,013215979 seconds time elapsed ( +- 0,31% )
>
> The output for [2] using the x86_64_comparebyte3.patch CompareByte is:
> 224.621.009.410 cycles ( +- 0,95% )
> 525.832.575.056 instructions # 2,34 insn per cycle ( +- 0,00% )
> 93,851685247 seconds time elapsed ( +- 0,95% )
>
>
> I hope to come up with the corresponding numbers for an Ivy Bridge
> CPU tomorrow.
>
>
> On 22.10.2017 20:55, Florian Klämpfl wrote:
>> On 21.10.2017 at 01:24, Markus Beth wrote:
>>> Find attached the already announced version of CompareByte.
>>>
>>
>> What benchmark did you use? In my tests it is slightly slower than
>> the one in FPC 3.0.x.
>>
>> I used the following test program:
>>
>> var
>> buf1,buf2 : array[0..127] of byte;
>> pos,len,i,j : longint;
>>
>> begin
>> for i:=1 to 100 do
>> begin
>> len:=random(100);
>> for j:=0 to len-1 do
>> begin
>> buf1[j]:=random(256);
>> buf2[j]:=random(256);
>> end;
>>
>> for j:=0 to random(10) do
>> buf2[j]:=buf1[j];
>>
>> for j:=1 to 1000000 do
>> CompareByte(buf1,buf2,len);
>> end;
>> end.
>>
>>>
>>>
>>> On 16.10.2017 23:08, Markus Beth wrote:
>>>> On 16.10.2017 22:41, Florian Klämpfl wrote:
>>>>>> P.S.: I am currently working on another version of CompareByte
>>>>>> that might have a slightly higher
>>>>>> latency for very small len but a higher throughput (2 cycles per
>>>>>> iteration vs. 3 cycles on an Intel
>>>>>> Arrandale CPU (Westmere microarchitecture)). But this would need
>>>>>> some more testing and
>>>>>> benchmarking.
>>>>>> I can come up with it here again if this would be of any interest.
>>>>>
>>>>> Small lengths in terms of matching string or overall lengths?
>>>>
>>>> It is a small length in terms of the matching string, as there is
>>>> some setup work before the loop.
>>>>
>>>>> BTW: I would really like to see a PCMPSTR based implementation :)
>>>> PCMPSTR is (at the moment) out of my scope. I thought PCMPSTR is
>>>> part of SSE4.2. How would you
>>>> deal with Intel Core microarchitecture CPUs that don't have it?