[fpc-devel] x86_64.inc CompareByte

Mon Oct 23 00:25:28 CEST 2017

I used two different benchmarks: one for (very) short buffers [1] and one
for rather large buffers [2].

[1]:
var
   key, key2: string;
   res: LongWord;
   i: SizeInt;

begin
   key  := 'A';
   key2 := 'A';
   for i:= 0 to 1000000000 do begin
     res := CompareByte(key[1], key2[1], Length(key));
   end;
end.

[2]:
var
   key, key2: string;
   res: LongWord;
   i: SizeInt;

begin
   SetLength(key,10240 * 1024);
   SetLength(key2,10240 * 1024);
   for i:= 0 to 10000 do begin
     res := CompareByte_RTL(key[1], key2[1], Length(key));
   end;
end.

The measurements were taken on an Intel Core i5 CPU M520 at 2.40GHz, which
has the Westmere microarchitecture. The programs are run on an otherwise
idle Linux (openSUSE Tumbleweed) system via

perf stat -e cycles -e instructions -r 3 taskset -c 1 ./comparebyte

after the following setup:
  cpupower frequency-set -g performance
  echo 0 > /sys/devices/system/cpu/cpufreq/boost (westmere)
  echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo (ivy bridge)

The output for [1] using the current RTL CompareByte is:
  11.336.449.124   cycles                      ( +-  0,05% )
  28.077.280.776   instructions   #   2,48  insn per cycle ( +-  0,00% )
   4,736782553 seconds time elapsed            ( +-  0,05% )

The output for [1] using the x86_64_comparebyte3.patch CompareByte is:
  10.293.397.316   cycles                      ( +-  0,01% )
  26.070.305.490   instructions   #   2,53  insn per cycle ( +-  0,00% )
   4,301081734 seconds time elapsed            ( +-  0,01% )

The output for [2] using the current RTL CompareByte is:
325.526.707.243   cycles                      ( +-  0,31% )
736.237.912.850   instructions   #   2,26  insn per cycle ( +-  0,00% )
136,013215979 seconds time elapsed            ( +-  0,31% )

The output for [2] using the x86_64_comparebyte3.patch CompareByte is:
224.621.009.410   cycles                      ( +-  0,95% )
525.832.575.056   instructions   #   2,34  insn per cycle ( +-  0,00% )
  93,851685247 seconds time elapsed            ( +-  0,95% )
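
As a rough cross-check, the counters for [2] can be normalized to a cost per
compared byte. The sketch below assumes that every call scans the whole
buffer (i.e. both buffers compare equal over their full length) and that the
loop overhead is negligible; the program name and the PerByte helper are made
up for illustration. The input numbers are the perf values quoted above.

```pascal
program perbyte;
{$mode objfpc}

// Normalizes the perf counters of benchmark [2] to a per-byte cost.
// Assumption: each of the 10001 calls (for i := 0 to 10000) compares
// all 10240 * 1024 bytes, so the total work is Calls * BufSize bytes.

const
  TotalBytes = 10001.0 * 10240.0 * 1024.0;  // calls * bytes per call

// Hypothetical helper: divides a raw counter value by the total byte count.
function PerByte(Count: Double): Double;
begin
  Result := Count / TotalBytes;
end;

begin
  WriteLn('RTL   cycles/byte:       ', PerByte(325526707243.0):0:2);
  WriteLn('patch cycles/byte:       ', PerByte(224621009410.0):0:2);
  WriteLn('RTL   instructions/byte: ', PerByte(736237912850.0):0:2);
  WriteLn('patch instructions/byte: ', PerByte(525832575056.0):0:2);
  // Relative saving in cycles of the patched version versus the RTL one.
  WriteLn('cycles saved:            ',
          (1.0 - 224621009410.0 / 325526707243.0) * 100.0:0:1, ' %');
end.
```

Under these assumptions this works out to roughly 3.1 vs. 2.1 cycles per byte
and about 7 vs. 5 instructions per byte, i.e. about 31% fewer cycles for the
patched version on large buffers. For [1] the difference is much smaller
(about 9% fewer cycles), as the call overhead dominates there.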

I hope to have the corresponding numbers for an Ivy Bridge CPU tomorrow.

On 22.10.2017 20:55, Florian Klämpfl wrote:
> On 21.10.2017 at 01:24, Markus Beth wrote:
>> Please find attached the previously announced version of CompareByte.
>>
> 
> What benchmark did you use? In my tests it is slightly slower than the one in FPC 3.0.x.
> 
> I used the following test program:
> 
> var
>    buf1,buf2 : array[0..127] of byte;
>    pos,len,i,j : longint;
> 
> begin
>    for i:=1 to 100 do
>      begin
>        len:=random(100);
>        for j:=0 to len-1 do
>          begin
>            buf1[j]:=random(256);
>            buf2[j]:=random(256);
>          end;
> 
>        for j:=0 to random(10) do
>          buf2[j]:=buf1[j];
> 
>        for j:=1 to 1000000 do
>          CompareByte(buf1,buf2,len);
>      end;
> end.
> 
>>
>>
>> On 16.10.2017 23:08, Markus Beth wrote:
>>> On 16.10.2017 22:41, Florian Klämpfl wrote:
>>>>> P.S.: I am currently working on another version of CompareByte that might have a slightly higher
>>>>> latency for very small len but a higher throughput (2 cycles per iteration vs. 3 cycles on an Intel
>>>>> Arrandale CPU (Westmere microarchitecture)). But this would need some more testing and
>>>>> benchmarking.
>>>>> I can come up with it here again if this would be of any interest.
>>>>
>>>> Small lengths in terms of matching string or overall lengths?
>>>
>>> Small in terms of the matching length, since there is some setup work before the loop.
>>>
>>>> BTW: I would really like to see a PCMPSTR based implementation :)
>>> PCMPSTR is (at the moment) out of my scope. I thought PCMPSTR is part of SSE4.2. How would you
>>> deal with Intel Core microarchitecture CPUs that don't have it?