[fpc-devel] x86_64.inc CompareByte

Mon Oct 23 22:58:16 CEST 2017

Here are the numbers for an Ivy Bridge CPU:
The output for [1] using the current RTL CompareByte is:
   9.001.275.281   cycles:u                    ( +-  0,00% )
  28.000.560.462   instructions:u #   3,11  insn per cycle ( +-  0,00% )
   2,654735815 seconds time elapsed            ( +-  0,00% )

The output for [1] using the x86_64_comparebyte3.patch CompareByte is:
   9.002.038.628   cycles:u                    ( +-  0,01% )
  26.000.559.441   instructions:u #   2,89  insn per cycle ( +-  0,00% )
   2,655002891 seconds time elapsed            ( +-  0,01% )

The output for [2] using the current RTL CompareByte is:
 227.941.173.371   cycles:u                    ( +-  0,00% )
 734.077.388.160   instructions:u #   3,22  insn per cycle ( +-  0,00% )
  67,215188648 seconds time elapsed            ( +-  0,00% )

The output for [2] using the x86_64_comparebyte3.patch CompareByte is:
 210.694.292.040   cycles:u                    ( +-  0,00% )
 524.341.215.569   instructions:u #   2,49  insn per cycle ( +-  0,00% )
  62,129294243 seconds time elapsed            ( +-  0,00% )

With Florian's benchmark I also observe that the patched version is
slightly slower than the original, but I have no idea why.
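
As a rough cross-check, the totals for [2] can be turned into per-byte
costs. The sketch below assumes that every call scans the whole buffer
(i.e. the two buffers compare equal) and that the loop overhead is
negligible; the names Iterations, BufferBytes and Report are only
illustrative, and the counter values are copied from the perf output in
this thread.

```pascal
program PerByteCost;

const
  { Benchmark [2]: "for i := 0 to 10000" runs 10001 times }
  { over a buffer of 10240 * 1024 bytes.                  }
  Iterations  = 10001;
  BufferBytes = 10240 * 1024;

{ Assumption: each CompareByte call touches all BufferBytes bytes. }
procedure Report(const Name: string; Cycles, Instructions: Double);
var
  TotalBytes: Double;
begin
  TotalBytes := 1.0 * Iterations * BufferBytes;
  WriteLn(Name, ': ',
          Cycles / TotalBytes:0:2, ' cycles/byte, ',
          Instructions / TotalBytes:0:2, ' instructions/byte');
end;

begin
  { cycles and instructions as reported by perf stat for [2] }
  Report('ivy bridge, RTL    ', 227941173371.0, 734077388160.0);
  Report('ivy bridge, patched', 210694292040.0, 524341215569.0);
  Report('westmere,   RTL    ', 325526707243.0, 736237912850.0);
  Report('westmere,   patched', 224621009410.0, 525832575056.0);
end.
```

Under these assumptions the instruction count works out to roughly 7
per byte for the RTL version and roughly 5 per byte for the patch on
both CPUs, so the difference between the two machines is in cycles per
byte rather than in the amount of work executed.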

On 23.10.2017 00:25, Markus Beth wrote:
> I used 2 different benchmarks. One for (very) short buffers [1] and one
> for rather large buffers [2].
> 
> [1]:
> var
>    key, key2: string;
>    res: LongWord;
>    i: SizeInt;
> 
> begin
>    key  := 'A';
>    key2 := 'A';
>    for i:= 0 to 1000000000 do begin
>      res := CompareByte(key[1], key2[1], Length(key));
>    end;
> end.
> 
> [2]:
> var
>    key, key2: string;
>    res: LongWord;
>    i: SizeInt;
> 
> begin
>    SetLength(key,10240 * 1024);
>    SetLength(key2,10240 * 1024);
>    for i:= 0 to 10000 do begin
>      res := CompareByte_RTL(key[1], key2[1], Length(key));
>    end;
> end.
> 
> 
> The measurements were taken on an Intel Core i5 CPU M520 at 2.40GHz, which
> has the Westmere microarchitecture. The programs are run on an otherwise
> idle Linux (openSUSE Tumbleweed) system via
> 
> perf stat -e cycles -e instructions -r 3 taskset -c 1 ./comparebyte
> 
> after the following setup:
>   cpupower frequency-set -g performance
>   echo 0 > /sys/devices/system/cpu/cpufreq/boost (westmere)
>   echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo (ivy bridge)
> 
> 
> The output for [1] using the current RTL CompareByte is:
>   11.336.449.124   cycles                      ( +-  0,05% )
>   28.077.280.776   instructions   #   2,48  insn per cycle ( +-  0,00% )
>    4,736782553 seconds time elapsed            ( +-  0,05% )
> 
> The output for [1] using the x86_64_comparebyte3.patch CompareByte is:
>   10.293.397.316   cycles                      ( +-  0,01% )
>   26.070.305.490   instructions   #   2,53  insn per cycle ( +-  0,00% )
>    4,301081734 seconds time elapsed            ( +-  0,01% )
> 
> The output for [2] using the current RTL CompareByte is:
> 325.526.707.243   cycles                      ( +-  0,31% )
> 736.237.912.850   instructions   #   2,26  insn per cycle ( +-  0,00% )
> 136,013215979 seconds time elapsed            ( +-  0,31% )
> 
> The output for [2] using the x86_64_comparebyte3.patch CompareByte is:
> 224.621.009.410   cycles                      ( +-  0,95% )
> 525.832.575.056   instructions   #   2,34  insn per cycle ( +-  0,00% )
>   93,851685247 seconds time elapsed            ( +-  0,95% )
> 
> 
> I hope to come up with the corresponding numbers for an Ivy Bridge
> CPU tomorrow.
> 
> 
> On 22.10.2017 20:55, Florian Klämpfl wrote:
>> Am 21.10.2017 um 01:24 schrieb Markus Beth:
>>> Find attached the already announced version of CompareByte.
>>>
>>
>> What benchmark did you use? In my tests it is slightly slower than 
>> the one in FPC 3.0.x.
>>
>> I used the following test program:
>>
>> var
>>    buf1,buf2 : array[0..127] of byte;
>>    pos,len,i,j : longint;
>>
>> begin
>>    for i:=1 to 100 do
>>      begin
>>        len:=random(100);
>>        for j:=0 to len-1 do
>>          begin
>>            buf1[j]:=random(256);
>>            buf2[j]:=random(256);
>>          end;
>>
>>        for j:=0 to random(10) do
>>          buf2[j]:=buf1[j];
>>
>>        for j:=1 to 1000000 do
>>          CompareByte(buf1,buf2,len);
>>      end;
>> end.
>>
>>>
>>>
>>> On 16.10.2017 23:08, Markus Beth wrote:
>>>> On 16.10.2017 22:41, Florian Klämpfl wrote:
>>>>>> P.S.: I am currently working on another version of CompareByte 
>>>>>> that might have a slightly higher
>>>>>> latency for very small len but a higher throughput (2 cycles per 
>>>>>> iteration vs. 3 cycles on an Intel
>>>>>> Arrandale CPU (Westmere microarchitecture)). But this would need 
>>>>>> some more testing and
>>>>>> benchmarking.
>>>>>> I can come up with it here again if this would be of any interest.
>>>>>
>>>>> Small lengths in terms of matching string or overall lengths?
>>>>
>>>> It is small length in terms of matching string as there is some 
>>>> setup work before the loop.
>>>>
>>>>> BTW: I would really like to see a PCMPSTR based implementation :)
>>>> PCMPSTR is (at the moment) out of my scope. I thought PCMPSTR is 
>>>> part of SSE4.2. How would you
>>>> deal with Intel Core microarchitecture CPUs that don't have it?