[fpc-devel] x86_64.inc CompareByte

Sun Oct 29 23:18:28 CET 2017

On 23.10.2017 at 22:58, Markus Beth wrote:
> Here are the numbers for an Ivy Bridge CPU:
> The output for [1] using the current RTL CompareByte is:
>   9.001.275.281   cycles:u                    ( +-  0,00% )
>  28.000.560.462   instructions:u #   3,11  insn per cycle ( +-  0,00% )
>   2,654735815 seconds time elapsed            ( +-  0,00% )
> 
> The output for [1] using the x86_64_comparebyte3.patch CompareByte is:
>   9.002.038.628   cycles:u                    ( +-  0,01% )
>  26.000.559.441   instructions:u #   2,89  insn per cycle ( +-  0,00% )
>   2,655002891 seconds time elapsed            ( +-  0,01% )
> 
> The output for [2] using the current RTL CompareByte is:
> 227.941.173.371   cycles:u                    ( +-  0,00% )
> 734.077.388.160   instructions:u #   3,22  insn per cycle ( +-  0,00% )
>  67,215188648 seconds time elapsed            ( +-  0,00% )
> 
> The output for [2] using the x86_64_comparebyte3.patch CompareByte is:
> 210.694.292.040   cycles:u                    ( +-  0,00% )
> 524.341.215.569   instructions:u #   2,49  insn per cycle ( +-  0,00% )
>  62,129294243 seconds time elapsed            ( +-  0,00% )
> 
> 
> With Florian's benchmark I also observe that the patched version is
> slightly slower than the original. But I have no idea why this is so.

I have committed your latest patch with a few changes: the loop entry is now aligned to 16 bytes, and I
used movb instead of movzbl and inc instead of add. For me (Haswell CPU) this works better, and I also
think these changes are better on average.
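
For anyone who wants to cross-check the perf figures quoted above, here is a minimal sketch that recomputes instructions per cycle and the relative change between the current RTL and the patched CompareByte. The counter values are copied from the quoted output with the thousands separators removed; the names RUNS, ipc and rel_change are made up for this illustration and are not part of any tool.

```python
# Counters copied from the quoted perf output (thousands separators removed).
# RUNS, ipc and rel_change are hypothetical names used only for this sketch.
RUNS = {
    ("[1]", "current"): {"cycles": 9_001_275_281, "instructions": 28_000_560_462},
    ("[1]", "patched"): {"cycles": 9_002_038_628, "instructions": 26_000_559_441},
    ("[2]", "current"): {"cycles": 227_941_173_371, "instructions": 734_077_388_160},
    ("[2]", "patched"): {"cycles": 210_694_292_040, "instructions": 524_341_215_569},
}


def ipc(run):
    """Instructions retired per cycle for one measurement."""
    return run["instructions"] / run["cycles"]


def rel_change(old, new):
    """Relative change from old to new in percent (negative means fewer)."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    for bench in ("[1]", "[2]"):
        cur = RUNS[(bench, "current")]
        pat = RUNS[(bench, "patched")]
        print(
            f"{bench}: IPC {ipc(cur):.2f} -> {ipc(pat):.2f}, "
            f"cycles {rel_change(cur['cycles'], pat['cycles']):+.2f}%, "
            f"instructions {rel_change(cur['instructions'], pat['instructions']):+.2f}%"
        )
```

By these numbers, test [2] needs about 7.6% fewer cycles with about 28.6% fewer instructions, while test [1] retires about 7.1% fewer instructions in practically the same number of cycles. A lower instruction count alone therefore does not translate into a proportional speedup here.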