[fpc-devel] x86_64.inc CompareByte

Mon Oct 16 22:33:13 CEST 2017

Sorry for the late reply. I had a weekend off(line).

The instructions were chosen on purpose, and Sergey already cited the 
part of the Intel documentation that explains why this is correct. A 
similar passage appears in AMD's "AMD64 Architecture Programmer’s Manual 
Volume 1: Application Programming":

 > 3.4.5 High 32 Bits
 > In 64-bit mode, the following rules apply to extension of results into
 > the high 32 bits when results smaller than 64 bits are written:
 >
 > * Zero-Extension of 32-Bit Results: 32-bit results are zero-extended
 >   into the high 32 bits of 64-bit GPR destination registers.

I think other x86_64 CPU manufacturers also adhere to this rule, since 
gcc relies on it as well.
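
As a rough illustration of the rule, here is a small C model of what a 
write to a 64-bit GPR does in 64-bit mode. It is only a sketch: the 
function names are made up for this example, and the 8-bit/16-bit cases 
are added for contrast (they leave the upper bits of the register 
unchanged).

```c
#include <stdint.h>

/* Hypothetical model, not FPC code: new register contents after a
   32-bit result is written. The old value does not survive; the
   result is zero-extended into the high 32 bits. */
static uint64_t write_r32(uint64_t old, uint32_t result)
{
    (void)old;
    return (uint64_t)result;
}

/* For contrast: a 16-bit result replaces only bits 0..15. */
static uint64_t write_r16(uint64_t old, uint16_t result)
{
    return (old & ~UINT64_C(0xFFFF)) | result;
}

/* For contrast: an 8-bit (low byte) result replaces only bits 0..7. */
static uint64_t write_r8(uint64_t old, uint8_t result)
{
    return (old & ~UINT64_C(0xFF)) | result;
}
```

In this model, xorl %eax,%eax corresponds to write_r32(rax, 0), which 
clears all 64 bits regardless of what was in the register before.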

I generally prefer instructions operating on 32-bit operands over 
those operating on 64-bit operands where appropriate, because they are 
typically encoded in fewer bytes: they do not need a REX prefix.
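
To make the size difference concrete, the sketch below lists one valid 
encoding for two of the instructions from this thread next to their 
64-bit operand-size counterparts. The array names and the helper 
has_rex_w are invented for this example, and an assembler may pick a 
different but equivalent encoding.

```c
#include <stddef.h>

/* One valid machine-code encoding per instruction (AT&T syntax in the
   comments). The 64-bit forms differ only by the leading 0x48 byte,
   which is a REX prefix with the W bit set. */
static const unsigned char xorl_eax_eax[]   = { 0x31, 0xC0 };             /* xorl %eax,%eax     */
static const unsigned char xorq_rax_rax[]   = { 0x48, 0x31, 0xC0 };       /* xorq %rax,%rax     */
static const unsigned char movzbl_rcx_eax[] = { 0x0F, 0xB6, 0x01 };       /* movzbl (%rcx),%eax */
static const unsigned char movzbq_rcx_rax[] = { 0x48, 0x0F, 0xB6, 0x01 }; /* movzbq (%rcx),%rax */

/* REX prefixes occupy 0x40..0x4F; bit 3 (W) selects 64-bit operand
   size. This check assumes no legacy prefix precedes the REX byte. */
static int has_rex_w(const unsigned char *insn)
{
    return (insn[0] & 0xF8) == 0x48;
}
```

Each 32-bit form here is one byte shorter than its 64-bit counterpart. 
Note that 32-bit instructions touching r8d..r15d still need a REX 
prefix, hence "typically".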

I have updated the patch (attached) to include a code path for 
'oldbinutils', as Gareth suggested. In addition, I switched the tails 
(.LCmpbyteZero and .LCmpbyteExitFast): when we leave the loop because 
the loop count has reached zero, we already know that the last bytes 
were the same, so there is no need to subq them.
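
The idea behind the two exits can be sketched in C as follows. This is 
a simplified model and not the code from the patch; the function name 
and the return type are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a byte-compare loop with two separate exits. */
static intptr_t compare_byte_model(const unsigned char *a,
                                   const unsigned char *b,
                                   size_t len)
{
    while (len != 0) {
        unsigned int x = *a++;  /* zero-extended byte load, like movzbl */
        unsigned int y = *b++;
        if (x != y) {
            /* Exit on mismatch: the result is the byte difference. */
            return (intptr_t)x - (intptr_t)y;
        }
        len--;
    }
    /* Exit on exhausted count: every byte matched, so the result is
       known to be zero and no subtraction is needed. */
    return 0;
}
```

Only the mismatch exit has to compute a difference; the exit taken when 
the count runs out can return zero directly.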

Markus

P.S.: I am currently working on another version of CompareByte that 
might have a slightly higher latency for very small len but a higher 
throughput (2 cycles per iteration vs. 3 cycles on an Intel Arrandale 
CPU (Westmere microarchitecture)). It would need more testing and 
benchmarking, though. I can bring it up here again if there is any 
interest.

On 16.10.2017 19:41, Сергей Сергеенко wrote:
> On 15 Oct 2017 Florian Klämpfl wrote:
>> I had a look and tested it and it worked, I didn't notice the problem below
>> either.
> 
> Sorry for the false warning. I cannot provide any example where my
> suggestions hold. The reason is described on page Vol. 1 3-13 of the Intel 64
> and IA-32 Architectures Software Developer's Manual:
> 
>> When in 64-bit mode, operand size determines the number of valid bits in
>> the destination general-purpose register:
>>
>> [...]
>>
>>   32-bit operands generate a 32-bit result, zero-extended to a 64-bit
>>   result in the destination general-purpose register.
>>   
>> [...]
> 
> So, instructions
>>      movzbl  (%rcx),%eax
> and
>>       movzbl  -1(%rdx),%ecx
> and
>>       xorl    %eax,%eax
> should put zero into 32 high bits of appropriate registers.
> 
>> I think also the final xor should be a xorq %rax,%rax, right?
> 
> As I said above xorl %eax, %eax should be enough.
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: x86_64_comparebyte2.patch
Type: text/x-patch
Size: 1090 bytes
Desc: not available
URL: <http://lists.freepascal.org/pipermail/fpc-devel/attachments/20171016/f93a8550/attachment.bin>