[fpc-devel] x86_64 question

Fri Oct 2 13:13:32 CEST 2020

Confirmed my suspicions.  If I zero the upper bits of the register (I 
used something akin to "AND RCX, $F"), there is no speed loss.

Therefore, my hypothesis is that, on my Intel(R) Core(TM) i7-10750H, 
using TEST on a sub-register causes a false dependency if the bits 
outside the sub-register are not zero, even though the register isn't 
being modified.

Gareth aka. Kit

On 02/10/2020 11:57, J. Gareth Moreton via fpc-devel wrote:
> So... I've done some tests, replacing TEST RCX, $4 with TEST CL, $4 
> and the like in a number-crunching function, and it seems to cause a 
> notable penalty, even though none of the instructions are in my 
> critical loop.  So I think it's something that needs to be avoided in 
> most cases.  I think the reason why it worked in my Int and Frac 
> functions is because the processor knows the upper 48 bits of the 
> register are zero.
>
> Long story short... best not to do it unless you have some additional 
> insight into what the registers contain.
>
> Gareth aka. Kit
>
>
> On 02/10/2020 08:15, J. Gareth Moreton via fpc-devel wrote:
>> Ah brilliant, thank you.
>>
>> I have used Agner Fog's material before for cycle counting. When I 
>> implemented my 3 MOV -> XCHG optimisation 
>> (https://bugs.freepascal.org/view.php?id=36511), I used Agner Fog's 
>> empirical results to determine when it's best to apply this 
>> optimisation where speed is concerned.  On a lot of older processors, 
>> it's not worth it because XCHG took 3 cycles and the 3 MOVs generally 
>> took only 2 (due to how the dependency chain is set up).  Only when 
>> XCHG's cycle count dropped to 1 or 2, or when optimising for size, 
>> does it pay off.
>>
>> So it looks like a partial read of the lower bits is absolutely fine, 
>> since you're not changing anything.
>>
>> Gareth aka. Kit
>>
>> On 02/10/2020 01:40, Nikolay Nikolov via fpc-devel wrote:
>>>
>>> On 10/1/20 11:36 PM, J. Gareth Moreton via fpc-devel wrote:
>>>> I thought that might be the case - thanks Nikolay.  And I meant to 
>>>> say lower bits of a REGISTER, not an instruction!
>>>>
>>>> Admittedly I'm cycle-counting and byte-counting again!  I was 
>>>> looking for ways to reduce 13 bytes of padding in one of my pure 
>>>> assembly language routines and realised I could make a saving 
>>>> there.  The only thing I can think of that I have to watch out for 
>>>> logically is if I change, say, TEST EAX, $80 to TEST AL, $80: the 
>>>> latter will set the sign flag if the most-significant bit is 1 
>>>> (after the 'and' operation), while the former always clears the 
>>>> sign flag.
>>>>
>>>> I have used such subregisters before in the FPC RTL, in 
>>>> fpc_int_real and fpc_frac_real in rtl/x86_64/math.inc, where I read 
>>>> AX instead of the larger RAX, but that's only after a "SHR 
>>>> RAX, 48" that guarantees that everything above the 16th bit is 
>>>> zero, and after testing other implementation candidates in a kind 
>>>> of informal competition. (Surprisingly, I think "shr $48, %rax; and 
>>>> $0x7ff0,%ax; cmp $0x4330,%ax" runs faster than moving 64-bit 
>>>> constants into temporary registers (since 64-bit immediates aren't 
>>>> supported outside of MOV) and using 'and' and 'cmp' on %rax directly.)
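
As a rough illustration of what the quoted shr/and/cmp sequence computes, here is a Python sketch that applies the same three steps to the bit pattern of a double. The function name exponent_check and the direction of the comparison are assumptions made for illustration; how the RTL routines branch on the result is not shown in this thread.

```python
import struct


def exponent_check(x):
    """Mirror shr $48 / and $0x7ff0 / cmp $0x4330 on the bits of a double."""
    # Reinterpret the IEEE 754 binary64 value as a 64-bit unsigned integer.
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    ax = bits >> 48    # shr $48, %rax: sign, exponent, top 4 mantissa bits
    ax &= 0x7FF0       # and $0x7ff0, %ax: keep only the 11 exponent bits
    # 0x433 is the biased exponent 1075 = 1023 + 52, i.e. |x| >= 2**52,
    # where a double can no longer hold a fractional part.
    return ax >= 0x4330  # cmp $0x4330, %ax (assumed "at or above" branch)


for sample in (1.5, 2.0 ** 52 - 1, 2.0 ** 52, -(2.0 ** 53)):
    print(sample, exponent_check(sample))
```

Infinities and NaNs carry the maximum exponent field, so under this sketch they also land on the "at or above" side of the comparison.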
>>>>
>>>> I think you always get a read penalty when using the high-byte 
>>>> registers because the processor has to do an implicit shift operation.
>>>
>>> I don't remember the reason, but I recall reading they are less 
>>> efficient in Agner Fog's optimization manual. Here's the relevant 
>>> quote:
>>>
>>> "Any use of the high 8-bit registers AH, BH, CH, DH should be 
>>> avoided because it can cause false dependences and less efficient 
>>> code."
>>>
>>> It's from the chapter "Partial registers" (page 61) of this document:
>>>
>>> https://www.agner.org/optimize/optimizing_assembly.pdf
>>>
>>> Highly recommended reading, as it addresses exactly the topic of 
>>> partial registers. In general, it is the partial register writes of 
>>> 16-bit or 8-bit subregisters that cause problems - either false read 
>>> dependencies (usually on AMD) or extra penalties for 
>>> joining/splitting registers (on Intel, at least in the P6 era).
>>>
>>> Best regards,
>>>
>>> Nikolay
>>>
>>> _______________________________________________
>>> fpc-devel maillist  -  fpc-devel at lists.freepascal.org
>>> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
>>>
