[fpc-devel] x86_64 question

Fri Oct 2 09:15:39 CEST 2020

Ah brilliant, thank you.

I have used Agner Fog's material before for cycle counting.  When I 
implemented my 3 MOV -> XCHG optimisation 
(https://bugs.freepascal.org/view.php?id=36511), I used Agner Fog's 
empirical results to determine when it's best to apply this optimisation 
where speed is concerned.  On a lot of older processors it's not worth 
it, because XCHG took 3 cycles and the 3 MOVs generally took only 2 (due 
to how the dependency chain is set up).  Only when XCHG's cycle count 
dropped to 1 or 2, or when optimising for size, does it pay off.

So it looks like a partial read of the lower bits is absolutely fine, 
since reading a subregister doesn't modify the full register.

Gareth aka. Kit

On 02/10/2020 01:40, Nikolay Nikolov via fpc-devel wrote:
>
> On 10/1/20 11:36 PM, J. Gareth Moreton via fpc-devel wrote:
>> I thought that might be the case - thanks Nikolay.  And I meant to 
>> say lower bits of a REGISTER, not an instruction!
>>
>> Admittedly I'm cycle-counting and byte-counting again!  I was looking 
>> for ways to reduce 13 bytes of padding in one of my pure assembly 
>> language routines and realised I could make a saving there.  The only 
>> thing I can think of that I have to watch out for logically is if I 
>> change, say, TEST EAX, $80 to TEST AL, $80: the latter will set the 
>> sign flag (if the most-significant bit is 1 after the 'and' operation) 
>> while the former always clears the sign flag.
>>
>> I have used such subregisters before in the FPC RTL, in fpc_int_real 
>> and fpc_frac_real in rtl/x86_64/math.inc, where I read AX instead of 
>> the larger RAX, but that's only after a call to "SHR RAX, 48" that 
>> guarantees that everything above the 16th bit is zero, and after 
>> testing other implementation candidates in a kind of informal 
>> competition. (Surprisingly, I think "shr $48, %rax; and $0x7ff0,%ax; 
>> cmp $0x4330,%ax" runs faster than moving 64-bit constants into 
>> temporary registers (since 64-bit immediates aren't supported outside 
>> of MOV) and using 'and' and 'cmp' on %rax directly.)
>>
>> I think you always get a read penalty when using the high-byte 
>> registers because the processor has to do an implicit shift operation.
>
> I don't remember the reason, but I recall reading they are less 
> efficient in Agner Fog's optimization manual. Here's the relevant quote:
>
> "Any use of the high 8-bit registers AH, BH, CH, DH should be avoided 
> because it can cause false dependences and less efficient code."
>
> It's from the chapter "Partial registers" (page 61) of this document:
>
> https://www.agner.org/optimize/optimizing_assembly.pdf
>
> Highly recommended reading, as it addresses exactly the topic of 
> partial registers. In general, it is the partial register writes of 
> 16-bit or 8-bit subregisters that cause problems - either false read 
> dependencies (usually on AMD) or extra penalties for joining/splitting 
> registers (on Intel, at least in the P6 era).
>
> Best regards,
>
> Nikolay