[fpc-devel] x86_64 question
Nikolay Nikolov
nickysn at gmail.com
Fri Oct 2 02:40:15 CEST 2020
On 10/1/20 11:36 PM, J. Gareth Moreton via fpc-devel wrote:
> I thought that might be the case - thanks Nikolay. And I meant to say
> lower bits of a REGISTER, not an instruction!
>
> Admittedly I'm cycle-counting and byte-counting again! I was looking
> for ways to reduce 13 bytes of padding in one of my pure assembly
> language routines and realised I could make a saving there. The only
> thing I can think of that I have to watch out for logically is if I
> change, say, TEST EAX, $80 to TEST AL, $80, the latter will set the
> sign flag if the most-significant bit is 1 (after the 'and' operation),
> while the former always clears the sign flag.
>
> I have used such subregisters before in the FPC RTL, in fpc_int_real
> and fpc_frac_real in rtl/x86_64/math.inc, where I read AX instead of
> the larger RAX, but that's only after a call to "SHR RAX, 48" that
> guarantees that everything above the 16th bit is zero, and after
> testing other implementation candidates in a kind of informal
> competition. (Surprisingly, I think "shr $48, %rax; and $0x7ff0,%ax;
> cmp $0x4330,%ax" runs faster than moving 64-bit constants into
> temporary registers (since 64-bit immediates aren't supported outside
> of MOV) and using 'and' and 'cmp' on %rax directly.)
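
For readers wondering what that three-instruction sequence tests, here is a rough Python model of it. It assumes RAX holds the raw bits of an IEEE 754 double; the helper names are invented for this sketch, and how the RTL routines actually branch on the comparison is an assumption (the quoted text only shows the cmp itself).

```python
import struct


def exponent_field_16(x):
    """Model `shr $48, %rax; and $0x7ff0, %ax` on the bits of a double.

    Hypothetical helper: after the shift, bit 15 is the sign, bits 14..4
    are the 11-bit biased exponent, and bits 3..0 are the top mantissa
    bits. The mask keeps only the exponent.
    """
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    ax = (bits >> 48) & 0xFFFF
    return ax & 0x7FF0


def exponent_field_64(x):
    """Same extraction done directly on the 64-bit value, which in real
    code would need the mask loaded into a temporary register first."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits & 0x7FF0000000000000


def exponent_at_least_52(x):
    # 0x4330 >> 4 == 0x433 == 1023 + 52, i.e. |x| >= 2**52 for finite x.
    # Assumption: the caller takes an "above or equal" branch after the cmp.
    return exponent_field_16(x) >= 0x4330


print(hex(exponent_field_16(2.0 ** 52)))   # 0x4330
print(exponent_at_least_52(1.5))           # False
print(exponent_at_least_52(-(2.0 ** 53)))  # True
```

The 16-bit and 64-bit forms always agree, since the 64-bit mask and constant are just the 16-bit ones shifted left by 48; the short form simply avoids the two 64-bit immediate loads.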
>
> I think you always get a read penalty when using the high-byte
> registers because the processor has to do an implicit shift operation.
I don't remember the reason, but I recall reading in Agner Fog's
optimization manual that they are less efficient. Here's the relevant quote:
"Any use of the high 8-bit registers AH, BH, CH, DH should be avoided
because it can cause false dependences and less efficient code."
It's from the chapter "Partial registers" (page 61) of this document:
https://www.agner.org/optimize/optimizing_assembly.pdf
Highly recommended reading, as it addresses exactly the topic of partial
registers. In general, it is the partial register writes of 16-bit or
8-bit subregisters that cause problems - either false read dependencies
(usually on AMD) or extra penalties for joining/splitting registers (on
Intel, at least in the P6 era).
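
As a small illustration of why the narrow writes are the troublesome ones, here is a Python model of the architectural write rules on x86-64. It says nothing about timing on any particular CPU, and the helper names are made up for this sketch.

```python
MASK64 = (1 << 64) - 1


def write_reg(old, value, width):
    """Model writing the low `width` bits of a 64-bit register.

    Hypothetical helper: 64-bit and 32-bit writes replace the whole
    register (a 32-bit write zero-extends), while 16-bit and 8-bit
    writes keep the untouched upper bits of the old value.
    """
    low = (1 << width) - 1
    if width in (32, 64):
        return value & low                      # no dependency on `old`
    return (old & ~low & MASK64) | (value & low)  # merge with `old`


def write_high8(old, value):
    """Model a write to a high-byte register such as AH (bits 15..8)."""
    return (old & ~0xFF00 & MASK64) | ((value & 0xFF) << 8)


rax = 0x1122334455667788
print(hex(write_reg(rax, 0xAABBCCDD, 32)))  # 0xaabbccdd
print(hex(write_reg(rax, 0xBEEF, 16)))      # 0x112233445566beef
print(hex(write_reg(rax, 0x01, 8)))         # 0x1122334455667701
print(hex(write_high8(rax, 0xAB)))          # 0x112233445566ab88
```

In the 8-bit and 16-bit cases the new register value is a function of the old one, which is the merge the processor has to handle one way or another; the 32-bit write has no such input.
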
Best regards,
Nikolay