[fpc-devel] Register renaming and false dependency question

Mon Oct 18 17:56:49 CEST 2021

None of the aforementioned c++ compilers does this optimization with either O2/3 or Os - see https://godbolt.org/z/vrPfd3xej (could be I was missing one of the many additional flags you can pass there though)

In the case of the 2 mov instructions, the CPU most likely uses register renaming for the second mov into rax to eliminate the write-after-read dependency. That means both mov instructions can happen at the same time. Using shr causes a dependency so it has to execute after the mov. I don't know though how that interfers with rax also being input of the imul instruction and when the shl can actually execute. You will have to profile this with something like https://github.com/travisdowns/uarch-bench or alike.

> On 18/10/2021 11:14 J. Gareth Moreton via fpc-devel <fpc-devel at lists.freepascal.org> wrote:
> 
>  
> What I'm figuring is that it should be fine for -Os, since that 
> optimises for size over speed, and -Os uses "or $-1, %rax" for setting a 
> register to -1 instead of "mov $-1, %rax".  I copied this behaviour from 
> MSVC after finding a thread about the false dependency that OR causes 
> (Florian warned me of the false dependency when I proposed the 
> optimisation regardless of size or speed).
> 
> Outside of -Os, if the immediate previous instruction uses the same 
> register, then it's guaranteed to have been loaded and there shouldn't 
> be a pipeline stall, although out-of-order execution may cause some 
> problems.  This one is going to be interesting to time.
> 
> In regards to how frequently it appears in code, if it turns out that 
> it's very infrequent (normally I compile and analyse the RTL for this 
> purpose, and sometimes the compiler itself), I likely won't code it, 
> although in the case of the "fast mod" algorithm, it appears very 
> frequently, so should probably be coded directly, especially as the 
> multiplication can take 3 cycles, depending on the processor, and is the 
> limiting factor.
> 
> Gareth aka. Kit
> 
> On 17/10/2021 23:53, Stefan Glienke via fpc-devel wrote:
> > According to compiler explorer clang, gcc and msvc compile this to the 
> > same code with -O3 as FPC does. So I would assume that is fine.
> >
> > Am 17.10.2021 um 13:25 schrieb J. Gareth Moreton via fpc-devel:
> >> Hi everyone,
> >>
> >> While reading up on some algorithms, I came across a recommendation 
> >> of using a shorter arithmetic function to change the value of a 
> >> constant in a register rather than loading the new value directly.  
> >> However, the algorithm assumes a RISC-like processor, so I'm not sure 
> >> if it applies to an Intel x86-64 processor.  Consider the following:
> >>
> >> movq $0xaaaaaaaaaaaaaaab,%rax
> >> imulq   %rax,%rcx
> >> movq $0x5555555555555555,%rax
> >> cmpq    %rax,%rcx
> >> setle  %al
> >>
> >> This algorithm sets %al to 1 if %rcx is divisible by 3, and 0 if it's 
> >> not, and was compiled from the following Pascal code (under -O3, but 
> >> -O1 produces almost exactly the same):
> >>
> >> function IsDivisible3(Numerator: QWord): Boolean;
> >> begin
> >>   Result := (Numerator * $AAAAAAAAAAAAAAAB) <= $5555555555555555;
> >> end;
> >>
> >> (One of my merge requests produces this code from "Result := (x mod 
> >> 3) = 0")
> >>
> >> My question is this: can "movq $0x5555555555555555,%rax" be replaced 
> >> with "shrq $0x1,%rax" without incurring an additional pipeline 
> >> stall?  The MOV instruction takes 10 bytes to store, while "SHR 1" 
> >> takes only 3.  Given that %rax is used beforehand and the CMP 
> >> instruction has to wait until the IMUL instruction has finished 
> >> executing, logic tells me that I can get away with it here, but I'm 
> >> not sure if the metric to go by is the execution speed of IMUL (i.e. 
> >> the IMUL instruction is the limiting factor before CMP can be 
> >> executed), or the simple fact that the previous value of %rax was 
> >> used and will be loaded with $AAAAAAAAAAAAAAAB by the time it comes 
> >> to load it with a new value.
> >>
> >> Gareth aka. Kit
> >>
> >>
> > _______________________________________________
> > fpc-devel maillist  -  fpc-devel at lists.freepascal.org
> > https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
> >
> 
> -- 
> This email has been checked for viruses by Avast antivirus software.
> https://www.avast.com/antivirus
> 
> _______________________________________________
> fpc-devel maillist  -  fpc-devel at lists.freepascal.org
> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel