[fpc-devel] Policy regarding SHL/SHR under x86

Tue Oct 25 14:49:11 CEST 2022

Correction to last post.  When applying BZHI to an input ("Result := 
Input and ((1 shl x) - 1)"), the initial "mov $-1,%eax" is unnecessary 
unless the mask is being preserved, and is just:

bzhil %ecx,(ref-to-Input),%eax

Kit

On 25/10/2022 13:44, J. Gareth Moreton via fpc-devel wrote:
> What I want to do is the following...
>
> Say I have the expression "(1 shl x) - 1"... under the default AMD 
> Athlon optimisations, you might get something like this:
>
> (x in %cl, Result in %eax)
> movl $1,%eax
> shll %cl,%eax
> subl $1,%eax
>
> Under -CpCOREAVX2, you might get this (ignoring any zero-extensions 
> required on the index):
>
> (x in %ecx, Result in %eax)
> movl  $1,%eax
> shlxl %ecx,%eax,%eax
> subl  $1,%eax
>
> All of these sequences take at least 3 cycles, or more accurately, 
> have a dependency chain of length 3.  Now consider using BZHI:
>
> (x in %ecx, Result in %eax)
> movl  $-1,%eax
> bzhil %ecx,%eax,%eax
>
> A dependency chain length of 2 (I'm not sure how many cycles it takes 
> for BZHI to complete execution).
>
> The savings go further if this result is used as a mask then 
> discarded, i.e. "Result := Input and ((1 shl x) - 1)".  Under AMD 
> Athlon, for example:
>
> (x in %cl, Input in %edx, Result in %eax)
> movl $1,%eax
> shll %cl,%eax
> subl $1,%eax
> andl %edx,%eax
>
> Under -CpCOREAVX2:
>
> (x in %ecx, Input in %edx, Result in %eax)
> movl  $1,%eax
> shlxl %ecx,%eax,%eax
> subl  $1,%eax
> andl  %edx,%eax
>
> All have a dependency chain length of 4.  But with BZHI:
>
> (x in %ecx, Input in %edx, Result in %eax)
> movl  $-1,%eax
> bzhil %ecx,%edx,%eax
>
> Once again, the dependency chain length is reduced to 2.  Like with 
> the earlier two, this sequence also works if Input is a reference 
> rather than a register; e.g.
>
> (x in %ecx, Result in %eax)
> movl  $-1,%eax
> bzhil %ecx,(ref-to-Input),%eax
>
> A problem, however, arises if the index x is out of range.  In the 
> case of 32-bit operands, the shift instructions in x86 and ARM 
> (including AArch64) essentially reduce the index modulo 32.  "(1 shl 
> 32) - 1" is often expected to return $FFFFFFFF, a mask that covers the 
> entire bitrange, but "1 shl 32" returns 1 in this case, so the 
> resultant mask ends up being all zeroes.  However, with BZHI, if the 
> index is out of range, the carry flag is set and the output (%eax in 
> this case) is set equal to the input, which results in "(1 shl 32) - 
> 1" returning $FFFFFFFF.  I think the same thing happens with negative 
> indices, since BZHI is essentially unsigned (it also only reads the 
> least significant byte of the index register).  With this in mind, and 
> "1 shl 32" being considered undefined for 32-bit operands, is this an 
> acceptable optimisation?
>
> Kit
>
> P.S. There is code in the compiler that catches undefined bitmasks and 
> simply sets it to all ones if the index is 32 or 64 or whatever the 
> integer word size is.  If BZHI is used, a peephole or node 
> optimisation can be used to eliminate this catch since it becomes 
> unnecessary with BZHI.
>
> _______________________________________________
> fpc-devel maillist  -  fpc-devel at lists.freepascal.org
> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
>