[fpc-devel] Policy regarding SHL/SHR under x86
J. Gareth Moreton
gareth at moreton-family.com
Tue Oct 25 14:49:11 CEST 2022
Correction to last post. When applying BZHI to an input ("Result :=
Input and ((1 shl x) - 1)"), the initial "mov $-1,%eax" is unnecessary
unless the mask is being preserved, and is just:
bzhil %ecx,(ref-to-Input),%eax
Kit
On 25/10/2022 13:44, J. Gareth Moreton via fpc-devel wrote:
> What I want to do is the following...
>
> Say I have the expression "(1 shl x) - 1"... under the default AMD
> Athlon optimisations, you might get something like this:
>
> (x in %cl, Result in %eax)
> movl $1,%eax
> shll %cl,%eax
> subl $1,%eax
>
> Under -CpCOREAVX2, you might get this (ignoring any zero-extensions
> required on the index):
>
> (x in %ecx, Result in %eax)
> movl $1,%eax
> shlxl %ecx,%eax,%eax
> subl $1,%eax
>
> All of these sequences take at least 3 cycles, or more accurately,
> have a dependency chain of length 3. Now consider using BZHI:
>
> (x in %ecx, Result in %eax)
> movl $-1,%eax
> bzhil %ecx,%eax,%eax
>
> A dependency chain length of 2 (I'm not sure how many cycles it takes
> for BZHI to complete execution).
>
> The savings go further if this result is used as a mask then
> discarded, i.e. "Result := Input and ((1 shl x) - 1)". Under AMD
> Athlon, for example:
>
> (x in %cl, Input in %edx, Result in %eax)
> movl $1,%eax
> shll %cl,%eax
> subl $1,%eax
> andl %edx,%eax
>
> Under -CpCOREAVX2:
>
> (x in %ecx, Input in %edx, Result in %eax)
> movl $1,%eax
> shlxl %ecx,%eax,%eax
> subl $1,%eax
> andl %edx,%eax
>
> All have a dependency chain length of 4. But with BZHI:
>
> (x in %ecx, Input in %edx, Result in %eax)
> movl $-1,%eax
> bzhil %ecx,%edx,%eax
>
> Once again, the dependency chain length is reduced to 2. Like with
> the earlier two, this sequence also works if Input is a reference
> rather than a register; e.g.
>
> (x in %ecx, Result in %eax)
> movl $-1,%eax
> bzhil %ecx,(ref-to-Input),%eax
>
> A problem, however, arises if the index x is out of range. In the
> case of 32-bit operands, the shift instructions in x86 and ARM
> (including AArch64) essentially reduce the index modulo 32. "(1 shl
> 32) - 1" is often expected to return $FFFFFFFF, a mask that covers the
> entire bitrange, but "1 shl 32" returns 1 in this case, so the
> resultant mask ends up being all zeroes. However, with BZHI, if the
> index is out of range, the carry flag is set and the output (%eax in
> this case) is set equal to the input, which results in "(1 shl 32) -
> 1" returning $FFFFFFFF. I think the same thing happens with negative
> indices, since BZHI is essentially unsigned (it also only reads the
> least significant byte of the index register). With this in mind, and
> "1 shl 32" being considered undefined for 32-bit operands, is this an
> acceptable optimisation?
>
> Kit
>
> P.S. There is code in the compiler that catches undefined bitmasks and
> simply sets it to all ones if the index is 32 or 64 or whatever the
> integer word size is. If BZHI is used, a peephole or node
> optimisation can be used to eliminate this catch since it becomes
> unnecessary with BZHI.
>
> _______________________________________________
> fpc-devel maillist - fpc-devel at lists.freepascal.org
> https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
>
More information about the fpc-devel
mailing list