[fpc-devel] Discussion on a particular optimisation development (WARNING: Technical!)
Nikolay Nikolov
nickysn at gmail.com
Wed Feb 3 08:57:51 CET 2021
On 2/2/21 11:06 PM, J. Gareth Moreton via fpc-devel wrote:
> Hi everyone,
>
> I've found a potential optimisation for conditions of the form "(x <>
> 0) and (y <> 0)", which are very common because this is semantically
> equivalent to "Assigned(x) and Assigned(y)", for example, and such a
> construct is generated implicity in TObject.Destroy, for example. The
> node tree for TObject.Destroy example (after the first pass) is as
> follows (it checks the value of Self and VMT):
>
> The node tree inside the if-node's condition branch is as follows:
>
> <andn resultdef="Boolean" pos="329,9" flags="nf_pass1_done"
> complexity="3">
> <unequaln resultdef="Boolean" pos="329,9" flags="nf_pass1_done"
> complexity="1">
> <loadn resultdef="TObject" pos="329,9" flags="nf_pass1_done"
> complexity="0">
> <symbol>self</symbol>
> </loadn>
> <niln resultdef="TObject" pos="329,9" flags="nf_pass1_done"
> complexity="0">
> </niln>
> </unequaln>
> <unequaln resultdef="Boolean" pos="329,9" flags="nf_pass1_done"
> complexity="1">
> <typeconvn resultdef="$char_pointer" pos="329,9"
> flags="nf_pass1_done,nf_explicit,nf_internal" complexity="0"
> convtype="tc_equal">
> <typeconvn resultdef="Pointer" pos="329,9"
> flags="nf_pass1_done" complexity="0" convtype="tc_equal">
> <loadn resultdef="<no type symbol>" pos="329,9"
> flags="nf_pass1_done" complexity="0">
> <symbol>vmt</symbol>
> </loadn>
> </typeconvn>
> </typeconvn>
> <pointerconstn resultdef="$char_pointer" pos="329,9"
> flags="nf_pass1_done,nf_explicit" complexity="0">
> <value>$0000000000000000</value>
> </pointerconstn>
> </unequaln>
> </andn>
>
> On x86-64_win64, the following assembly language is generated (smart
> pipelining and out-of-order execution MIGHT permit it to be executed
> in 3 cycles, but it's more likely going to take 4 or 5):
>
> testq %rbx,%rbx
> setneb %al
> testq %rsi,%rsi
> setneb %dl
> andb %dl,%al
>
> DeMorgan's Laws for 2 inputs state that "not (A or B) = not A and not
> B", and "not (A and B) = not A or not B", and using these, we can
> develop much more efficient assembly language (which will almost
> certainly take 3 cycles to run, and is much smaller):
>
> movq %rbx,%rdx
> orq %rsi,%rdx
> seteb %al
>
> For this particular routine, %rsi and %dl/%rdx are not used
> afterwards, and can be simplified further (this bit will probably be a
> peephole optimisation), dropping the cycle count to 2:
>
> orq %rbx,%rsi
> seteb %al
>
> In this situation it is safe because the comparison values are already
> in registers and code generation has permitted the bypassing of
> Boolean short-circuiting.
>
> Rather than write an entire peephole optimisation, I would ideally
> like to program this optimisation at the nodal level, and permit the
> compiler to convert it into something resembing the following ("andn"
> becomes "notn" -> "orn", and the two "unequaln" nodes become "equaln"):
>
> <notn resultdef="Boolean" pos="329,9" flags="nf_pass1_done"
> complexity="3">
> <orn resultdef="Boolean" pos="329,9" flags="nf_pass1_done"
> complexity="3">
> <equaln resultdef="Boolean" pos="329,9" flags="nf_pass1_done"
> complexity="1">
> <loadn resultdef="TObject" pos="329,9" flags="nf_pass1_done"
> complexity="0">
> <symbol>self</symbol>
> </loadn>
> <niln resultdef="TObject" pos="329,9" flags="nf_pass1_done"
> complexity="0">
> </niln>
> </equaln>
> <equaln resultdef="Boolean" pos="329,9" flags="nf_pass1_done"
> complexity="1">
> <typeconvn resultdef="$char_pointer" pos="329,9"
> flags="nf_pass1_done,nf_explicit,nf_internal" complexity="0"
> convtype="tc_equal">
> <typeconvn resultdef="Pointer" pos="329,9"
> flags="nf_pass1_done" complexity="0" convtype="tc_equal">
> <loadn resultdef="<no type symbol>" pos="329,9"
> flags="nf_pass1_done" complexity="0">
> <symbol>vmt</symbol>
> </loadn>
> </typeconvn>
> </typeconvn>
> <pointerconstn resultdef="$char_pointer" pos="329,9"
> flags="nf_pass1_done,nf_explicit" complexity="0">
> <value>$0000000000000000</value>
> </pointerconstn>
> </equaln>
> </orn>
> </notn>
>
> ...but there is a catch. In terms of the raw nodes, converting the
> "andn" node into a "notn" and an "orn" node results in a more complex
> node tree and hence less efficient assembly language in general,
> unless a particular "pass_generate_code" routine knows to look out for
> the set-up of a logical "or" that combines two variables' comparison
> against zero, like that shown above.
>
> My question is... how should it be developed so that the node
> optimisation is performed on platforms that have the necessary
> assembly instructions, like x86_64 and AArch64 (once I develop them),
> but also not perform the optimisation on platforms that don't have the
> necessary instructions or checks in "pass_generate_code"? Allowing
> the optimisation wouldn't cause bad code generation, but it would be
> more inefficient in the general case.
>
> I hope I explained that okay.
You can introduce a new pass in the compiler, that does the node
transformation only on platforms that benefit from it. See e.g. what I
did a few years ago in optloadmodifystore.pas in the compiler sources.
This is an optimization that introduces new nodes that generate faster
code on x86 for patterns, such as:
a := a + b
Which can be done with a single 'add mem, const' or 'add mem, reg'
instruction on all processors in the x86 family (i8086, i386 and
x86_64). Also, I think there were other CPUs that benefit from this
optimization (m68k?). For that, I introduced new inline compiler nodes,
that are only used for this optimization.
>
> (Other possibilities include introducing new node types "norn" and
> "nandn" (and maybe "nxorn" to complete the set), and having the "andn"
> node above be converted into "norn", so that they can be instantly
> converted into the relevant assembly instructions instead of relying
> on multiple steps of optimistion)
Introducing norn and nand for optimization purposes might not be a bad
idea, considering there are some CPU architectures that also have nor
and nand instructions (powerpc, for example).
Best regards,
Nikolay
P.S. I don't think we need "WARNING: Technical!" trigger warnings on
fpc-devel ;-)
More information about the fpc-devel
mailing list