[fpc-devel] Discussion on a particular optimisation development (WARNING: Technical!)

Wed Feb 3 08:57:51 CET 2021

On 2/2/21 11:06 PM, J. Gareth Moreton via fpc-devel wrote:
> Hi everyone,
>
> I've found a potential optimisation for conditions of the form "(x <> 
> 0) and (y <> 0)", which are very common because this is semantically 
> equivalent to "Assigned(x) and Assigned(y)", for example, and such a 
> construct is generated implicity in TObject.Destroy, for example.  The 
> node tree for TObject.Destroy example (after the first pass) is as 
> follows (it checks the value of Self and VMT):
>
> The node tree inside the if-node's condition branch is as follows:
>
> <andn resultdef="Boolean" pos="329,9" flags="nf_pass1_done" 
> complexity="3">
>    <unequaln resultdef="Boolean" pos="329,9" flags="nf_pass1_done" 
> complexity="1">
>       <loadn resultdef="TObject" pos="329,9" flags="nf_pass1_done" 
> complexity="0">
>          <symbol>self</symbol>
>       </loadn>
>       <niln resultdef="TObject" pos="329,9" flags="nf_pass1_done" 
> complexity="0">
>       </niln>
>    </unequaln>
>    <unequaln resultdef="Boolean" pos="329,9" flags="nf_pass1_done" 
> complexity="1">
>       <typeconvn resultdef="$char_pointer" pos="329,9" 
> flags="nf_pass1_done,nf_explicit,nf_internal" complexity="0" 
> convtype="tc_equal">
>          <typeconvn resultdef="Pointer" pos="329,9" 
> flags="nf_pass1_done" complexity="0" convtype="tc_equal">
>             <loadn resultdef="<no type symbol>" pos="329,9" 
> flags="nf_pass1_done" complexity="0">
>                <symbol>vmt</symbol>
>             </loadn>
>          </typeconvn>
>       </typeconvn>
>       <pointerconstn resultdef="$char_pointer" pos="329,9" 
> flags="nf_pass1_done,nf_explicit" complexity="0">
>          <value>$0000000000000000</value>
>       </pointerconstn>
>    </unequaln>
> </andn>
>
> On x86-64_win64, the following assembly language is generated (smart 
> pipelining and out-of-order execution MIGHT permit it to be executed 
> in 3 cycles, but it's more likely going to take 4 or 5):
>
>     testq    %rbx,%rbx
>     setneb   %al
>     testq    %rsi,%rsi
>     setneb   %dl
>     andb     %dl,%al
>
> DeMorgan's Laws for 2 inputs state that "not (A or B) = not A and not 
> B", and "not (A and B) = not A or not B", and using these, we can 
> develop much more efficient assembly language (which will almost 
> certainly take 3 cycles to run, and is much smaller):
>
>     movq     %rbx,%rdx
>     orq      %rsi,%rdx
>     seteb    %al
>
> For this particular routine, %rsi and %dl/%rdx are not used 
> afterwards, and can be simplified further (this bit will probably be a 
> peephole optimisation), dropping the cycle count to 2:
>
>     orq      %rbx,%rsi
>     seteb    %al
>
> In this situation it is safe because the comparison values are already 
> in registers and code generation has permitted the bypassing of 
> Boolean short-circuiting.
>
> Rather than write an entire peephole optimisation, I would ideally 
> like to program this optimisation at the nodal level, and permit the 
> compiler to convert it into something resembing the following ("andn" 
> becomes "notn" -> "orn", and the two "unequaln" nodes become "equaln"):
>
> <notn resultdef="Boolean" pos="329,9" flags="nf_pass1_done" 
> complexity="3">
>    <orn resultdef="Boolean" pos="329,9" flags="nf_pass1_done" 
> complexity="3">
>       <equaln resultdef="Boolean" pos="329,9" flags="nf_pass1_done" 
> complexity="1">
>          <loadn resultdef="TObject" pos="329,9" flags="nf_pass1_done" 
> complexity="0">
>             <symbol>self</symbol>
>          </loadn>
>          <niln resultdef="TObject" pos="329,9" flags="nf_pass1_done" 
> complexity="0">
>          </niln>
>       </equaln>
>       <equaln resultdef="Boolean" pos="329,9" flags="nf_pass1_done" 
> complexity="1">
>          <typeconvn resultdef="$char_pointer" pos="329,9" 
> flags="nf_pass1_done,nf_explicit,nf_internal" complexity="0" 
> convtype="tc_equal">
>             <typeconvn resultdef="Pointer" pos="329,9" 
> flags="nf_pass1_done" complexity="0" convtype="tc_equal">
>                <loadn resultdef="<no type symbol>" pos="329,9" 
> flags="nf_pass1_done" complexity="0">
>                   <symbol>vmt</symbol>
>                </loadn>
>             </typeconvn>
>          </typeconvn>
>          <pointerconstn resultdef="$char_pointer" pos="329,9" 
> flags="nf_pass1_done,nf_explicit" complexity="0">
>             <value>$0000000000000000</value>
>          </pointerconstn>
>       </equaln>
>    </orn>
> </notn>
>
> ...but there is a catch.  In terms of the raw nodes, converting the 
> "andn" node into a "notn" and an "orn" node results in a more complex 
> node tree and hence less efficient assembly language in general, 
> unless a particular "pass_generate_code" routine knows to look out for 
> the set-up of a logical "or" that combines two variables' comparison 
> against zero, like that shown above.
>
> My question is... how should it be developed so that the node 
> optimisation is performed on platforms that have the necessary 
> assembly instructions, like x86_64 and AArch64 (once I develop them), 
> but also not perform the optimisation on platforms that don't have the 
> necessary instructions or checks in "pass_generate_code"?  Allowing 
> the optimisation wouldn't cause bad code generation, but it would be 
> more inefficient in the general case.
>
> I hope I explained that okay.

You can introduce a new pass in the compiler, that does the node 
transformation only on platforms that benefit from it. See e.g. what I 
did a few years ago in optloadmodifystore.pas in the compiler sources. 
This is an optimization that introduces new nodes that generate faster 
code on x86 for patterns, such as:

a := a + b

Which can be done with a single 'add mem, const' or 'add mem, reg' 
instruction on all processors in the x86 family (i8086, i386 and 
x86_64). Also, I think there were other CPUs that benefit from this 
optimization (m68k?). For that, I introduced new inline compiler nodes, 
that are only used for this optimization.

>
> (Other possibilities include introducing new node types "norn" and 
> "nandn" (and maybe "nxorn" to complete the set), and having the "andn" 
> node above be converted into "norn", so that they can be instantly 
> converted into the relevant assembly instructions instead of relying 
> on multiple steps of optimistion)

Introducing norn and nand for optimization purposes might not be a bad 
idea, considering there are some CPU architectures that also have nor 
and nand instructions (powerpc, for example).

Best regards,

Nikolay

P.S. I don't think we need "WARNING: Technical!" trigger warnings on 
fpc-devel ;-)