[fpc-devel] Discussion on a particular optimisation development (WARNING: Technical!)

Tue Feb 2 22:06:11 CET 2021

Hi everyone,

I've found a potential optimisation for conditions of the form "(x <> 0) 
and (y <> 0)", which are very common because this is semantically 
equivalent to "Assigned(x) and Assigned(y)", for example, and such a 
construct is generated implicity in TObject.Destroy, for example.  The 
node tree for TObject.Destroy example (after the first pass) is as 
follows (it checks the value of Self and VMT):

The node tree inside the if-node's condition branch is as follows:

<andn resultdef="Boolean" pos="329,9" flags="nf_pass1_done" complexity="3">
    <unequaln resultdef="Boolean" pos="329,9" flags="nf_pass1_done" 
complexity="1">
       <loadn resultdef="TObject" pos="329,9" flags="nf_pass1_done" 
complexity="0">
          <symbol>self</symbol>
       </loadn>
       <niln resultdef="TObject" pos="329,9" flags="nf_pass1_done" 
complexity="0">
       </niln>
    </unequaln>
    <unequaln resultdef="Boolean" pos="329,9" flags="nf_pass1_done" 
complexity="1">
       <typeconvn resultdef="$char_pointer" pos="329,9" 
flags="nf_pass1_done,nf_explicit,nf_internal" complexity="0" 
convtype="tc_equal">
          <typeconvn resultdef="Pointer" pos="329,9" 
flags="nf_pass1_done" complexity="0" convtype="tc_equal">
             <loadn resultdef="<no type symbol>" pos="329,9" 
flags="nf_pass1_done" complexity="0">
                <symbol>vmt</symbol>
             </loadn>
          </typeconvn>
       </typeconvn>
       <pointerconstn resultdef="$char_pointer" pos="329,9" 
flags="nf_pass1_done,nf_explicit" complexity="0">
          <value>$0000000000000000</value>
       </pointerconstn>
    </unequaln>
</andn>

On x86-64_win64, the following assembly language is generated (smart 
pipelining and out-of-order execution MIGHT permit it to be executed in 
3 cycles, but it's more likely going to take 4 or 5):

     testq    %rbx,%rbx
     setneb   %al
     testq    %rsi,%rsi
     setneb   %dl
     andb     %dl,%al

DeMorgan's Laws for 2 inputs state that "not (A or B) = not A and not 
B", and "not (A and B) = not A or not B", and using these, we can 
develop much more efficient assembly language (which will almost 
certainly take 3 cycles to run, and is much smaller):

     movq     %rbx,%rdx
     orq      %rsi,%rdx
     seteb    %al

For this particular routine, %rsi and %dl/%rdx are not used afterwards, 
and can be simplified further (this bit will probably be a peephole 
optimisation), dropping the cycle count to 2:

     orq      %rbx,%rsi
     seteb    %al

In this situation it is safe because the comparison values are already 
in registers and code generation has permitted the bypassing of Boolean 
short-circuiting.

Rather than write an entire peephole optimisation, I would ideally like 
to program this optimisation at the nodal level, and permit the compiler 
to convert it into something resembing the following ("andn" becomes 
"notn" -> "orn", and the two "unequaln" nodes become "equaln"):

<notn resultdef="Boolean" pos="329,9" flags="nf_pass1_done" complexity="3">
    <orn resultdef="Boolean" pos="329,9" flags="nf_pass1_done" 
complexity="3">
       <equaln resultdef="Boolean" pos="329,9" flags="nf_pass1_done" 
complexity="1">
          <loadn resultdef="TObject" pos="329,9" flags="nf_pass1_done" 
complexity="0">
             <symbol>self</symbol>
          </loadn>
          <niln resultdef="TObject" pos="329,9" flags="nf_pass1_done" 
complexity="0">
          </niln>
       </equaln>
       <equaln resultdef="Boolean" pos="329,9" flags="nf_pass1_done" 
complexity="1">
          <typeconvn resultdef="$char_pointer" pos="329,9" 
flags="nf_pass1_done,nf_explicit,nf_internal" complexity="0" 
convtype="tc_equal">
             <typeconvn resultdef="Pointer" pos="329,9" 
flags="nf_pass1_done" complexity="0" convtype="tc_equal">
                <loadn resultdef="<no type symbol>" pos="329,9" 
flags="nf_pass1_done" complexity="0">
                   <symbol>vmt</symbol>
                </loadn>
             </typeconvn>
          </typeconvn>
          <pointerconstn resultdef="$char_pointer" pos="329,9" 
flags="nf_pass1_done,nf_explicit" complexity="0">
             <value>$0000000000000000</value>
          </pointerconstn>
       </equaln>
    </orn>
</notn>

...but there is a catch.  In terms of the raw nodes, converting the 
"andn" node into a "notn" and an "orn" node results in a more complex 
node tree and hence less efficient assembly language in general, unless 
a particular "pass_generate_code" routine knows to look out for the 
set-up of a logical "or" that combines two variables' comparison against 
zero, like that shown above.

My question is... how should it be developed so that the node 
optimisation is performed on platforms that have the necessary assembly 
instructions, like x86_64 and AArch64 (once I develop them), but also 
not perform the optimisation on platforms that don't have the necessary 
instructions or checks in "pass_generate_code"?  Allowing the 
optimisation wouldn't cause bad code generation, but it would be more 
inefficient in the general case.

I hope I explained that okay.

(Other possibilities include introducing new node types "norn" and 
"nandn" (and maybe "nxorn" to complete the set), and having the "andn" 
node above be converted into "norn", so that they can be instantly 
converted into the relevant assembly instructions instead of relying on 
multiple steps of optimistion)

Gareth aka. Kit

-- 
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus