[fpc-devel] Guidance for code generation for shift operations for AVR target

Mon Aug 19 22:36:18 CEST 2019

Am 19.08.2019 um 22:20 schrieb Christo Crause:
> I'm interested in trying to improve the code generated for shift
> operations (in particular involving a compile time constant shift) for
> the AVR target.  The AVR processor doesn't have a barrel shifter,
> instead it can only shift a single bit position per clock cycle.
> Currently the compiler by default generates a bit shift loop where the
> loop is executed n times to push a value by n bits.  The only
> optimization I noticed for the case of shifting a value by a compile
> time constant is in Tcg.a_op_const_reg_reg where an 8 bit shift of a 16
> bit value is converted to copying the low byte of the left operand into
> the high byte of the result, and setting the low byte of the result to 0.
> 
> I would like to extend this type of optimization to cover more cases -
> the obvious extension is to convert all shifts by 8 bit multiples by
> corresponding byte moves. A more general approach (which I've got
> working for shl as concept) is to at least convert all 8 bit multiples
> as byte moves, then just do the last few bit shifts (if any) either as
> an unrolled loop (e.g. as implemented in tcgavr.a_op_const_reg_internal)
> or by generating the conventional shift loop (as implemented in
> tcgavr.a_op_reg_reg_internal). At the moment I've implemented the this
> logic in tcgavr.a_op_const_reg_reg.  I first check if I can generate
> smaller code compared to a shift loop and if not, the code calls the
> inherited method  a_op_const_reg_reg which basically follows the
> existing path (see also attached patch)
> 
> The code generator is complex to follow for me, since the functionality
> is kind of normalized and distributed across generic and CPU specific
> parts, and code flow jumps around up and down the inheritance chain. 

Yes, nice OOP ;)

> I
> therefore have some questions around this proposed modification and
> implementation in the code generator for which I would like some
> guidance on:
> * Is tcgavr.a_op_const_reg_reg the correct place for this type of
> functionality?
> * Am I messing up something else by effectively moving most of the code
> generation for this case higher up the call chain?
> * Am I missing some other path which could also benefit from this
> optimization?
> * Should I try and generate different code depending on whether -Os is
> specified or not (e.g. perform more loop unrolling if -Os is not specified)?
> * Any comments on the patch, which is a work in progress?

For me the idea looks good, actually, I planned once to do the same, but
got stuck for time reasons with the beginnings you discovered.

If such a patch really works, can normaly only checked by testing.
Besides simple benchmarking I often do another thing: I build without
the patch with -al, copy all .s files to a tmp dir, then I do a build
with the patch with -al, copy all resulting .s files to another temp.
dir and then I look at the differences with WinMerge. It becomes pretty
fast obvious if such a patch works good.