[fpc-devel] Guidance for code generation for shift operations for AVR target

Mon Aug 19 22:20:14 CEST 2019

I'm interested in trying to improve the code generated for shift operations
(in particular those with a compile-time constant shift count) for the AVR
target.  The AVR processor doesn't have a barrel shifter; instead it can
only shift a single bit position per clock cycle. Currently the compiler by
default generates a bit shift loop, where the loop is executed n times to
shift a value by n bits.  The only optimization I noticed for the case of
shifting a value by a compile-time constant is in Tcg.a_op_const_reg_reg,
where an 8-bit shift of a 16-bit value is converted to copying the low byte
of the left operand into the high byte of the result, and setting the low
byte of the result to 0.

I would like to extend this type of optimization to cover more cases. The
obvious extension is to convert all shifts by multiples of 8 bits into the
corresponding byte moves. A more general approach (which I've got working
for shl as a proof of concept) is to convert the 8-bit multiples into byte
moves, then do the remaining few bit shifts (if any) either as an unrolled
loop (e.g. as implemented in tcgavr.a_op_const_reg_internal) or by
generating the conventional shift loop (as implemented in
tcgavr.a_op_reg_reg_internal). At the moment I've implemented this
logic in tcgavr.a_op_const_reg_reg.  I first check whether I can generate
smaller code than a shift loop; if not, the code calls the
inherited method a_op_const_reg_reg, which basically follows the existing
path (see also the attached patch).
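
To make the decomposition concrete, here is a minimal sketch of the idea as
a Python model. It is not compiler code and not taken from the patch: the
names shl_const and shl_const_int are hypothetical, and the value is
modelled as a little-endian list of bytes standing in for the AVR registers
that hold it. The whole-byte part of the shift becomes byte moves, and the
remaining 0..7 bits are shifted one position at a time with a carry between
bytes, roughly as an lsl/rol chain would do.

```python
def shl_const(value_bytes, shift):
    """Shift a little-endian list of bytes left by a constant bit count.

    Hypothetical model: the whole-byte part of the shift is done as byte
    moves, the remaining 0..7 bits as single-bit shifts with carry.
    """
    size = len(value_bytes)
    byte_moves, bit_shifts = divmod(shift, 8)

    # Step 1: byte moves. Result byte i takes source byte i - byte_moves;
    # every byte below byte_moves is known to be zero.
    result = [value_bytes[i - byte_moves] if i >= byte_moves else 0
              for i in range(size)]

    # Step 2: residual single-bit shifts. Bytes below byte_moves are zero,
    # so only the upper bytes need to take part in the carry chain.
    for _ in range(bit_shifts):
        carry = 0
        for i in range(byte_moves, size):
            shifted = (result[i] << 1) | carry
            carry = shifted >> 8
            result[i] = shifted & 0xFF
    return result


def shl_const_int(value, shift, size):
    """Convenience wrapper: apply shl_const to an integer of `size` bytes."""
    data = list(value.to_bytes(size, "little"))
    return int.from_bytes(bytes(shl_const(data, shift)), "little")
```

For example, shl_const_int(0x1234, 8, 2) needs no bit shifts at all and is
just the byte move described in the first paragraph, while a shift by 11 on
a 32-bit value becomes one byte move followed by three single-bit shifts
over the upper three bytes only.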

I find the code generator hard to follow, since the functionality is
normalized and distributed across generic and CPU-specific parts, and the
code flow jumps up and down the inheritance chain.  I therefore have some
questions about this proposed modification and its implementation in the
code generator, on which I would like some guidance:
* Is tcgavr.a_op_const_reg_reg the correct place for this type of
functionality?
* Am I messing up something else by effectively moving most of the code
generation for this case higher up the call chain?
* Am I missing some other path which could also benefit from this
optimization?
* Should I try and generate different code depending on whether -Os is
specified or not (e.g. perform more loop unrolling if -Os is not specified)?
* Any comments on the patch, which is a work in progress?

best wishes,
Christo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: shiftbyconst.patch
Type: text/x-patch
Size: 2180 bytes
Desc: not available
URL: <http://lists.freepascal.org/pipermail/fpc-devel/attachments/20190819/277af5d4/attachment.bin>