[fpc-devel] Optimization of redundant mov's

Sun Mar 19 10:25:07 CET 2017

Martok wrote:

>  a:= CurrentHash[0]; b:= CurrentHash[1]; c:= CurrentHash[2]; d:= CurrentHash[3];
> 0000000100074943 488b8424a0020000         mov    0x2a0(%rsp),%rax
> 000000010007494B 4c8b5038                 mov    0x38(%rax),%r10
> 000000010007494F 488b8424a0020000         mov    0x2a0(%rsp),%rax
> 0000000100074957 4c8b5840                 mov    0x40(%rax),%r11
> 000000010007495B 488b9424a0020000         mov    0x2a0(%rsp),%rdx
> 0000000100074963 488b4248                 mov    0x48(%rdx),%rax
> 0000000100074967 488b9424a0020000         mov    0x2a0(%rsp),%rdx
> 000000010007496F 488b6a50                 mov    0x50(%rdx),%rbp
> 
> Every single one of the "mov 0x2a0(%rsp), %rxx" instructions except the first is
> redundant and causes another memory round-trip. At the same time, more registers
> are used, which probably makes other optimizations more difficult, especially
> when something similar happens on i386.
> 
> Now, the fun part: I haven't been able to build a simple test that causes the
> same issue (the self-pointer already is in %rcx and not fetched from the stack
> each time), so I have a feeling this may be a side effect of some other part of
> the code.

It's called register spilling: once there are no registers left to hold
values, the compiler has to pick registers whose value will be kept in
memory instead. Register allocation is an NP-complete problem, so the
result will never be 100% optimal (at least if you don't want to wait
forever while the compiler checks out all possible assignments). One
possible heuristic, which is used by FPC's register allocator, is to
spill the register that conflicts with the largest number of other
registers (to minimise the number of registers spilled to memory).

There are techniques to more optimally spill (e.g. live range
splitting), and there are also other kinds of optimisations that could
be run after register allocation to make the code more optimal. CSE at
the assembler level could be used in this case. That's a very complex
undertaking for relatively little gain though. E.g. those memory loads
are probably optimised by the processor itself (not necessarily coming
even from the L1 cache, but possibly from the write-back buffer).

Jonas