[fpc-devel] x86_64/amd64 asmcse and peephole optimizer port

Mon Nov 15 13:37:20 CET 2010

On Wed, Nov 10, 2010 at 2:27 PM, Jonas Maebe <jonas.maebe at elis.ugent.be> wrote:
>
> On 30 Oct 2010, at 13:20, Matthias K. wrote:
>
>> the last days I've done a first step in Porting the i386 data flow
>> analyzer, asmcse and  peephole optimizations.
>
> Quite impressive!
>
>> Main motivation is: target instruction level optimization is always a
>> good thing especially for bottlenecks.
>
> That's true. There's one small problem though: the asmcse optimiser
> (csopt386, and large parts of daopt386) has been on its way out the last
> couple of releases, because it contains some bugs that are very hard to fix
> due to it not being very good/clean code. It is already no longer activated
> by default for -O2 since FPC 2.4.0, and currently has to be enabled
> explicitly via -Ooasmcse.
>
> The final drop in the bucket that caused it to be disabled by default was
> http://bugs.freepascal.org/view.php?id=14363
>
>> The main target was: porting the i386 optimization part to x86_64
>> (amd64) and merging it back such that generic x86 optimization is in
>> one place.
>
> If you are willing to take responsibility for that code (feel free to
> completely rewrite it), that would be great. Then it can maybe be enabled
> again by default.

Thanks for the info (and especially for the "impressive" :) ).
Some parts of the analysis do not work correctly for x64, and the
problems are hard to track down. Furthermore, I assume that some parts
simply don't work because of x64-specific issues in code generation,
such as clearing the upper 32 bits with "and $FFFFFFFF,x", and because
of the many operand-size differences (transfers and arithmetic on
8/16/32/64 bits), which need additional handling, post-processing
and checks.
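
To illustrate the operand-size issue, here is a minimal sketch in
Python (not FPC code; write_reg is a made-up helper, and the 8-bit case
models only the low byte, not ah/bh/ch/dh). It models how x86-64
updates a 64-bit register for each write size: a 32-bit write
zero-extends, while 8/16-bit writes keep the upper bits. An analysis
ported from i386 has to account for both behaviours.

```python
MASK64 = (1 << 64) - 1


def write_reg(old: int, value: int, size_bits: int) -> int:
    """Return the 64-bit register content after writing `value` to the
    low `size_bits` bits of a register that previously held `old`."""
    if size_bits == 64:
        return value & MASK64
    if size_bits == 32:
        # A 32-bit write zero-extends into the full 64-bit register.
        return value & 0xFFFFFFFF
    if size_bits in (8, 16):
        # 8/16-bit writes leave the remaining upper bits untouched.
        low_mask = (1 << size_bits) - 1
        return (old & ~low_mask & MASK64) | (value & low_mask)
    raise ValueError("unsupported operand size")


rax = 0x1122334455667788
print(hex(write_reg(rax, 0xAABBCCDD, 32)))  # 0xaabbccdd
print(hex(write_reg(rax, 0xAABB, 16)))      # 0x112233445566aabb
print(hex(write_reg(rax, 0xAA, 8)))         # 0x11223344556677aa
```

In this model the upper half is already zero after a 32-bit write, so
an explicit mask with $FFFFFFFF directly after such a write would
change nothing.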

In short, the asmcse part isn't working correctly, and the bug is not
triggered. The only thing I've seen working is a bit of value
propagation for loads, which rewrites some common ref,reg moves into
reg,reg moves.
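
As a rough illustration of that load propagation, here is a toy sketch
in Python. It is not the daopt386/csopt386 code: the tuple format, the
REG_OPS set and propagate_loads are assumptions made up for this
example, and sub-register aliases (%eax vs. %rax) are not modelled.
It remembers one register per memory reference and is deliberately
conservative about anything that might invalidate it.

```python
REG_OPS = {"mov", "add", "sub"}  # ops assumed to write only their dst


def is_reg(operand):
    return operand.startswith("%")


def is_mem(operand):
    return not is_reg(operand) and not operand.startswith("$")


def propagate_loads(block):
    """block: list of (op, src, dst) tuples in AT&T operand order."""
    holds = {}  # memory reference -> register known to hold its value
    out = []

    def kill(reg):
        # Forget refs cached in `reg` and refs whose address uses `reg`.
        for ref in [r for r, v in holds.items() if v == reg or reg in r]:
            del holds[ref]

    for op, src, dst in block:
        if op == "mov" and is_mem(src) and is_reg(dst):
            known = holds.get(src)
            if known is not None and known != dst:
                out.append(("mov", known, dst))  # ref,reg -> reg,reg
            else:
                out.append((op, src, dst))
            kill(dst)
            if src not in holds and dst not in src:
                holds[src] = dst
        else:
            out.append((op, src, dst))
            if op in REG_OPS and is_reg(dst):
                kill(dst)
            else:
                holds.clear()  # stores, calls, unknown ops: assume the worst

    return out


block = [
    ("mov", "8(%rbp)", "%rax"),
    ("mov", "8(%rbp)", "%rdx"),   # becomes mov %rax, %rdx
    ("add", "$1", "%rax"),        # %rax no longer mirrors 8(%rbp)
    ("mov", "8(%rbp)", "%rcx"),   # stays a load in this simple model
]
for insn in propagate_loads(block):
    print(insn)
```

A fuller version would track every register holding a value (here
%rdx still holds it after the add), which is where the real data flow
analysis comes in.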

Anyway, I'm already working on a rewrite, but it's currently unclean
and more like educational prototype work ;).
Instead of the node-based approach, I've started a rewrite around
something like basic blocks, for block-local and block-global (over
paths) analysis.
For example, the label optimizations (inverting conditions, removing
trivial jmps, rewriting jump chains) are implemented now and do mostly
the same as the generic optimizations implemented in aopt*.pas, with
the difference that they run over blocks.
(The second difference is that the new approach removes more labels.
It seems like a ...ref^.symbol.decrefs is missing somewhere in the
generic part.)
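
To make the block-based idea concrete, here is a small Python sketch
of two of the label optimizations (rewriting jump chains and removing
trivial jmps). It is only an illustration under assumptions: the
(label, body, jmp_target) block format and both function names are
invented here and are not taken from aopt*.pas or from my prototype.
Inverting conditions and removing unreferenced labels are left out.

```python
def resolve(target, jump_only):
    """Follow a chain of blocks that consist of nothing but a jmp."""
    seen = set()
    while target in jump_only and target not in seen:
        seen.add(target)  # guards against jmp cycles
        target = jump_only[target]
    return target


def optimize_jumps(blocks):
    """blocks: list of (label, body, jmp_target) in layout order.
    jmp_target is None when the block falls through to the next one."""
    jump_only = {label: tgt for label, body, tgt in blocks
                 if not body and tgt is not None}
    out = []
    for i, (label, body, tgt) in enumerate(blocks):
        if tgt is not None:
            tgt = resolve(tgt, jump_only)       # rewrite jump chains
            nxt = blocks[i + 1][0] if i + 1 < len(blocks) else None
            if tgt == nxt:
                tgt = None                      # trivial jmp to next block
        out.append((label, body, tgt))
    return out


blocks = [
    ("L0", ["mov $1, %eax"], "L1"),
    ("L1", [], "L2"),
    ("L2", [], "L3"),
    ("L3", ["ret"], None),
]
for blk in optimize_jumps(blocks):
    print(blk)
# L0 now jumps straight to L3, and the jmp at the end of L2 is gone.
```

After this pass L1 and L2 are no longer referenced, so a later step
could drop them; that is the point where reference counts on labels
have to be kept accurate.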

For this "educational prototype" I need to rewrite parts of the
analysis, which is a good way to find problems and to remove some of
the minor ones too. The second benefit is that I learn a lot about the
FPC internals along the way (all units with cg*/ra*/aasm*/cpu*) and can
take apart some things that are mixed together in the node-based
optimization.

In short (again), I'm looking into it, and I'm both interested in that
code and willing to take responsibility for anything I write. But, as
always, it will take some time to fix everything up to the cse.
Another question: is there any documentation (other than the source)
about the generic target/i386 optimizer parts, assumptions about the
code generation, etc.? It may be a good idea to write some
documentation in parallel (basics such as: optimization is performed
on a per-proc basis without assembler blocks, some notes about
markers/reg allocation info/..., specific i386/x86 assumptions about
register order and mapping, and so on), which could help fix problems
later on and speed up the learning phase for any interested developer.

>> This is currently not complete, i didn't merge it back since there is
>> still testing and review todo. But from the current point of view it
>> should be rather simple to to merge the data flow analysis and the
>> asmcse parts. The peephole part is another point, that should be pure
>> cpu/target specific.
>
> I guess there are some common ones there as well, no? (especially regarding
> mov's and jump chaining).
>
>> Like I stated above, the current approach needs further testing (fpc
>> testsuite returns same result for patched and unpatched compiler with
>> "make full", but there may be things missing) and review from others
>> (hopefully with more knowledge about the x86_64 code generator part
>> and potential optimizations). Thats why I'm attaching my current
>> approach here.
>
> At first look, I think it's ok except for the indentation. Please use the
> same style as the original code (e.g., indenting "begin" after "if ...").
> See http://wiki.freepascal.org/Coding_style for some more info.
>
>> TODO: There is potential for further optimizations, especially for x87
>> and 128bit Media/XOP/FM4.. but the code needs some cleanups before and
>> possibly some bug fixes
>>
>> I'm open for any feedback, bugfixes and so on (and if it should be
>> merged with i386 parts)
>
> Merging with i386 is fine! The whole assembler optimiser infrastructure is
> also quite independent from the rest of the compiler, which makes it a very
> good way to get started (it's how I rolled into FPC development in 1997,
> which is in large part why the code's organisation is so bad :)

Hehe, yes. At least it is more understandable than other parts, and I
like the concept of optimizations on "a simple, robust list" of
instructions and information per proc.
By the way, I merged the *opta64 code back and introduced some new
constants for simpler merging. The merge was like a second rewrite of
the parts that I changed for the x64 port; e.g., most of the RS_*
register renaming was dropped because of the mappings. I fixed the
coding style while merging.
The problem is, as I said above, that the cse part is not really
working. There are still two open questions about the deallocation of
registers for procs and register rdx in special cases such as the
128-bit result of imul. And the peephole code is mostly unreadable due
to massive {$ifdef} usage in the merge. There are common parts, yes,
but the big problem is the non-common parts such as the amd64 imul
parts.
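
For reference, the rdx case comes from the one-operand form of imul,
which writes its 128-bit product to rdx:rax. The following Python
sketch (imul_one_operand and s64 are made-up helper names, not
compiler code) models that instruction and shows that rdx is
overwritten even when the product would fit into 64 bits, so the
optimizer must not assume that rdx keeps its old value across it.

```python
MASK64 = (1 << 64) - 1


def s64(x: int) -> int:
    """Interpret a 64-bit pattern as a signed two's-complement value."""
    return x - (1 << 64) if x & (1 << 63) else x


def imul_one_operand(rax: int, src: int):
    """Model of one-operand 'imul src' with 64-bit operands.
    Returns the new (rdx, rax) pair holding the signed 128-bit product."""
    prod = (s64(rax) * s64(src)) & ((1 << 128) - 1)
    return prod >> 64, prod & MASK64


rdx, rax = imul_one_operand(0xFFFFFFFFFFFFFFFE, 3)   # -2 * 3 = -6
print(hex(rdx), hex(rax))  # 0xffffffffffffffff 0xfffffffffffffffa

rdx, rax = imul_one_operand(2, 3)
print(hex(rdx), hex(rax))  # 0x0 0x6
```

The two- and three-operand forms of imul only write their destination
register, which is part of why the amd64 imul handling does not share
well with the other cases.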

Having said that, I'll open a bug report for the patch proposal and
for discussion of which parts should be redone. At least the peephole
code is usable, introduces minor improvements and provides a basis for
x64 sequence alternatives. (asmcse is deactivated and could be removed
completely for x64; the peephole part could also be deactivated for
now.)

Thanks for the feedback and the information about the da/cse parts.

 Bye,
  Matthias Karbe