[fpc-devel] Fwd: Re: ARM/AARCH64 work

Thu Aug 6 19:04:25 CEST 2020

(Accidentally sent to Florian privately instead of to the mailing list)

> > Obviously I'll submit the changes as patches so they can be properly
> > reviewed and tested, but does this sound like a good idea?
>
> Yes, you might have seen, that I started already with this some time ago.

Well, that's good - means I can get straight to work and save you some time, hopefully!  AARCH64 
doesn't have that much that needs changing in that regard, although ARM needs a lot of clean-up that 
I'm getting underway with.  There's potential to merge some ARM/AARCH64 optimisations too, which 
differ only in the register names (e.g. the frame pointer is x29 instead of r11).

I've done a little bit of refactoring of the individual optimisations to allow for small speed-ups.
One of the main ones is an optimisation that converts 4 instructions into one.  In the trunk, it
calls GetNextInstruction three times, along with SkipEntryExitMarker a couple of times, and only
then checks whether the individual opcodes and their operands permit the optimisation.  I've changed
this around so that the first instruction is evaluated first, and only then is GetNextInstruction
called so the next instruction can be checked.  GetNextInstruction is a relatively expensive call,
and it's more likely than not that the criteria for the optimisation won't be met (e.g. it comes
across a different instruction), so the sooner you can detect this and drop out, the faster the
Peephole Optimizer will run.
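
To illustrate the reordering described above, here is a minimal toy model in Python, not the actual
FPC code.  Everything in it is hypothetical: the Stream class, its next_after method (standing in
for GetNextInstruction), and the four-opcode PATTERN are invented for the sketch.  It only shows
why checking each instruction before fetching the next one saves fetches when the match fails early.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instr:
    opcode: str


class Stream:
    """Toy instruction list; next_after() stands in for GetNextInstruction."""

    def __init__(self, instrs: List[Instr]) -> None:
        self.instrs = instrs
        self.fetches = 0  # counts the "expensive" calls

    def next_after(self, index: int) -> Optional[int]:
        self.fetches += 1
        nxt = index + 1
        return nxt if nxt < len(self.instrs) else None


# Hypothetical four-instruction sequence that the optimisation would merge.
PATTERN = ("mov", "add", "sub", "str")


def match_eager(stream: Stream, start: int) -> bool:
    """Fetch all three following instructions first, then check everything."""
    idx = [start]
    for _ in range(len(PATTERN) - 1):
        nxt = stream.next_after(idx[-1])
        if nxt is None:
            return False
        idx.append(nxt)
    return all(stream.instrs[i].opcode == op for i, op in zip(idx, PATTERN))


def match_lazy(stream: Stream, start: int) -> bool:
    """Check the current instruction before paying for the next fetch."""
    cur = start
    for pos, op in enumerate(PATTERN):
        if stream.instrs[cur].opcode != op:
            return False  # drop out as soon as the pattern cannot match
        if pos + 1 < len(PATTERN):
            nxt = stream.next_after(cur)
            if nxt is None:
                return False
            cur = nxt
    return True
```

With a sequence whose second instruction does not fit the pattern, match_eager still performs all
three fetches before rejecting it, whereas match_lazy rejects it after one.  When the pattern does
match, both versions perform the same three fetches, so the reordering costs nothing in that case.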

> 
> > P.S. While I haven't been asked to improve aarch64-linux specifically, if
> > I'm understanding things correctly, there should be very few differences
> > with the actual target platform in regards to calling conventions, for
> > example.
>
> You mean with regard to different aarch64 platforms?
> 

Yes, apologies.  I mean with regard to different aarch64 platforms.  I've got the basics of the
calling convention down, like passing the first integral parameter through r0/x0 and incrementing
the register number for each one after that.
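
As a rough sketch of that register assignment (in Python, purely for illustration), the standard
ARM procedure call standards pass simple integral parameters in r0-r3 on 32-bit ARM and in x0-x7 on
AArch64, with the rest going on the stack.  The function name below is hypothetical, and the sketch
deliberately ignores 64-bit values on 32-bit ARM, floating-point and record parameters, and any
platform-specific deviations.

```python
from typing import List

# Number of integer argument registers per architecture (AAPCS / AAPCS64).
ARG_REGISTERS = {"arm": ("r", 4), "aarch64": ("x", 8)}


def assign_int_arg_registers(num_args: int, arch: str) -> List[str]:
    """Return where each simple integral parameter goes, in order."""
    prefix, count = ARG_REGISTERS[arch]
    return [f"{prefix}{i}" if i < count else "stack" for i in range(num_args)]
```

For example, five word-sized parameters on 32-bit ARM would land in r0, r1, r2, r3 and then the
stack, while on AArch64 all five would still fit in x0 to x4.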

In a funny way, my x86-64 machine breaking is proving to be a blessing in disguise, since I'm now 
exploring a completely different architecture and learning its assembly language.  Some potential 
speed-ups, like utilising NEON and other SIMD instruction sets, fall under the more general 
philosophy of vectorization and would be much easier to perform once all the nuances with intrinsics 
are resolved, because vectorization is much easier to do with nodes than with, say, trying to take a 
bunch of assembler instructions and attempting to vectorize those.

Gareth aka. Kit