[fpc-devel] Optimization theory
J. Gareth Moreton
gareth at moreton-family.com
Sun Jun 24 23:28:52 CEST 2018
I figured this is as good a place as any to talk about my attempt at
improving the speed of the stage 1 optimizer in issue #33855. On paper, a
binary search is the fastest method to locate a data pointer associated
with a key (in this case, an opcode). Of course, in practice that isn't
always the case due to overhead and the penalties (or lack thereof) associated
with conditional jumps.
The main problem with my binary search implementation is the extra
complexity it puts into the code, and as Markov mentioned, the fact it's
limited to just x86 rather than being more general-purpose. The ideal
solution is to tie the binary search and the associated key/data map into
the compiler, specifically the case node generation, but there are some
theoretical issues that need to be resolved:
- What is the minimum number of case nodes where a binary search should be
used over a simpler linear search? The maximum number of iterations of a
binary search loop is equal to ceil(log2(n)), where n is the size of the
list. Further optimisations with the likes of CMOVcc and loop unrolling
can eliminate jumps entirely, but the remaining overhead means it might
still be faster to simply perform a linear search, especially when the
most frequent entry is at or near the top of the list.
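To make the trade-off concrete, here is a minimal sketch in C (the opcode values and handler names are made up for illustration, not taken from the compiler sources): a binary search over a sorted opcode-to-handler table, whose loop body runs at most ceil(log2(n)) times for a table of n entries, versus the n/2 comparisons a linear scan averages.

```c
#include <stddef.h>

typedef struct { int opcode; const char *handler; } MapEntry;

/* The table must be sorted by opcode for the binary search to be valid.
 * Values and names here are invented for the example. */
static const MapEntry map[] = {
    {10, "OptPass1MOV"}, {20, "OptPass1ADD"},
    {30, "OptPass1LEA"}, {40, "OptPass1TEST"},
};
static const size_t map_len = sizeof map / sizeof map[0];

const char *binary_lookup(int opcode) {
    size_t lo = 0, hi = map_len;          /* half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;  /* avoids overflow of lo + hi */
        if (map[mid].opcode == opcode)
            return map[mid].handler;
        if (map[mid].opcode < opcode)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;                          /* opcode has no handler */
}
```

For four entries the loop runs at most twice, so a linear scan that finds the hottest opcode in its first one or two probes is entirely competitive; the binary search only pulls ahead as n grows.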
- How can it be detected that each case node has an identical setup?
With the optimizer, the current opcode is passed through a case block, and
if a match is found, it is passed to an object method of the format
"function(var p: tai): Boolean". The fact that the function calls are
identical means that some optimisation can be performed with common
sub-expressions, specifically passing the address of p into RDX or RSI,
depending on whether the target OS is Windows or Linux on x86-64.
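The shape I have in mind can be sketched in C (types and names are invented stand-ins, not the compiler's own): when every case arm is a call through the same signature, the arms differ only in the call target, so the argument setup (p into RDX on Win64, RSI on SysV x86-64) is a common subexpression that can be hoisted out of the case block.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct Tai { int opcode; } Tai;   /* stand-in for tai */
typedef bool (*OptHandler)(Tai **p);      /* "var p: tai" analogue */

static bool OptMov(Tai **p) { (void)p; return true;  }
static bool OptNop(Tai **p) { (void)p; return false; }

typedef struct { int opcode; OptHandler fn; } Entry;
static const Entry table[] = { {1, OptMov}, {2, OptNop} };

bool dispatch(Tai **p) {
    int opcode = (*p)->opcode;            /* common setup, done once */
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].opcode == opcode)
            return table[i].fn(p);        /* only the target differs */
    return false;                         /* no handler matched */
}
```

Because every `fn` takes the same `Tai **`, the load of the argument register happens once in `dispatch` rather than once per arm; that is the optimisation the identical signatures make possible.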
- What happens if there is an imperfection in the source code? A vague
question, I know, but in this case, some of the functions are of the format
"function(const p: tai): Boolean", which require a slightly different
set-up with the parameters. This is not an error, and the functions are
sometimes set up this way if p never changes (which is the case if this
particular instruction is never deleted by the optimizer), but it breaks
the homogeneity of the case nodes and hence would make a highly-optimized
search much harder. Should the programmer be warned or given a hint on
how to improve their code, or is it up to us to be smart enough to spot
"const p: tai" instead of "var p: tai"?
- How easy would it be to make such a binary search more general-purpose?
My particular code simply uses a map that pairs an opcode to a pointer in a
many-to-one relationship. In this case, the algorithm doesn't care what
the pointer is - it could easily be a label address, an object or a generic
block of memory on the heap instead of a procedural address. Granted, as
mentioned, the situation where each case node is a function call with an
identical setup is a special case because jumps can be eliminated entirely.
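The general-purpose form is even simpler, since the payload becomes opaque. A sketch (names invented for illustration): the map pairs an integer key with a `void *` in a many-to-one relationship, and the search never inspects what the pointer refers to, so it could be a procedure address, a label, an object, or a heap block alike.

```c
#include <stddef.h>

typedef struct { int key; void *data; } KeyDataPair;

/* Generic lookup: the map must be sorted by key, but the data pointers
 * are opaque, and several keys may share one pointer (many-to-one). */
void *map_lookup(const KeyDataPair *map, size_t n, int key) {
    size_t lo = 0, hi = n;                /* half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (map[mid].key == key) return map[mid].data;
        if (map[mid].key < key)  lo = mid + 1;
        else                     hi = mid;
    }
    return NULL;                          /* key not present */
}
```

The caller decides what the returned pointer means; only the function-call special case, where every target shares one setup, admits the extra jump elimination described above.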
I think that's everything anyway. I admit my motivations have slightly
negative roots because I remember when the main advertising point of Turbo
Pascal 7.0 was the speed of its compiler, and I'm harking back to those
days. I also remember the small size of the binaries it produced compared
to, say, Microsoft Visual C++. Of course, size isn't as big a concern as it was
back in the days when you were limited to a 1.44 MB floppy disk, but if
there's a space where I can eliminate or restructure inefficient or
superfluous code, I'll take it!
Could someone fill me in on how case nodes currently work? I know there's
a kind of linear search where the ordinal variable has a value subtracted
that corresponds to the next case node, and something called a "jump
table", but I'm not entirely sure how that works yet.
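My rough understanding of the general technique, which may well differ from what the compiler actually emits, is that a jump table subtracts the lowest case label from the selector, bounds-checks the result, and indexes a table of targets, giving O(1) dispatch for dense label ranges at the cost of table space. Sketched in C with function pointers standing in for branch targets:

```c
#include <stddef.h>

typedef int (*CaseTarget)(void);
static int case_a(void)    { return 10; }
static int case_b(void)    { return 20; }
static int case_else(void) { return -1; }

/* Dense labels 100..102; the gap at 101 routes to the else part. */
static const CaseTarget targets[] = { case_a, case_else, case_b };

int run_case(int selector) {
    /* Subtract the lowest label; the unsigned cast folds the "below
     * range" and "above range" checks into one comparison. */
    unsigned idx = (unsigned)(selector - 100);
    if (idx >= sizeof targets / sizeof targets[0])
        return case_else();               /* out of range */
    return targets[idx]();                /* one indirect jump */
}
```

The subtract-and-compare linear search mentioned above would instead test each label (or label range) in sequence, which is cheaper for a handful of sparse labels but degrades linearly as the case block grows.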
Gareth aka. Kit