[fpc-devel] Loop unrolling question

J. Gareth Moreton gareth at moreton-family.com
Sat Aug 4 01:25:14 CEST 2018


(You replied to me directly, Kirinn, not 
to the mail list)

Thanks Kirinn. The loop count will likely 
be only 6 iterations, at least in my test 
case which is stage 1 of the peephole 
optimizer in the i386 and x86-64 build of 
the compiler, consisting of (currently) 36 
branches that call identically-structured 
functions.

In the plan I have currently, I won't need 
an extra comparison because it would be 
something like this:

cmp (tableaddr), %value
je @MatchFound
cmovle %midindex, %loindex
cmovge %midindex, %hiindex

I can see it maybe being a waste on the 
last iteration, maybe the second-to-last 
too. It would require some empirical tests 
though.

Gareth aka. Kit

On Fri 03/08/18 23:07 , "Kirinn" 
kirinn at mooncore.eu sent:
> Each comparison and conditional jump 
takes 2 instructions immediately,
> which is reduced to 1 instruction 
through macro-op fusion, which means a
> single loop iteration becomes maybe 12% 
more costly (assuming simple
> instructions)... but every processor 
varies, of course.
> 
> Forward jumps are normally predicted to 
not be taken, so there is no
> further execution penalty if it's not an 
exact match. 
> 
> The additional instructions do take up a 
little bit of space in the
> instruction cache, and the conditional 
jumps may mess up the existing
> branch prediction table, causing a 
presumably inconsequential penalty
> slightly afterward.
> 
> The penalty for a branch misprediction 
is a pipeline flush. According to
> Agner Fog's research 
(https://www.agner.org/optimize/microarchi
tecture.pdf
> [1]), the penalty varies by processor. 
Pentium MMX 4-5 cycles, PPro/P2/P3
> 10-20 cycles, P4 25-100+ cycles, Core2 
about 15 cycles, Nehalem 17+ cycles,
> Sandy/Ivy Bridge 15+ cycles, Haswell+ 
15-20 cycles, AMD Bulldozer etc ~20
> cycles, Ryzen ~18 cycles. Evidently, 
processors are designed to execute 2-5
> instructions per cycle, complicated a 
lot by micro-op splitting and
> pipelining, so we might guesstimate a 
misprediction penalty to be worth
> around 30-100 instructions? 
> 
> Assuming case blocks tend to be 
relatively small, running through without
> the conditional checks is probably 
usually faster. But the only way to be
> sure is to test on a bunch of 
processors... 
> 
> ~Kirinn
> 
> On 08/03/2018 08:57 PM, J. Gareth 
Moreton wrote:
> Hi everyone,
> 
> So I'm pondering about attempting to 
implement the binary search system I
> devised for case blocks.  To find the 
correct case label, you need to
> iterate a search loop a fixed maximum of 
ceil(log2(n)) times, where n is
> the number of individual case values 
(excluding "else").  Given that this
> is quite a low number and I managed to 
get the loop itself down to just 8
> instructions (not including the 
comparison and conditional jump at the
> end), this could be unwrapped for low 
values of ceil(log2(n))... possibly
> up to 7 or 8, I'd say.  However, if an 
exact match is found early, how
> serious is the penalty of having a 
conditional jump forward on each
> unrolled iteration for x86-64, whether 
it's taken or not? (This would be
> the equivalent of a Break command)
> 
> Gareth aka. Kit
> 
> 
__________________________________________
_____ 
> fpc-devel maillist - fpc-
devel at lists.freepascal.org 
> http://lists.freepascal.org/cgi-
bin/mailman/listinfo/fpc-devel 
> 
> 
> 
> Links:
> ------
> [1] 
https://www.agner.org/optimize/microarchit
ecture.pdf
> 
> 




More information about the fpc-devel mailing list