[fpc-devel] Loop unrolling question
J. Gareth Moreton
gareth at moreton-family.com
Sat Aug 4 01:25:14 CEST 2018
(You replied to me directly, Kirinn, not
to the mail list)
Thanks Kirinn. The loop count will likely
be only 6 iterations, at least in my test
case which is stage 1 of the peephole
optimizer in the i386 and x86-64 build of
the compiler, consisting of (currently) 36
branches that call identically-structured
functions.
In the plan I have currently, I won't need
an extra comparison because it would be
something like this:
cmp (tableaddr), %value
je @MatchFound
cmovle %midindex, %loindex
cmovge %midindex, %hiindex
I can see it maybe being a waste on the
last iteration, maybe the second-to-last
too. It would require some empirical tests
though.
Gareth aka. Kit
On Fri 03/08/18 23:07 , "Kirinn"
kirinn at mooncore.eu sent:
> Each comparison and conditional jump
takes 2 instructions immediately,
> which is reduced to 1 instruction
through macro-op fusion, which means a
> single loop iteration becomes maybe 12%
more costly (assuming simple
> instructions)... but every processor
varies, of course.
>
> Forward jumps are normally predicted to
not be taken, so there is no
> further execution penalty if it's not an
exact match.
>
> The additional instructions do take up a
little bit of space in the
> instruction cache, and the conditional
jumps may mess up the existing
> branch prediction table, causing a
presumably inconsequential penalty
> slightly afterward.
>
> The penalty for a branch misprediction
is a pipeline flush. According to
> Agner Fog's research
(https://www.agner.org/optimize/microarchi
tecture.pdf
> [1]), the penalty varies by processor.
Pentium MMX 4-5 cycles, PPro/P2/P3
> 10-20 cycles, P4 25-100+ cycles, Core2
about 15 cycles, Nehalem 17+ cycles,
> Sandy/Ivy Bridge 15+ cycles, Haswell+
15-20 cycles, AMD Bulldozer etc ~20
> cycles, Ryzen ~18 cycles. Evidently,
processors are designed to execute 2-5
> instructions per cycle, complicated a
lot by micro-op splitting and
> pipelining, so we might guesstimate a
misprediction penalty to be worth
> around 30-100 instructions?
>
> Assuming case blocks tend to be
relatively small, running through without
> the conditional checks is probably
usually faster. But the only way to be
> sure is to test on a bunch of
processors...
>
> ~Kirinn
>
> On 08/03/2018 08:57 PM, J. Gareth
Moreton wrote:
> Hi everyone,
>
> So I'm pondering about attempting to
implement the binary search system I
> devised for case blocks. To find the
correct case label, you need to
> iterate a search loop a fixed maximum of
ceil(log2(n)) times, where n is
> the number of individual case values
(excluding "else"). Given that this
> is quite a low number and I managed to
get the loop itself down to just 8
> instructions (not including the
comparison and conditional jump at the
> end), this could be unwrapped for low
values of ceil(log2(n))... possibly
> up to 7 or 8, I'd say. However, if an
exact match is found early, how
> serious is the penalty of having a
conditional jump forward on each
> unrolled iteration for x86-64, whether
it's taken or not? (This would be
> the equivalent of a Break command)
>
> Gareth aka. Kit
>
>
__________________________________________
_____
> fpc-devel maillist - fpc-
devel at lists.freepascal.org
> http://lists.freepascal.org/cgi-
bin/mailman/listinfo/fpc-devel
>
>
>
> Links:
> ------
> [1]
https://www.agner.org/optimize/microarchit
ecture.pdf
>
>
More information about the fpc-devel
mailing list