[fpc-devel] using sse2 packed doubles
Daniël Mantione
daniel.mantione at freepascal.org
Sun Oct 8 15:40:04 CEST 2006
On Sun, 8 Oct 2006, Vincent Snijders wrote:
> > You are right. How about doing it in blocks of 8x8 pixels? The high
> > iteration loops are concentrated close to the borders of
> > the set, so for most blocks the iteration can then be ended early.
>
> For starters I was thinking about blocks of 1x2 pixels ;-). The current
> hardware doesn't allow any more parallelism anyway. Or am I making a mistake
> in my thinking now?
Yes. Let's say a pixel is calculated by a*b*c*d. If you calculate the
expression serially, each instruction depends on the result of the previous
one. This is bad for pipelining; the floating point pipeline can only do 2
flops/cycle if the results do not depend on each other. If you first
multiply a with b for all pixels, then with c for all pixels, then with d,
the result of a multiplication does not depend on the previous one, and
you get much higher throughput. At work I have to deal with this a lot
to make applications run as fast as possible. My record with the HPL
benchmark is 92.9% of the theoretical limit of an Opteron core.
Intel's latest CPUs can do 4 flops/cycle, which can only be realistically
achieved when doing this kind of parallel processing.
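
To make the a*b*c*d example concrete, here is a minimal sketch in C of the
two orderings. The function names and the array layout (one array per
factor, one element per pixel) are my own assumptions for illustration, not
code from the benchmark being discussed. Note that an out-of-order CPU or an
optimizing compiler may overlap iterations of the first version on its own;
the sketch only shows the difference in dependency structure.

```c
#include <stddef.h>

/* Per-pixel order: every multiply waits for the previous multiply of the
   same pixel, so each pixel is one dependent chain of three operations. */
void product_per_pixel(const double *a, const double *b,
                       const double *c, const double *d,
                       double *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double r = a[i];
        r *= b[i];   /* needs r */
        r *= c[i];   /* needs the result of the line above */
        r *= d[i];   /* needs the result of the line above */
        out[i] = r;
    }
}

/* Staged order: one multiplication step for all pixels, then the next
   step. Within a pass, no multiply depends on another multiply of that
   pass, so the pipeline can keep several in flight. */
void product_staged(const double *a, const double *b,
                    const double *c, const double *d,
                    double *out, size_t n)
{
    for (size_t i = 0; i < n; i++) out[i] = a[i] * b[i];
    for (size_t i = 0; i < n; i++) out[i] *= c[i];
    for (size_t i = 0; i < n; i++) out[i] *= d[i];
}
```

Both functions multiply in the same order for each pixel, so they compute
the same values; only the order across pixels differs. The staged form is
also the shape that maps directly onto packed doubles, where one SSE2
instruction handles two pixels at a time.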
Never mind, let's do what is easiest first :)
Daniël