[fpc-devel] using sse2 packed doubles
Daniël Mantione
daniel.mantione at freepascal.org
Sun Oct 8 15:40:04 CEST 2006
On Sun, 8 Oct 2006, Vincent Snijders wrote:
> > You are right. How about doing it in blocks of 8x8 pixels? The high
> > iteration loops are concentrated close to the borders of
> > the set, so for most blocks the iteration can then be ended early.
>
> For starters I was thinking about blocks of 1x2 pixels ;-). The current
> hardware doesn't allow any more parallelism anyway. Or am I making a mistake
> in my thinking now?
Yes. Let's say a pixel is calculated by a*b*c*d. If you calculate the
expression serially, each instruction depends on the result of the previous
one. This is bad for pipelining; the floating point pipeline can only do 2
flops/cycle if the results do not depend on each other. If you first
multiply a with b for all pixels, then with c for all pixels, then with d,
the result of a multiplication does not depend on the previous one, and
you get much higher throughput. At work I have to deal with this a lot
to make applications run as fast as possible. My record with the HPL
benchmark is 92.9% of the theoretical limit of an Opteron core.
Intel's latest CPUs can do 4 flops/cycle, which can realistically only be
achieved with this kind of parallel processing.
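To make the a*b*c*d example concrete, here is a minimal sketch in C of the two orderings. The function names and the array layout (one array per factor, one element per pixel) are assumptions made for illustration only; they are not taken from the actual Mandelbrot code under discussion.

```c
#include <stddef.h>

/* Serial form: for one pixel, every multiply has to wait for the
 * result of the multiply before it (a dependent chain). */
void product_per_pixel(const double *a, const double *b,
                       const double *c, const double *d,
                       double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i] * c[i] * d[i];
}

/* Stage-wise form: do one multiply for all pixels of the block,
 * then the next multiply for all pixels, and so on. Inside each
 * loop no multiply depends on another, so they can be pipelined
 * or packed two at a time into SSE2 registers. */
void product_per_stage(const double *a, const double *b,
                       const double *c, const double *d,
                       double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i];
    for (size_t i = 0; i < n; i++)
        out[i] *= c[i];
    for (size_t i = 0; i < n; i++)
        out[i] *= d[i];
}
```

Both functions multiply in the same order, ((a*b)*c)*d, so only the schedule changes, not the arithmetic. An optimizing compiler or an out-of-order core may overlap iterations of the first loop on its own, so whether the second form is faster should be measured on the target CPU and not assumed.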
Never mind, let's do what is easiest first :)
Daniël