[fpc-devel] using sse2 packed doubles

Daniël Mantione daniel.mantione at freepascal.org
Sun Oct 8 15:40:04 CEST 2006



On Sun, 8 Oct 2006, Vincent Snijders wrote:

> > You are right. How about doing it in blocks of 8x8 pixels? The high
> > iteration loops are concentrated close to the borders of
> > the set, so for most blocks the iteration can then be ended early.
>
> For starters I was thinking about blocks of 1x2 pixels ;-). The current
> hardware doesn't allow any more parallelism anyway. Or am I making a mistake
> in my thinking now?

Yes. Let's say a pixel is calculated by a*b*c*d. If you calculate the
expression serially, each instruction depends on the result of the previous
one. This is bad for pipelining; the floating point pipeline can do 2
flops/cycle only if the results do not depend on each other. If you first
multiply a with b for all pixels, then with c for all pixels, then with d,
the result of a multiplication does not depend on the previous one, and
you get much higher throughput. At work I have to deal with this a lot
to make applications perform as fast as possible. My record with the HPL
benchmark is 92.9% of the theoretical limit of an Opteron core.
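
To make the two orderings concrete, here is a minimal C sketch of the
a*b*c*d example. The function names and the array layout are hypothetical,
chosen only for illustration; this is not code from the benchmark under
discussion. Both functions multiply in the same order per pixel, so they
return identical results; only the loop structure differs.

```c
#include <assert.h>
#include <stddef.h>

/* Pixel-major order: the whole chain a*b*c*d is finished for one pixel
   before the next pixel starts, so each multiply waits on the previous one. */
void product_per_pixel(const double *a, const double *b, const double *c,
                       const double *d, double *out, size_t n)
{
    assert(a && b && c && d && out);
    for (size_t i = 0; i < n; i++) {
        double t = a[i];
        t *= b[i];
        t *= c[i];   /* needs the result of the multiply above */
        t *= d[i];   /* needs the result of the multiply above */
        out[i] = t;
    }
}

/* Stage-major order: one multiply is applied to every pixel before the
   next stage begins. Within a stage no multiply depends on another. */
void product_per_stage(const double *a, const double *b, const double *c,
                       const double *d, double *out, size_t n)
{
    assert(a && b && c && d && out);
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i];
    for (size_t i = 0; i < n; i++)
        out[i] *= c[i];
    for (size_t i = 0; i < n; i++)
        out[i] *= d[i];
}
```

For a chain this short, an out-of-order CPU or an optimizing compiler may
already overlap neighbouring pixels in the first version, so the difference
is likely to matter mostly for long chains such as the Mandelbrot iteration.
The stage-major version also makes extra passes over memory, which is one
reason to work on small blocks that stay in registers or cache.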

Intel's latest CPUs can do 4 flops/cycle, which can only be realistically
achieved when doing this kind of parallel processing.

Never mind, let's do what is easiest first :)

Daniël

