[fpc-devel] using sse2 packed doubles

Sun Oct 8 15:40:04 CEST 2006

On Sun, 8 Oct 2006, Vincent Snijders wrote:

> > You are right. How about doing it in blocks of 8x8 pixels? The high
> > iteration loops are concentrated close to the borders of
> > the set, so for most blocks the iteration can then be ended early.
> 
> For starters I was thinking about blocks of 1x2 pixels ;-). The current
> hardware doesn't allow any more parallelism anyway. Or am I making a mistake
> in my thinking now?

Yes. Let's say a pixel is calculated by a*b*c*d. If you calculate the
expression serially, each instruction depends on the result of the previous
one. This is bad for pipelining; the floating point pipeline can only do 2
flops/cycle if the results do not depend on each other. If you first
multiply a with b for all pixels, then with c for all pixels, then with d,
no multiplication has to wait for the result of the one issued just before
it, and you get much higher throughput. At work I have to deal with this a
lot to make applications perform as fast as possible. My record with the HPL
benchmark is 92.9% of the theoretical limit of an Opteron core.
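
To make the a*b*c*d example concrete, here is a minimal sketch in C of both
orderings. The function names, the array layout and the block size of 8 are
assumptions made for illustration only; nothing here is taken from the
Mandelbrot code under discussion. Note that an out-of-order CPU may already
overlap some of the independent pixels in the serial version, so the actual
gain has to be measured.

```c
#include <stddef.h>

#define BLOCK 8 /* hypothetical block size, chosen only for illustration */

/* Per-pixel ordering: within one pixel, each multiply needs the result
   of the previous multiply, so the three multiplies form a chain. */
void product_serial(const double *a, const double *b, const double *c,
                    const double *d, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i] * c[i] * d[i];
}

/* Staged ordering: do one multiply for every pixel in a block before
   moving on to the next factor. Consecutive multiplies then belong to
   different pixels and do not depend on each other. */
void product_staged(const double *a, const double *b, const double *c,
                    const double *d, double *out, size_t n)
{
    for (size_t base = 0; base < n; base += BLOCK) {
        size_t len = (n - base < BLOCK) ? (n - base) : BLOCK;
        double t[BLOCK];

        for (size_t i = 0; i < len; i++)   /* stage 1: a*b */
            t[i] = a[base + i] * b[base + i];
        for (size_t i = 0; i < len; i++)   /* stage 2: (a*b)*c */
            t[i] *= c[base + i];
        for (size_t i = 0; i < len; i++)   /* stage 3: ((a*b)*c)*d */
            out[base + i] = t[i] * d[base + i];
    }
}
```

Both functions multiply in the same order per pixel, so they should produce
the same values; only the order in which the instructions are issued differs.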

Intel's latest CPUs can do 4 flops/cycle, which can only be realistically
achieved when doing this kind of parallel processing.

Never mind, let's do what is easiest first :)

Daniël