[fpc-devel] Parallel Computing

Mon Nov 3 16:31:59 CET 2008

On Mon, 3 Nov 2008, Florian Klaempfl wrote:

> Well, those tests don't even take thread start-up time into account :)

Threads are started at application startup, in fact my command lines were:

[cvsupport@node001 ~]$ OMP_NUM_THREADS=1 ./stream_omp
[cvsupport@node001 ~]$ OMP_NUM_THREADS=8 ./stream_omp

Threads that are not needed simply stay blocked until an OpenMP loop
activates them.
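
To illustrate the kind of loop meant here, a minimal sketch of a
STREAM-style triad kernel in C follows. It is not taken from the
stream_omp source; the function name triad and its signature are
hypothetical. How idle threads wait between loops is up to the OpenMP
runtime in use.

```c
#include <stddef.h>

/* Hypothetical STREAM-style triad kernel: a[i] = b[i] + scalar * c[i].
 * When the file is compiled with OpenMP enabled, the iterations are
 * divided among the threads of the team; without OpenMP the pragma is
 * ignored and the loop runs serially with the same result. */
void triad(double *a, const double *b, const double *c,
           double scalar, size_t n)
{
    /* A signed loop index keeps the loop valid for older OpenMP versions. */
    long i;

    #pragma omp parallel for
    for (i = 0; i < (long)n; i++)
        a[i] = b[i] + scalar * c[i];
}
```

The team size is then taken from OMP_NUM_THREADS at run time, as in the
command lines above.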

> Taking advantage of MT always requires deep knowledge about the
> architecture in use and the code being executed, and this is something
> OpenMP ignores. For a big vector operation the number of threads used
> should be adapted to the memory architecture

... and bound to the correct cores, i.e.:

[cvsupport@node001 stream]$ OMP_NUM_THREADS=2 numactl --physcpubind=0,4 ./stream_omp

... gives:

-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        8294.5171       0.0139       0.0154       0.0155
Scale:       8191.0001       0.0141       0.0156       0.0157
Add:         7920.1633       0.0218       0.0242       0.0244
Triad:       7990.9738       0.0217       0.0240       0.0241
-------------------------------------------------------------

But:

[cvsupport@node001 stream]$ OMP_NUM_THREADS=2 numactl --physcpubind=0,4 ./stream_omp

... gives:

-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       11603.7546       0.0099       0.0110       0.0110
Scale:      11152.7465       0.0104       0.0115       0.0116
Add:        10795.5704       0.0160       0.0178       0.0179
Triad:      10881.7832       0.0159       0.0176       0.0177
-------------------------------------------------------------

So, you need knowledge about the underlying NUMA architecture to get the 
best performance.
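
To put a number on the difference, the small C sketch below divides the
second set of rates by the first. The rates are copied from the two
tables above; the names first_run, second_run, speedup and
print_speedups are only illustrative. The ratios come out at roughly
1.36 to 1.40.

```c
#include <stdio.h>

/* Rates in MB/s, copied from the two result tables in this mail. */
static const char  *kernel[4]     = { "Copy", "Scale", "Add", "Triad" };
static const double first_run[4]  = {  8294.5171,  8191.0001,  7920.1633,  7990.9738 };
static const double second_run[4] = { 11603.7546, 11152.7465, 10795.5704, 10881.7832 };

/* Ratio of the second run's bandwidth to the first run's for kernel k (0..3). */
double speedup(int k)
{
    return second_run[k] / first_run[k];
}

/* Print one line per kernel with its bandwidth ratio. */
void print_speedups(void)
{
    int k;

    for (k = 0; k < 4; k++)
        printf("%-6s %.2fx\n", kernel[k], speedup(k));
}
```

Calling print_speedups() from a main function prints the four ratios.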

> for computationally intensive applications like Mandelbrot the number of 
> threads must be adapted to the number of available virtual cores.

Exactly.

By the way, GCC is totally unsuitable for this benchmark: both its
OpenMP implementation and its loop vectorizer are too weak. You need
Intel or Pathscale to reproduce these results.

Daniël