[fpc-pascal]Performance: Free Pascal vs GNU Pascal

Mon Dec 30 14:43:47 CET 2002

Michael.VanCanneyt at Wisa.be wrote:
 > I don't know. I do know that I've been able to speed up some of the
 > other algorithms by as much as 3 to 4 times; Others do not lend
 > themselves to much improvement. I haven't seen the C implementation,
 > so I don't know how they do it in C, maybe I could get a hint to make
 > it work faster in FPC.

On relevant problems the matrices didn't fit into the cache. (If they
does, it is unsignificant whether multiplication takes 10ms or 100ms).

Let me ask how you multiply. I hope you do not perform

for i:=0 to n-1 do
  for j:=0 to n-1 do
   for k:=0 to n-1 do
    a(i,j):=a(i,j) + b(k,j)*c(i,k)
                       |      |
            _column_ access  row access

These BLAS 1 are optimal for [RC]ISC _without_ cache or if the matrices
fit into the cache.

BLAS2 routines looks like:
for i:=0 to n-1 do
  for j:=0 to n-1 do
   for k:=0 to n-1 do
    a(i,k):=a(k,i) + b(j,k)*c(i,j)
       |              |
    row access    row access

They are optimal for vector computers.

BLAS3 routines are optimal for [RC]ISC _with_ cache. The matrix is
splitted into submatrixes, which fit into the cache. For example
     [ A1  A2 ]
A = [        ] and so on
     [ A3  A4 ]

A1 = B1*C1 + B2*C3
A2 = B1*C2 + B2*C4
A3 = B3*C1 + B4*C3
A4 = B3*C2 + B4*C4

The multiplikations Bi*Ci are BLAS 1 routines.

An other optimazion is unrolling. But it depends form the CPU (amount of
registers) and the compiler. Probably it doesn't help verry much with
fpc and IA32.

Here is on source for both:
Stephan Seidl, Code Crumpling: A Straight Technique to Improve Loop
Performance on Cache Based Systems, see
http://rcswww.urz.tu-dresden.de/~seidl/publications/

I have some C source code from the author, but unfortunately only on
paper. (This code is verry awful and long, because of unrolling, contact
me privately if you want an copy)

Literature about BLAS you find in the Internet (google).

A third way is performing the Strassen-Algorithm. By tricky substitution
and splitting you need less multiplications but (much) more additions.
The total computational costs are of the order O(n^(2.81))=O(n^(log_2 
7)). The disadvantage is the reduced accuracy (because of the 
additions). But not with exact calculation like integer matrices (as in 
the example of Florian Klaempfl).

Regards Stefan

-- 
mailto:stefan.ziegenbalg at mailbox.tu-dresden.de
http://www.sziegenbalg.de
http://www.simage.org