[fpc-devel] BlackFin

Johann Glaser Johann.Glaser at gmx.at
Mon Apr 16 13:57:44 CEST 2007


Hi!

On Monday, 16.04.2007 at 11:57 +0200, Michael Schnell wrote:
> >   r2 = r1 + r3, r4 = dm(i0,m1);  /* addition and memory access */
> >   
> Yep. In my answer to Florian I forgot that (unlike ARM) the Blackfin
> can do a calculation and a memory access in a single instruction cycle.
> That explains the much better performance even with standard
> (non-DSP-like) tasks.
> >   r3 = r2 * r4, r1 = r2 + r4;    /* multiplication and addition */
> >   
> I did not yet know that it can do two independent 32-bit calculations
> and that it can do 32-bit multiplications. Anyway, even if only two
> 32-bit additions can be done in one instruction cycle, this is a big
> opportunity for optimization.

The above code is based on an example program for some SHARC or
TigerSHARC DSP, so it's likely that the Blackfin has different processing
units. I've written the code just as an example of the algebraic style.
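
If I remember the Blackfin syntax correctly, it chains the parallel
issue slots with "||" instead of commas: one 32-bit ALU/MAC instruction
plus up to two 16-bit data moves per cycle. A rough sketch from memory
(register choice is mine, not checked against the manual):

  a0 += r0.l * r1.l || r0.l = w[i0++] || r1.l = w[i1++];
  /* one MAC plus two operand fetches issued together, the typical
     FIR inner-loop pattern */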

You have to carefully study the structure of the CPU (e.g. processing
units, buses, registers, address calculation, ...) to know what can be
done in parallel. In the example I looked at, there was a line with
four instructions in one cycle:
  f10 = f2 * f4, f12 = f10 + f12, f2 = dm(i1,m2), f4 = pm(i8,m8);
(ADSP-2106x).

In modern CPUs the parallel utilization of buses and processing units
is state of the art. The resource allocation and parallelization are
done on the fly during program execution by some smart logic inside the
CPU. When a compiler optimizes for a certain CPU, it anticipates this
and orders the instructions and registers appropriately to gain a few
percent more speed.
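
To illustrate what this ordering means (a made-up sketch, not the
output of any real compiler), the scheduler simply moves an independent
instruction between a load and its first use, so the pipeline has
something useful to do while the data arrives:

  /* naive order: the add right after the load may stall on r0 */
  r0 = [p0++];
  r1 = r1 + r0;
  r2 = r2 + r3;

  /* scheduled order: the independent add hides (part of) the latency */
  r0 = [p0++];
  r2 = r2 + r3;
  r1 = r1 + r0;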

The beauty of DSPs is that it's in the hands of the compiler (or the
assembly coder) to do the full optimization.

Bye
  Hansi




