> "2) Reporter's assumption about fstp is wrong: the first fstp
> instruction removes value from fpu stack, so it cannot be used for the
>
> Compiler should reuse loaded value (a[i]) and store to a[i] using
> fstl, then fstpl to a[i+1]

That is why Sergei called it "typical common subexpression elimination".
I am sure it is on the FPC team's todo list.
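To make the term concrete: eliminating the common subexpression means
computing a[i] + b[i] once and storing that one value into both a[i] and
a[i+1], instead of storing it and then reading a[i] back. A source-level
sketch, assuming the loop has the shape a[i] := a[i] + b[i];
a[i+1] := a[i] (the same shape as the hand-written loops further down).
The names TDoubleArray, AddReload and AddCSE are made up for illustration.

```pascal
program cse_sketch;
{$mode objfpc}

type
  // Hypothetical array type; the reporter's declarations are not quoted.
  TDoubleArray = array of Double;

// Straightforward form: a[i] is stored, then read back for a[i+1].
procedure AddReload(var a: TDoubleArray; const b: TDoubleArray; cnt: Integer);
var
  i: Integer;
begin
  for i := 0 to cnt - 1 do
  begin
    a[i] := a[i] + b[i];
    a[i + 1] := a[i];      // generated code may reload a[i] from memory
  end;
end;

// Common subexpression eliminated by hand: the sum is computed once
// and the same value is stored twice (the fstl / fstpl pattern).
procedure AddCSE(var a: TDoubleArray; const b: TDoubleArray; cnt: Integer);
var
  i: Integer;
  sum: Double;
begin
  for i := 0 to cnt - 1 do
  begin
    sum := a[i] + b[i];
    a[i] := sum;
    a[i + 1] := sum;
  end;
end;

var
  x, y, b: TDoubleArray;
  i: Integer;
begin
  SetLength(b, 4);
  SetLength(x, 5);         // cnt + 1 elements: the loop also writes a[cnt]
  SetLength(y, 5);
  for i := 0 to 4 do
  begin
    x[i] := 0;
    y[i] := 0;
  end;
  for i := 0 to 3 do
    b[i] := i + 1;         // b = 1, 2, 3, 4

  AddReload(x, b, 4);
  AddCSE(y, b, 4);

  for i := 0 to 4 do
    if x[i] <> y[i] then
    begin
      WriteLn('mismatch at index ', i);
      Halt(1);
    end;
  WriteLn('last element: ', x[4]:0:1);
end.
```

Both procedures produce the same array (1, 3, 6, 10, 10 for this data);
the only difference is whether the generated code has to read a[i] back.
Note that a needs cnt + 1 elements, because the last iteration writes
a[cnt].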

Also, in this case, reusing the loaded value is only a small gain. You
still have plenty of statements that recalculate the address of an array
element using a multiplication.
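For illustration, a small check of what the indexed form computes: the
address of a[i] is the base address plus i * SizeOf(Double), while a
typed pointer simply moves forward by one element per Inc. This is a
sketch using a hypothetical static array of Double; whether the compiler
emits an actual multiply or folds it into a scaled addressing mode
depends on the target and the optimization level.

```pascal
program address_sketch;
{$mode objfpc}

var
  a: array[0..9] of Double;   // hypothetical array, only addresses are used
  p: PDouble;                 // typed pointer from the System unit
  i: Integer;
  allMatch: Boolean;
begin
  allMatch := True;
  p := @a[0];
  for i := 0 to High(a) do
  begin
    // Indexed access: address = base + i * SizeOf(Double),
    // recomputed from i for every access.
    if PtrUInt(@a[i]) <> PtrUInt(@a[0]) + PtrUInt(i) * PtrUInt(SizeOf(Double)) then
      allMatch := False;
    // Pointer access: the address only needs one addition per iteration.
    if p <> @a[i] then
      allMatch := False;
    Inc(p);                   // advances by SizeOf(Double) bytes
  end;
  WriteLn('addresses match: ', allMatch);
end.
```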

Introducing a temporary pointer to a[i], and incrementing it with an
addition on each pass through the loop, will gain a lot more. (Again, I
am sure it is on the todo list.)

Until that is done, your best choice, if you need the speed, is to do
this by hand:

if cnt = 0 then exit;
tmpptrA := @a[0];
tmpptrB := @b[0];
for i := 0 to cnt - 1 do
begin
  tmpptrA^ := tmpptrA^ + tmpptrB^;  // a[i] := a[i] + b[i]
  tmpptrA2 := tmpptrA;              // remember the address of a[i] (a pointer, not the value)
  inc(tmpptrA); // assuming a typed pointer; now points to a[i+1]
  tmpptrA^ := tmpptrA2^;            // a[i+1] := a[i]
  inc(tmpptrB); // assuming a typed pointer
end;

or, better:

if cnt = 0 then exit;
tmpptrA := @a[0];
tmpptrB := @b[0];
for i := 0 to cnt - 1 do
begin
  tmpValue := tmpptrA^ + tmpptrB^;  // compute the sum once
  tmpptrA^ := tmpValue;             // a[i] := sum
  inc(tmpptrA); // assuming a typed pointer
  tmpptrA^ := tmpValue;             // a[i+1] := sum, no reload of a[i]
  inc(tmpptrB); // assuming a typed pointer
end;

It loses readability, so keep the readable code as a comment.
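Putting it together, a self-contained sketch of the second variant with
the declarations the snippets above leave out, and the readable loop
kept as a comment. TDoubleArray, AddPtr and the test data are
hypothetical; PDouble is the typed pointer from FPC's System unit, so
Inc advances it by SizeOf(Double). The main program checks the pointer
version against the plain indexed loop.

```pascal
program pointer_loop;
{$mode objfpc}

type
  TDoubleArray = array of Double;   // hypothetical, not from the original post

// Readable version, kept as a comment:
//   for i := 0 to cnt - 1 do
//   begin
//     a[i] := a[i] + b[i];
//     a[i + 1] := a[i];
//   end;
procedure AddPtr(var a: TDoubleArray; const b: TDoubleArray; cnt: Integer);
var
  i: Integer;
  tmpptrA, tmpptrB: PDouble;        // typed pointers
  tmpValue: Double;
begin
  if cnt = 0 then exit;             // @a[0] is not valid on an empty array
  tmpptrA := @a[0];
  tmpptrB := @b[0];
  for i := 0 to cnt - 1 do
  begin
    tmpValue := tmpptrA^ + tmpptrB^;
    tmpptrA^ := tmpValue;
    inc(tmpptrA);
    tmpptrA^ := tmpValue;
    inc(tmpptrB);
  end;
end;

var
  a, ref, b: TDoubleArray;
  i: Integer;
begin
  SetLength(b, 4);
  SetLength(a, 5);                  // cnt + 1 elements: the loop also writes a[cnt]
  SetLength(ref, 5);
  for i := 0 to 4 do
  begin
    a[i] := 0;
    ref[i] := 0;
  end;
  for i := 0 to 3 do
    b[i] := i + 1;

  AddPtr(a, b, 4);

  // Reference result from the readable, indexed loop.
  for i := 0 to 3 do
  begin
    ref[i] := ref[i] + b[i];
    ref[i + 1] := ref[i];
  end;

  for i := 0 to 4 do
    if a[i] <> ref[i] then
    begin
      WriteLn('mismatch at index ', i);
      Halt(1);
    end;
  WriteLn('pointer version matches indexed version');
end.
```

Comparing against the indexed loop like this is a cheap way to make sure
the hand optimization did not change the result before you measure the
speed.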

There is a bigger example where exactly that was done, because FPC's
optimization was not sufficient for what the author wanted:
http://bugs.freepascal.org/view.php?id=10275

