[fpc-pascal] Floating point question

Tue Feb 13 09:21:05 CET 2024

It occurs to me that there is merit in reduction of precision to increase
performance, and so I'm trying to learn how to do this correctly, but the
thing that confuses me is that math with constants doesn't seem to be the
same as math with variables, and I don't know why.

It also looks to me like when there is an expression such as:
e := 8427.0 + 33.0 / 1440.0;
what is happening is that each term of the expression is evaluated
individually to see if it can be reduced in precision, and then the math is
carried out.  But if the math were carried out at full precision first by the
compiler, and THEN the entire answer were evaluated to see if it can be
reduced in precision, the results would be what we are all expecting.
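
A minimal sketch of the comparison described above, using the same
expression.  The variable names e_plain and e_cast are made up for
illustration, and the Extended() cast on a literal is assumed to behave like
the casts in the full program further down.  No expected output is given
here, because it depends on the FPC version and on whether Extended is a true
80-bit type on the target.

```pascal
program ConstExprPrecision;

{ Sketch only: e_plain and e_cast are illustrative names, not from the post. }
var
   e_plain, e_cast : Extended;

begin
   { The expression exactly as written: every term is a literal, so the
     compiler picks the precision of each one by itself. }
   e_plain := 8427.0 + 33.0 / 1440.0;

   { Same expression with one operand of the division explicitly typed as
     Extended, so the division is not left to the literals' own precision. }
   e_cast := 8427.0 + Extended(33.0) / 1440.0;

   WriteLn('plain = ', e_plain:20:20);
   WriteLn('cast  = ', e_cast:20:20);
   WriteLn('diff  = ', (e_cast - e_plain):20:20);
end.
```

If the two lines print the same value, the literals were not reduced on that
target; if they differ, the difference is the precision lost in the constant
expression.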

Regardless of that however, when I am working with variables, an integer
added to a byte that has been divided by a single results in an
extended...it's legitimate to expect you could get an extended result from
such an operation, just as dividing a byte by another byte could result in
an extended answer.  With variables, this seems to always be the case, but
with constants, it does not seem to be the case.  If constants just did the
math the same as variables, then all this reduction in precision stuff would
work flawlessly for everyone without re-casting everything.

Please consider the code below.  I am comparing the results to what I get
when I perform this math with the Windows Calculator.  As you can see, no
matter how I cast it, when using variables I get the expected answer, but
when the compiler does the math, it's not working the same way.
What seems to be happening with variables is that the answer to lower
precision entities can result in higher precision results, while with
constants, the resulting precision is limited in some way, but in a way I
don't understand, because it's being reduced to single precision, but the
lowest precision element is a byte.

In other words with variables a byte / single is perfectly capable of
producing an extended result, without re-casting.  But with constants doing
the exact same thing forces the result to always be a single.

I don't think the real issue has anything to do with this reduction in
precision at all, I think it has to do with whatever causes the compiler to
do math differently than the executing program does with variables.  I don't
understand why I must individually re-cast every element of the equation
using constants to extended, while when I do the exact same thing with
variables it's not necessary. 

I am wondering if, the way the compiler does the math, it is expecting that
all constants would be full precision, and therefore the way it did the math
before always came out right, but when the change was made in 2.2 to reduce
the precision to variables, no corresponding adjustment was made to the way
the compiler carries out math to compensate for the possibility that there
was such a thing as a constant with reduced precision.   So the compiler is
doing math as if all input terms are at highest precision, therefore not
needing to bother considering the answer might be higher precision than the
input terms, but now that there is the possibility of the result being of
higher precision, some adjustment to the way math is done by the compiler is
necessary. 
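
For reference, Free Pascal documents a {$MINFPCONSTPREC} directive that sets
the minimum precision used for floating point constants.  The sketch below
assumes that directive applies to the untyped literals in the expression; it
is not something tested in this post, and as far as I know the accepted
values are default, 32 and 64, so it cannot ask for Extended.  The program
and variable names are made up for illustration.

```pascal
program MinConstPrec;

{ Assumption: this asks the compiler to keep floating point literals at
  double precision or better instead of reducing them to single. }
{$MINFPCONSTPREC 64}

var
   r : Extended;

begin
   { Untyped literals only.  Explicitly typed constants such as
     Single(1440.5) keep whatever type the cast gives them. }
   r := 8427.0 + 33.0 / 1440.5;
   WriteLn('r = ', r:20:20);
end.
```

Comparing the output with and without the directive line would show whether
the constant expression is being evaluated at reduced precision.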

I just think if the compiler did all the math the same way the executing
program does with math with variables, then everything is solved for
everyone... without any re-casting or unexpected results due to division,
and while also preventing unnecessary precision.  This has nothing to do
with the reduction of precision; only the way the compiler is doing its
calculations needs to be adjusted for this new situation.

Just fixing the way the compiler does the math also requires no knowledge of
the left side of the equation by the right.  The compiler just needs to do
the calculations the same way as variables are calculated with the extra
step of re-evaluating to see if the precision can be reduced when it's done.

James

program Const_Vs_Var;

Const
   A_const = Integer(8427);
   B_const = Byte(33);
   C_const = Single(1440.5);
   Win_Calc = 8427.0229087122526900381811870878;
   Const_Ans = A_Const+B_Const/C_Const;

Var
   A_Var : Integer;
   B_Var : Byte;
   C_Var : Single;
   Const_Ans1, Const_Ans2, Const_Ans3,
   Var_Ans1, Var_Ans2, Var_Ans3 : Extended;

Begin
   A_Var := A_Const;
   B_Var := B_Const;
   C_Var := C_Const;

   Var_Ans1   := A_Var+B_Var/C_Var;
   Const_Ans1 := A_Const+B_Const/C_Const;
   Var_Ans2   := Integer(A_Var)+Byte(B_Var)/Single(C_Var);
   Const_Ans2 := Integer(A_Const)+Byte(B_Const)/Single(C_Const);
   Var_Ans3   := Extended(A_Var)+Extended(B_Var)/Extended(C_Var);
   Const_Ans3 := Extended(A_Const)+Extended(B_Const)/Extended(C_Const);

   WRITELN ( '   Win_Calc = ',   Win_Calc:20:20) ;
   WRITELN ( '  Const_Ans = ',  Const_Ans:20:20 ,
             '  Win_Calc-Const_Ans = ',Win_Calc-Const_Ans:20:20) ;
   WRITELN ( ' Const_Ans1 = ', Const_Ans1:20:20 ,
             ' Win_Calc-Const_Ans1 = ',Win_Calc-Const_Ans1:20:20) ;
   WRITELN ( ' Const_Ans2 = ', Const_Ans2:20:20 ,
             ' Win_Calc-Const_Ans2 = ',Win_Calc-Const_Ans2:20:20) ;
   WRITELN ( ' Const_Ans3 = ', Const_Ans3:20:20 ,
             ' Win_Calc-Const_Ans3 = ',Win_Calc-Const_Ans3:20:20) ;
   WRITELN ( '   Var_Ans1 = ',   Var_Ans1:20:20 ,
             '   Win_Calc-Var_Ans1 = ',Win_Calc-Var_Ans1:20:20) ;
   WRITELN ( '   Var_Ans2 = ',   Var_Ans2:20:20 ,
             '   Win_Calc-Var_Ans2 = ',Win_Calc-Var_Ans2:20:20) ;
   { Print Var_Ans3 here; the original line printed Var_Ans2 by mistake. }
   WRITELN ( '   Var_Ans3 = ',   Var_Ans3:20:20 ,
             '   Win_Calc-Var_Ans3 = ',Win_Calc-Var_Ans3:20:20) ;
End.

   Win_Calc = 8427.02290871225268987000
  Const_Ans = 8427.02246100000000000000  Win_Calc-Const_Ans = 0.00044777475268986677
 Const_Ans1 = 8427.02246093750000000000 Win_Calc-Const_Ans1 = 0.00044777475268986677
 Const_Ans2 = 8427.02246093750000000000 Win_Calc-Const_Ans2 = 0.00044777475268986677
 Const_Ans3 = 8427.02290871225268987000 Win_Calc-Const_Ans3 = 0.00000000000000000000
   Var_Ans1 = 8427.02290871225268987000   Win_Calc-Var_Ans1 = 0.00000000000000000000
   Var_Ans2 = 8427.02290871225268987000   Win_Calc-Var_Ans2 = 0.00000000000000000000
   Var_Ans3 = 8427.02290871225268987000   Win_Calc-Var_Ans3 = 0.00000000000000000000