[fpc-pascal] Floating point question

Mon Feb 12 13:33:57 CET 2024

>>Overall, the intermediate float precision is a very difficult topic.

I agree it's a difficult topic. It all comes down to what your program is
doing, and whether you need performance or precision.

>>And generate the slowest code possible on most platforms.

I can appreciate the need to reduce precision where it's possible for the
sake of performance, especially when it won't make any difference.

What makes it difficult is that there are many different reasons for wanting
it one way or the other. It depends on the purpose of the program, and the
compiler has no way to know what that purpose is.

It occurs to me that one could want part of a program to be optimized for
performance and another part of the same program to be optimized for
precision. For example, suppose you are doing calculations to generate
geometry and also want to display that geometry on the screen. The data you
write out to a file should have maximum precision. But what you display on
the screen eventually becomes integer pixel values, so you want that math
done as fast as possible, especially if you want to pan / zoom / rotate.
Even though the screen data might be based on double precision or more, I
can see how reducing its precision as soon as possible would be beneficial
for performance.

So I’m trying to learn something. I agree it would be better to have
performance where it’s possible and precision when needed, but I just don't
understand what is going on. I'm not trying to say that this reduction in
precision should not be done; I understand the value in it. I’m trying to
figure out why math done with constants, where the compiler does the math,
does not give the same result as the same math done by the program with
variables. If the solution is to typecast where needed to get the desired
results, then why isn’t it working the way I expect it to?

Below is a sample program. I’m not trying to make everything Extended; in
fact quite the opposite. There is no need for the input constants /
variables to be Extended because they all fit perfectly in smaller data
types, so I put them all into smaller data types as an example. I am
defining the constants explicitly and defining the variables the exact same
way, so I’m comparing apples to apples here: A is always an Integer, B is
always a Byte, and C is always a Single, with a value that fits in a Single.

My goal is to add the Integer to a Byte that’s been divided by a Single and
get the result in Extended. When I do this with the variables, everything
is as I expect; when I do this with constants, it’s not.
This is what I don’t understand, and if this worked as expected then I think
everyone would be happy. Whatever is happening for it to work correctly
during program execution should also be happening when the compiler does the
math. The problem isn’t that the constants got stored in lower precision;
it’s that they are somehow forcing the result of the calculation to also be
at the lower precision, and it is not re-evaluated after the math. It’s
completely legitimate to divide a low precision number by a low precision
number and get a high precision result. It works with variables, so why
doesn’t it work with constants?
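
The two kinds of wrong result in the sample program below can be imitated
with variables. This is only a minimal sketch, under the assumption (not
confirmed anywhere above) that the compiler evaluates the constant
expression in Single precision. The program name and the variable names
Quotient, Sum, Wide and Mixed are made up for the illustration. Storing an
intermediate result in a Single variable forces it to be rounded to Single,
which is what the assumed constant folding would do.

```pascal
program FoldInSingleSketch;
{ Hypothetical illustration: imitate Single-precision evaluation by
  storing every intermediate result in a Single variable. }
var
  A: Integer;
  B: Byte;
  C, Quotient, Sum: Single;
  Wide, Mixed: Extended;
begin
  A := 8427;
  B := 33;
  C := 1440.5;

  Quotient := B / C;          // quotient is rounded to Single when stored
  Sum := A + Quotient;        // sum is rounded to Single again when stored

  Wide := Sum;                // widening afterwards cannot restore lost digits

  Mixed := A;                 // start the sum in Extended
  Mixed := Mixed + Quotient;  // only the quotient was rounded to Single

  WriteLn('all Single       = ', Wide:20:20);
  WriteLn('Single quotient  = ', Mixed:20:20);
end.
```

If the assumption holds, the first line should line up with the HH, II and
JJ results below (the whole expression rounded to Single), and the second
with the KK result (only the division rounded to Single). The exact digits
may differ on targets where Extended is not an 80-bit type.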

I suspect that there is something missing in the way the compiler does
math, something that was not needed when it was always done at maximum
precision, but that is needed with mixed precision. It’s not the fact that
the constants were reduced in precision; it’s something to do with the way
the math is done with constants of reduced precision that isn’t being
accounted for, and that would not be necessary when calculating with full
precision. It’s not that the changes in 2.2 are the problem at all; it’s
that something else needed to be done at the same time and was missed.

The only way I can get the correct result when using constants is to re-cast
ALL of them as extended, not just the ones involving division, and not the
entire formula, but every single constant.   This is what I don’t
understand.  

>>The evaluation of the expression on the right of := does not know (and
>>should not know) what the type is of the expression on the left.

Why can’t the compiler do all the math at full precision and then evaluate
only the result to see if it can be stored in a lower precision? If the
expression on the right cannot and should not know the type on the left,
then there is a good possibility that it’s a high precision data type, and
there should be some provision to safeguard against data loss in that case.

Why doesn’t this work?    JJ := Extended(A_Const+B_Const/C_Const);
It requires no knowledge of what is on the left.

Why can’t the math be done with high precision and the result be reduced to
the smallest data type? Math with low precision data types often produces
high precision results.

If I want to have a mixed program with portions in high precision and
portions that are highest performance possible, then what is the correct way
to accomplish the precision portions?   Are we supposed to re-cast every
constant at highest precision in every formula to make sure we don’t lose
data?
This doesn’t need to be done with Variables, why does it need to be done
with constants?
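
One possible alternative to casting every constant is sketched here, on the
assumption that the {$MINFPCONSTPREC} compiler directive applies to this
case. As far as I can tell from the documentation, it sets the minimum
precision used for floating point constants and accepts DEFAULT, 32 and 64,
so it can raise constants to Double but not to Extended. The program name
and the untyped constants A, B and C are made up for this sketch. How the
directive interacts with constants explicitly cast to Single, as in the
sample program below, is something I have not verified.

```pascal
program MinConstPrecSketch;
{$MINFPCONSTPREC 64}  // assumption: keep floating point constants at Double or wider

const
  A = 8427;
  B = 33;
  C = 1440.5;         // left untyped on purpose (hypothetical constant)

var
  R: Extended;

begin
  // With the directive, the constant expression should be evaluated in at
  // least Double precision instead of Single.
  R := A + B / C;
  WriteLn('R = ', R:20:20);
end.
```

Even if this works as assumed, the result would carry Double rather than
Extended precision, so it may still differ from the variable-based result
in the last few digits.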

Please see my comments in the sample program. I hope it is readable, because
sometimes e-mail breaks lines where I don’t intend it to.

James

program Const_Vs_Var;

Const
   A_const = Integer(8427);
   B_const = Byte(33);
   C_const = Single(1440.5);

Var
   A_Var : Integer;
   B_Var : Byte;
   C_Var : Single;
   FF, GG, HH, II, JJ, KK, LL : Extended;

Begin
   A_Var := A_Const;
   B_Var := B_Const;
   C_Var := C_Const;

   FF := A_Var+B_Var/C_Var;
   // This is the baseline. The math done with variables comes out the way
   // I expect it to.

   GG := Integer(A_Var)+Byte(B_Var)/Single(C_Var);
   // This is just for emphasis that I am doing the math with the data types
   // explicitly defined and I get the correct results.

   HH := Integer(A_Const)+Byte(B_Const)/Single(C_Const);
   // The result of this ONLY fits in an Extended, and the variable is
   // Extended. The constants are explicitly defined as above, so why is the
   // precision of the result reduced?

   KK := A_Const+Extended(B_Const/C_Const);
   // Here I’m trying to define that the result of the division should be
   // stored as an Extended.

   II := A_Const+B_Const/C_Const;
   // I really expected this to work without all the typecasting, because
   // the constants are defined the way I want them to be.

   JJ := Extended(A_Const+B_Const/C_Const);
   // Here I am explicitly defining the result of the calculation to be
   // Extended. Why doesn’t this work?

   LL := Extended(A_Const)+Extended(B_Const)/Extended(C_Const);
   // This is what I need to do to get the results I want, but I don’t
   // understand why. Why does the Integer need to be converted to floating
   // point here?

   WRITELN ( ' A_const = ',A_Const) ;                            
   //  A_const = 8427                      //Integer

   WRITELN ( '   A_var = ',A_Var) ;
  //    A_var = 8427                      //Integer

   WRITELN ( ' B_const = ',B_Const) ;
  //  B_const = 33                        //Byte

   WRITELN ( '   B_var = ',B_Var) ;
  //    B_var = 33                        //Byte

   WRITELN ( ' C_const = ',C_Const: 20 : 20 ) ;
  //  C_const = 1440.50000000000000000000 //Single

   WRITELN ( '   C_var = ',C_Var: 20 : 20 ) ;
  //    C_var = 1440.50000000000000000000 //Single

   WRITELN ( '      FF = ',FF:20:20 ,'  FF-FF = ',FF-FF:20:20) ; 
   //       FF = 8427.02290871225268987000  FF-FF = 0.00000000000000000000
   //This is what I expect

   WRITELN ( '      GG = ',GG:20:20 ,'  FF-GG = ',FF-GG:20:20) ; 
   //       GG = 8427.02290871225268987000  FF-GG = 0.00000000000000000000
   //This is what I expect

   WRITELN ( '      HH = ',HH:20:20 ,'  FF-HH = ',FF-HH:20:20) ; 
   //       HH = 8427.02246093750000000000  FF-HH = 0.00044777475268986677
   //I don't understand why this is different from GG?  It's an Int + Byte /
   //Single and cast the same way

   WRITELN ( '      II = ',II:20:20 ,'  FF-II = ',FF-II:20:20) ;
   //       II = 8427.02246093750000000000  FF-II = 0.00044777475268986677
   //I don't understand why this is different from FF?  It's an Int + Byte /
   //Single

   WRITELN ( '      JJ = ',JJ:20:20 ,'  FF-JJ = ',FF-JJ:20:20) ;
   //       JJ = 8427.02246093750000000000  FF-JJ = 0.00044777475268986677
   //Why doesn't this casting work?   I'm saying I want the result in an
   //Extended.

   WRITELN ( '      KK = ',KK:20:20 ,'  FF-KK = ',FF-KK:20:20) ; 
   //       KK = 8427.02290871180593967000  FF-KK = 0.00000000044675019240
   //Why is this off a little?  I am casting the division to be Extended.

   WRITELN ( '      LL = ',LL:20:20 ,'  FF-LL = ',FF-LL:20:20) ;
   //       LL = 8427.02290871225268987000  FF-LL = 0.00000000000000000000
   //Why do I need to re-cast each constant as Extended? It’s not what I
   //really want. I want to add an Integer to a Byte divided by a Single, do
   //it correctly, and store it as Extended.

End.

A_const = 8427
   A_var = 8427
B_const = 33
   B_var = 33
C_const = 1440.50000000000000000000
   C_var = 1440.50000000000000000000
      FF = 8427.02290871225268987000  FF-FF = 0.00000000000000000000
      GG = 8427.02290871225268987000  FF-GG = 0.00000000000000000000
      HH = 8427.02246093750000000000  FF-HH = 0.00044777475268986677
      II = 8427.02246093750000000000  FF-II = 0.00044777475268986677
      JJ = 8427.02246093750000000000  FF-JJ = 0.00044777475268986677
      KK = 8427.02290871180593967000  FF-KK = 0.00000000044675019240
      LL = 8427.02290871225268987000  FF-LL = 0.00000000000000000000
