[fpc-pascal] Floating point question
James Richters
james.richters at productionautomation.net
Mon Feb 12 13:33:57 CET 2024
>>Overall, the intermediate float precision is a very difficult topic.
I agree it's a difficult topic, it all comes down to what your program is
doing, and whether you need performance or precision.
>>And generate the slowest code possible on most platforms.
I can appreciate the need to reduce precision where it's possible for the
sake of performance, especially when it won't make any difference.
What makes it difficult is there are many different reasons for wanting it
one way, or the other, it depends on the purpose of the program, and the
compiler has no way to know what the purpose is.
It occurs to me that one could want part of a program to be optimized for
performance and another part of the same program to be optimized for
precision. For example, if you are doing calculations to generate geometry
and also want to display that geometry on the screen, you would want maximum
precision for the data you write out to a file. But what you display on the
screen eventually becomes integer pixel values, so you want that math done
as fast as possible, especially if you want to pan / zoom / rotate. Even
though the screen data might be based on double precision or more, I can see
how reducing its precision as soon as possible would be beneficial to
increase performance.
So I'm trying to learn something. I agree it would be better to have
performance where it's possible and precision when needed. But I just don't
understand what is going on. I'm not trying to say that this reduction in
precision should not be done; I understand the value in it. I'm trying
to figure out why the math done with constants, where the compiler is doing
the math, is not the same as when the program does the math with variables.
If the solution is to typecast where needed to get the desired results, then
why isn't it working the way I expect it to?
Below is a sample program. I'm not trying to make everything Extended, in
fact quite the opposite: there is no need for the input constants /
variables to be Extended because they all fit perfectly in smaller data
types, so I put them all into smaller data types as an example. I am
defining constants explicitly and defining variables the exact same way, so
I'm comparing apples to apples here. I have A as always an Integer, B as
always a Byte, and C as always a Single, with a value that fits in a Single.
My goal is to add the Integer to a Byte that's been divided by a Single and
get the result in Extended. When I do this with the variables, everything
is as I expected; when I do this with constants, it's not as I expect.
This is what I don't understand, and if this worked as expected then I think
everyone would be happy. Whatever is happening for it to work correctly
during program execution should also be happening when the compiler does the
math. The problem isn't that the constants got stored in lower precision,
it's that they are somehow forcing the result of the calculation to also be
at the lower precision and not re-evaluated after the math. It's completely
legitimate to divide a low precision number by a low precision number and
get a high precision result. It works with variables, why doesn't it work
with constants?
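One way to see where the HH, II and JJ values in the sample program might
come from is to force the same rounding with variables. The following is
only a minimal sketch and not part of the sample program: the program name
and the variables QuotSingle, SumSingle, QuotWide and SumWide are made up
for illustration, and it assumes a target where Extended is wider than
Single.

```pascal
program SingleRoundingSketch;
{ Hypothetical helper program for illustration only. }
var
  A: Integer;
  B: Byte;
  C: Single;
  QuotSingle, SumSingle: Single;
  QuotWide, SumWide: Extended;
begin
  A := 8427;
  B := 33;
  C := 1440.5;

  // Storing into a Single variable rounds the value to single precision,
  // whatever precision the intermediate calculation used.
  QuotSingle := B / C;
  SumSingle := A + B / C;

  // Widen the already rounded quotient first, then add in Extended.
  QuotWide := QuotSingle;
  SumWide := A + QuotWide;

  WriteLn('whole expression rounded to Single:  ', SumSingle:20:20);
  WriteLn('only the quotient rounded to Single: ', SumWide:20:20);
end.
```

If the constant expression is being evaluated entirely in single precision,
the first line should match HH, II and JJ, and the second should be close to
KK. I would treat that as a hint about where the rounding happens, not as
proof of what the compiler does internally.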
I suspect that what's happened is that there is something missing in the way
the compiler does math, something that is not needed if it was always done
at maximum precision, but that is needed with mixed precision. It's not
the fact that the constants were reduced in precision, it's something
to do with the way the math is done with constants of reduced precision that
isn't being accounted for, and that is not necessary if calculating with
full precision. It's not that the changes in 2.2 are the problem at all,
it's that something else needed to be done at the same time that was missed.
The only way I can get the correct result when using constants is to re-cast
ALL of them as Extended, not just the ones involving division, and not the
entire formula, but every single constant. This is what I don't
understand.
>>The evaluation of the expression on the right of := does not know (and
>>should not know) what the type is of the expression on the left.
Why can't the compiler do all the math at full precision and then evaluate
only the result to see if that can be stored in a lower precision? If the
expression on the right cannot and should not know the type on the left,
then there is a good possibility that it's a high precision data type, and
then there should be some provision to safeguard against data loss if the
type is of high precision.
Why doesn't this work? JJ := Extended(A_Const+B_Const/C_Const); It
requires no knowledge of what is on the left.
Why can't the math be done with high precision and the result be reduced to
the smallest data type? Math with low precision data types often results in
high precision results.
If I want to have a mixed program with portions in high precision and
portions that are highest performance possible, then what is the correct way
to accomplish the precision portions? Are we supposed to re-cast every
constant at highest precision in every formula to make sure we don't lose
data?
This doesn't need to be done with variables, why does it need to be done
with constants?
Please see my comments in the sample program. I hope it is readable, because
sometimes e-mail breaks lines where I don't intend it to.
James
program Const_Vs_Var;
Const
A_const = Integer(8427);
B_const = Byte(33);
C_const = Single(1440.5);
Var
A_Var : Integer;
B_Var : Byte;
C_Var : Single;
FF, GG, HH, II, JJ, KK, LL : Extended;
Begin
A_Var := A_Const;
B_Var := B_Const;
C_Var := C_Const;
FF := A_Var+B_Var/C_Var;
// This is the baseline. The math done with variables comes out the way I
// expect it to.
GG := Integer(A_Var)+Byte(B_Var)/Single(C_Var);
// This is just for emphasis that I am doing the math with the data types
// explicitly defined and I get the correct results.
HH := Integer(A_Const)+Byte(B_Const)/Single(C_Const);
// The result of this ONLY fits in an Extended, and the variable is
// Extended. The constants are explicitly defined as above, so why is the
// precision of the result reduced?
KK := A_Const+Extended(B_Const/C_Const);
// Here I'm trying to define that the result of the division should be
// stored as an Extended.
II := A_Const+B_Const/C_Const;
// I really expected this to work without all the typecasting, because
// the constants are defined the way I want them to be.
JJ := Extended(A_Const+B_Const/C_Const);
// Here I am explicitly defining the result of the calculation to be
// Extended. Why doesn't this work?
LL := Extended(A_Const)+Extended(B_Const)/Extended(C_Const);
// This is what I need to do to get the results I want, but I don't
// understand why. Why does the integer need to be converted to floating
// point here?
WRITELN ( ' A_const = ',A_Const) ;
// A_const = 8427 //Integer
WRITELN ( ' A_var = ',A_Var) ;
// A_var = 8427 //Integer
WRITELN ( ' B_const = ',B_Const) ;
// B_const = 33 //Byte
WRITELN ( ' B_var = ',B_Var) ;
// B_var = 33 //Byte
WRITELN ( ' C_const = ',C_Const: 20 : 20 ) ;
// C_const = 1440.50000000000000000000 //Single
WRITELN ( ' C_var = ',C_Var: 20 : 20 ) ;
// C_var = 1440.50000000000000000000 //Single
WRITELN ( ' FF = ',FF:20:20 ,' FF-FF = ',FF-FF:20:20) ;
// FF = 8427.02290871225268987000 FF-FF = 0.00000000000000000000
//This is what I expect
WRITELN ( ' GG = ',GG:20:20 ,' FF-GG = ',FF-GG:20:20) ;
// GG = 8427.02290871225268987000 FF-GG = 0.00000000000000000000
//This is what I expect
WRITELN ( ' HH = ',HH:20:20 ,' FF-HH = ',FF-HH:20:20) ;
// HH = 8427.02246093750000000000 FF-HH = 0.00044777475268986677
//I don't understand why this is different from GG? It's an Int + Byte /
//Single and cast the same way
WRITELN ( ' II = ',II:20:20 ,' FF-II = ',FF-II:20:20) ;
// II = 8427.02246093750000000000 FF-II = 0.00044777475268986677
//I don't understand why this is different from FF? It's an Int + Byte /
//Single
WRITELN ( ' JJ = ',JJ:20:20 ,' FF-JJ = ',FF-JJ:20:20) ;
// JJ = 8427.02246093750000000000 FF-JJ = 0.00044777475268986677
//Why doesn't this casting work? I'm saying I want the result in an
//Extended.
WRITELN ( ' KK = ',KK:20:20 ,' FF-KK = ',FF-KK:20:20) ;
// KK = 8427.02290871180593967000 FF-KK = 0.00000000044675019240
//Why is this off a little? I am casting the division to be Extended.
WRITELN ( ' LL = ',LL:20:20 ,' FF-LL = ',FF-LL:20:20) ;
// LL = 8427.02290871225268987000 FF-LL = 0.00000000000000000000
//Why do I need to re-cast each constant as Extended? It's not what I
//really want. I want to add an integer to a byte divided by a single, do it
//correctly and store it as Extended.
End.
A_const = 8427
A_var = 8427
B_const = 33
B_var = 33
C_const = 1440.50000000000000000000
C_var = 1440.50000000000000000000
FF = 8427.02290871225268987000 FF-FF = 0.00000000000000000000
GG = 8427.02290871225268987000 FF-GG = 0.00000000000000000000
HH = 8427.02246093750000000000 FF-HH = 0.00044777475268986677
II = 8427.02246093750000000000 FF-II = 0.00044777475268986677
JJ = 8427.02246093750000000000 FF-JJ = 0.00044777475268986677
KK = 8427.02290871180593967000 FF-KK = 0.00000000044675019240
LL = 8427.02290871225268987000 FF-LL = 0.00000000000000000000