[fpc-pascal] Functions for dealing with floating-point precision
Ekkehard Domning
edo at domis.de
Sat Sep 14 18:41:33 CEST 2024
Dear Thomas,
thank your for the excellent approach for solving the problem, finding
the closest smaller or greater floating point value to a given one.
We had an descent discussion about the same topic on "LazarusForum.de"
https://www.lazarusforum.de/viewtopic.php?f=10&t=16019&hilit=kleinste&start=15
where your solution comes up.
I packed your code into a working unit, so that it may be easier to
include into other projects.
While performing an intense testing, I found two minor bugs.
The first one is, that the const MinDouble from the unit math.pas
is not the smallest Double which could be generated. Thus the code will
bypass all values between MinDouble (2.2250738585072014e-308) and
4.9406564584124654E-324
The second one is that the usage of the const MaxSingle, will not work
as expected. The compiler will obviously convert MaxSingle into a Double
value and subsequently calling the related Double functions, not as
expected the Single function.
Example:
CopySign (MaxSingle, Start)
will call
function CopySign (x: Double; y: Double): Double;
and not
function CopySign (x: Single; y: Single): Single;
This will luckily not lead into errors.
In the function NextAfter
if (toZero) then begin
(...)
end else begin
// we are increasing the magnitude, toward +-Infinity
if (Start = 0.0) then Exit (CopySign (MinSingle, Direction));
if (absStart = MaxSingle) then Exit (CopySign (Infinity, Start));
Exit (CopySign (LongBitsToSingle(Succ(SingleToLongBits(absStart))),
Start));
end;
this conversion leads to an unwanted behaviour in the line:
if (absStart = MaxSingle) then Exit (CopySign (Infinity, Start));
since the term (absStart = MaxSingle) will never become true, even if
NextAfter is called with a value containing MaxSingle. The reason is
that
Single(MaxSingle) <> MaxSingle
Surprisingly the function is working and delivers the correct result,
since the following line
CopySign (LongBitsToSingle(Succ(SingleToLongBits(absStart))),Start));
executed absStart containing MaxSingle, will deliver Infinity as expected.
So I decided to remove this line.
The unit contains some new defined consts and uses them to bypass the
described problems
MinMinDouble = 4.9406564584124654E-324;
SingleMaxSingle : Single = Single(MaxSingle);
NegSingleMaxSingle : Single = Single(-MaxSingle);
I attached the unit, which is also slightly modified due to my different
pascal formatting style.
Thank you very much for sharing.
Best regards
Ekkehard Domning
-------------- next part --------------
{
Unit UNextFloat
a very close adaption of the Source found at
https://www.mail-archive.com/fpc-pascal@lists.freepascal.org/msg55419.html
The two mainfunction are
NextAfter
and
ULP
NextAfter is an replacement for the nextafter function found in the "msvcrt.dll"
https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/nextafter-functions?view=msvc-170
It returns the next closest float (single or double) value to a Start-value
to the Direction-value
ULP is "unit in the last place or unit of least precision (ulp) is the spacing
between two consecutive floating-point numbers, i.e., the value the least
significant digit (rightmost digit) represents if it is 1."
https://en.wikipedia.org/wiki/Unit_in_the_last_place
-=-=-=-Start of Page excerpt =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
[fpc-pascal] Functions for dealing with floating-point precision
Thomas Kurz via fpc-pascal Sun, 19 Jun 2022 13:38:48 -0700
Hello,
in my program I have need for checking floating-point precision. I'm internally
using floating-points for calculations, but in the end I have to use integer
numbers. I cannot use Round() because I have to check for thresholds. I.e. that
I wish to accept a value of 1.9999.... as being ">=2". As you can see, I cannot
use Trunc() either, so I decided to check for the difference in ULPs (unit in
the last place).
I couldn't find the corresponding functions in the RTL, but I think they should
be included. I wrote the functions listed below. I would appreciate if the
maintainers of FPC decide to add them to the Math unit. I skipped the functions
for the Extended type, but if it's being desired, I can add hand them in later.
Kind regards,
Thomas
-=-=-=-End of Page excerpt =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Version Date Who Change
0.0.1 2024-09-05 EDo First light
0.0.2 2024-09-05 EDo Bugfix MinDouble, introducing MinMinDouble = 4.9406564584124654E-324
0.0.3 2024-09-06 EDo Bugfix in usage of MaxSingle (see comments below)
Introducing two typed consts SingleMaxSingle and
NegSingleMaxSingle to ease the usage.
}
unit unextfloat;
interface
uses Math;
// CAUTION NextAfter for single values is not compiled into the code if e.g
// NextAfter(MaxSingle,-MaxSingle)
// is written. In this case the compiler uses the overloaded Double version!
// To ensure the usage of the single parameter function
// always cast MaxSingle to Single (Single(MaxSingle), Single(-MaxSingle))
// or use the below defined SingleMaxSingle and NegSingleMaxSingle constants.
function NextAfter(const Start: Single; const Direction: Single): Single;
function ULP(Argument: Single): Single;
{$ifdef FPC_HAS_TYPE_DOUBLE}
function NextAfter(const Start: Double; const Direction: Double): Double;
function ULP(Argument: Double): Double;
{$endif FPC_HAS_TYPE_DOUBLE}
// Other functions, probably only needed internal
function SingleToLongBits(const Argument: Single): UInt32;
function LongBitsToSingle(const Argument: UInt32): Single;
// returns the first floating-point argument with the sign of the second floating-point argument
function CopySign(const x: Single; const y: Single): Single;
function IsSameSign(const x: Single; const y: Single): Boolean;
function FloatPrior(const Argument: Single): Single;
function FloatNext(const Argument: Single): Single;
{$ifdef FPC_HAS_TYPE_DOUBLE}
function DoubleToLongBits(const Argument: Double): UInt64;
function LongBitsToDouble(Argument: UInt64): Double;
// returns the first floating-point argument with the sign of the second floating-point argument
function CopySign(const x: Double; const y: Double): Double;
function IsSameSign(const x: Double; const y: Double): Boolean;
function FloatPrior(const Argument: Double): Double;
function FloatNext(Argument: Double): Double;
{$endif FPC_HAS_TYPE_DOUBLE}
const
MinMinDouble = 4.9406564584124654E-324;
MinMinSingle = 1.40129846E-45;
// MaxSingle failed in equal comparations!
// singlevalue := MaxSingle;
// equal := (singlevalue = MaxSingle);
// will give equal = false !!
// This is caused obviously by converting MaxSingle into a double value,
// by the compiler, prior the comparism
// coding
// equal := (singlevalue = Single(MaxSingle));
// will give equal = true!!
// To bypass the casting two const are introduced to be used
// instead MaxSingle and -MaxSingle
SingleMaxSingle : Single = Single(MaxSingle);
NegSingleMaxSingle : Single = Single(-MaxSingle);
implementation
const MAX_ULP_SINGLE = 2.028240960E+31;
const MAX_ULP_DOUBLE = 1.9958403095347198E+292;
function SingleToLongBits(const Argument: Single): UInt32;
var
argResult : Cardinal absolute Argument;
begin
Result := ArgResult;
end;
{$ifdef FPC_HAS_TYPE_DOUBLE}
function DoubleToLongBits(const Argument: Double): UInt64;
var
argResult : UInt64 absolute Argument;
begin
Result := argResult;
end;
{$endif FPC_HAS_TYPE_DOUBLE}
function LongBitsToSingle(const Argument: UInt32): Single;
var
argResult : Single absolute Argument;
begin
Result := argResult;
end;
{$ifdef FPC_HAS_TYPE_DOUBLE}
function LongBitsToDouble(Argument: UInt64): Double;
var
argResult : Double absolute Argument;
begin
Result := argResult;
end;
{$endif FPC_HAS_TYPE_DOUBLE}
// returns the first floating-point argument with the sign of the second floating-point argument
function CopySign(const x: Single; const y: Single): Single;
begin
Result := (Abs(x) * Sign(y));
end;
{$ifdef FPC_HAS_TYPE_DOUBLE}
// returns the first floating-point argument with the sign of the second floating-point argument
function CopySign(const x: Double; const y: Double): Double;
begin
Result := (Abs(x) * Sign(y));
end;
{$endif FPC_HAS_TYPE_DOUBLE}
function IsSameSign(const x: Single; const y: Single): Boolean;
begin
Result := (CopySign(x, y) = x);
end;
{$ifdef FPC_HAS_TYPE_DOUBLE}
function IsSameSign(const x: Double; const y: Double): Boolean;
begin
Result := (CopySign(x, y) = x);
end;
{$endif FPC_HAS_TYPE_DOUBLE}
function NextAfter(const Start: Single; const Direction: Single): Single;
var
absStart: Single;
absDir: Single;
toZero: Boolean;
begin
// If either argument is a NaN, then NaN is returned.
if IsNan(Start) or IsNan(Direction) then Exit(NaN);
// If both arguments compare as equal the second argument is returned.
if (Start = Direction) then Exit(Direction);
absStart := Abs(Start);
absDir := Abs(Direction);
toZero := not IsSameSign(Start, Direction) or (AbsDir < absStart);
if (toZero) then
begin
// we are reducing the magnitude, going toward zero.
if (absStart = MinMinSingle) then
Exit(CopySign(0.0, Start));
if IsInfinite(absStart) then
Exit(CopySign(SingleMaxSingle, Start));
Exit(CopySign(LongBitsToSingle(Pred(SingleToLongBits(absStart))), Start));
end
else
begin
// we are increasing the magnitude, toward +-Infinity
if (Start = 0.0) then
Exit(CopySign(MinMinSingle, Direction));
{
// This comparism does not work
// if (absStart = MaxSingle) then
// to get it work it must be coded
if (absStart = SingleMaxSingle) then
Exit(CopySign(Infinity, Start));
// But since the next Value after MaxSingle is Infinity, the whole comparism is not needed!
}
Exit(CopySign(LongBitsToSingle(Succ(SingleToLongBits(absStart))), Start));
end;
end;
{$ifdef FPC_HAS_TYPE_DOUBLE}
function NextAfter(const Start: Double; const Direction: Double): Double;
var
absStart: Double;
absDir: Double;
toZero: Boolean;
begin
// If either argument is a NaN, then NaN is returned.
if IsNan(Start) or IsNan(Direction) then Exit(NaN);
// If both arguments compare as equal the second argument is returned.
if (Start = Direction) then Exit(Direction);
absStart := Abs(Start);
absDir := Abs(Direction);
toZero := not IsSameSign(Start, Direction) or (AbsDir < AbsStart);
if (toZero) then
begin
// we are reducing the magnitude, going toward zero.
if (absStart = MinMinDouble) then
Exit(CopySign(0.0, Start));
if IsInfinite(absStart) then
Exit(CopySign(MaxDouble, Start));
Exit(CopySign(LongBitsToDouble(Pred(DoubleToLongBits(absStart))),Start));
end
else
begin
// we are increasing the magnitude, toward +-Infinity
if (Start = 0.0) then
Exit(CopySign(MinMinDouble, Direction));
if (AbsStart = MaxDouble) then
Exit(CopySign(Infinity, Start));
Exit(CopySign(LongBitsToDouble(Succ(DoubleToLongBits(absStart))), Start));
end;
end;
{$endif FPC_HAS_TYPE_DOUBLE}
function FloatPrior(const Argument: Single): Single;
begin
Result := NextAfter(Argument, NegInfinity);
end;
function FloatNext(const Argument: Single): Single;
begin
Result := NextAfter(Argument, Infinity);
end;
{$ifdef FPC_HAS_TYPE_DOUBLE}
function FloatPrior(const Argument: Double): Double;
begin
Result := NextAfter(Argument, NegInfinity);
end;
{$endif FPC_HAS_TYPE_DOUBLE}
{$ifdef FPC_HAS_TYPE_DOUBLE}
function FloatNext(Argument: Double): Double;
begin
Result := NextAfter(Argument, Infinity);
end;
{$endif FPC_HAS_TYPE_DOUBLE}
function ULP(Argument: Single): Single;
begin
// If the argument is NaN, then the result is NaN.
if IsNaN(Argument) then Exit(NaN);
// If the argument is positive or negative infinity, then the result is positive infinity.
if IsInfinite(Argument) then Exit(Infinity);
// If the argument is positive or negative zero, then the result is Single.MIN_VALUE.
if (Argument = 0.0) then Exit(MinMinSingle);
Argument := Abs(Argument);
// If the argument is ±Single.MAX_VALUE, then the result is equal to 2^104.
if (Argument = SingleMaxSingle) then
Exit(MAX_ULP_SINGLE);
Result := FloatNext(Argument) - Argument;
end;
{$ifdef FPC_HAS_TYPE_DOUBLE}
function ULP(Argument: Double): Double;
begin
// If the argument is NaN, then the result is NaN.
if IsNaN(Argument) then Exit(NaN);
// If the argument is positive or negative infinity, then the result is positive infinity.
if IsInfinite(Argument) then Exit(Infinity);
// If the argument is positive or negative zero, then the result is Double.MIN_VALUE.
if (Argument = 0.0) then Exit(MinMinDouble);
Argument := Abs(Argument);
// If the argument is ±Double.MAX_VALUE, then the result is equal to 2^971.
if (Argument = MaxDouble) then Exit(MAX_ULP_DOUBLE);
Result := FloatNext(Argument) - Argument;
end;
{$endif FPC_HAS_TYPE_DOUBLE}
end.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: OpenPGP_signature.asc
Type: application/pgp-signature
Size: 840 bytes
Desc: OpenPGP digital signature
URL: <http://lists.freepascal.org/pipermail/fpc-pascal/attachments/20240914/daca7d82/attachment.sig>
More information about the fpc-pascal
mailing list