[fpc-devel] Faster trig functions
J. Gareth Moreton
gareth at moreton-family.com
Sat Jun 29 03:05:51 CEST 2024
Hi everyone,
So for a few weeks now, I've been playing with the idea of speeding up
the trig functions in Free Pascal, or at least the trig functions for
Single-precision which find uses in graphical applications where deep
precision is not absolutely necessary, but speed is paramount.
Nevertheless, I sought to preserve the precision when compared with the
current generic algorithm.
Note that this only works for Single-precision floating-point, as my
assembly routines make use of Double-precision for intermediate
calculations but cannot eliminate rounding errors due to the operations
utilised.
These are my current metrics for the sin function:
Single-precision sin function assembly language candidates:
Single-precision sin function assembly language candidates:
Arg. Generic function Deprecated x87 operation
SSE2 optimised SSE4 optimised AVX
optimised FMA optimised
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
0 0.0000000000 3.91 ns 0.0000000000 27.20 ns
0.0000000000 2.91 ns 0.0000000000 2.29 ns 0.0000000000 2.49
ns 0.0000000000 2.71 ns
pi/4 0.7071067691 49.19 ns 0.7071067691 55.40 ns
0.7071067691 15.31 ns 0.7071067691 15.11 ns 0.7071067691
14.60 ns 0.7071067691 12.60 ns
1 0.8414709568 49.29 ns 0.8414709568 55.81 ns
0.8414709568 15.09 ns 0.8414709568 15.41 ns 0.8414709568
14.40 ns 0.8414709568 13.09 ns
pi/2 1.0000000000 266.82 ns 1.0000000000 49.00 ns
1.0000000000 15.28 ns 1.0000000000 14.92 ns 1.0000000000
14.70 ns 1.0000000000 12.70 ns
3pi/4 0.7071067691 48.19 ns 0.7071067691 49.00 ns
0.7071067691 15.21 ns 0.7071067691 14.99 ns 0.7071067691
14.40 ns 0.7071067691 12.11 ns
3 0.1411200017 48.80 ns 0.1411200017 41.38 ns
0.1411200017 15.41 ns 0.1411200017 14.31 ns 0.1411200017
15.41 ns 0.1411200017 12.60 ns
pi -0.0000000874 266.50 ns -0.0000000874 40.99 ns
-0.0000000874 15.19 ns -0.0000000874 14.31 ns -0.0000000874
14.31 ns -0.0000000874 12.60 ns
5pi/4 -0.7071067095 46.70 ns -0.7071067095 48.80 ns
-0.7071067095 15.80 ns -0.7071067095 15.70 ns -0.7071067095
14.89 ns -0.7071067095 12.11 ns
3pi/2 -1.0000000000 268.90 ns -1.0000000000 49.00 ns
-1.0000000000 15.31 ns -1.0000000000 14.99 ns -1.0000000000
15.01 ns -1.0000000000 12.08 ns
7pi/4 -0.7071068883 49.19 ns -0.7071068883 54.81 ns
-0.7071068883 15.89 ns -0.7071068883 14.31 ns -0.7071068883
14.50 ns -0.7071068883 12.99 ns
2pi 0.0000001748 266.31 ns 0.0000001748 40.60 ns
0.0000001748 15.41 ns 0.0000001748 14.40 ns 0.0000001748
14.40 ns 0.0000001748 14.28 ns
-pi/4 -0.7071067691 50.00 ns -0.7071067691 56.62 ns
-0.7071067691 15.28 ns -0.7071067691 15.01 ns -0.7071067691
14.89 ns -0.7071067691 12.30 ns
-pi/2 -1.0000000000 273.80 ns -1.0000000000 50.20 ns
-1.0000000000 15.99 ns -1.0000000000 14.40 ns -1.0000000000
14.60 ns -1.0000000000 12.82 ns
-3pi/4 0.7071067691 48.29 ns 0.7071067691 50.49 ns
0.7071067691 15.82 ns 0.7071067691 14.48 ns 0.7071067691
14.50 ns 0.7071067691 14.31 ns
-pi 0.0000000874 274.19 ns 0.0000000874 41.50 ns
0.0000000874 15.01 ns 0.0000000874 14.60 ns 0.0000000874
14.38 ns 0.0000000874 12.21 ns
-0 -0.0000000000 4.30 ns -0.0000000000 28.10 ns
-0.0000000000 2.81 ns -0.0000000000 2.49 ns -0.0000000000 2.42
ns -0.0000000000 2.88 ns
(The fact that sin 2π ≠ 0 is due to the inability to store 2π itself
with satisfactory precision, so it enters the function with a value that
won't reduce to exactly zero modulo 2π)
Even against the deprecated FSIN function, my algorithm is about three
times faster. My algorithm calculates sin using a factorised MacLaurin
expansion up to the 1/13! term (enough to calculate sin π for
Single-precision) as well as performing some domain reduction so the
argument is kept small.
Essentially, sin x = x - x^3 /3! + x^5 /5! - x^7 /7! + x^9 /9! - x^11
/11! + x^13 /13!, which is factorised into x(1 - x^2 (1/3! - x^2 (1/5! -
x^2 (1/7! - x^2 (1/9! - x^2 (1/11! - x^2 /13!)))))), and from here, x^2
can be precomputed and reused, while the reciprocal factorials are just
constants.
SSE2-optimised uses only instructions that exist on processors that
support SS2 and nothing more (the minimum requirement for 64-bit
Windows), but utilises some clever branchless bit manipulation on the
domain reduction step. All in all, other than an initial sanity check
(exits immediately, passing the argument to the result if said argument
is zero, negative zero or a NaN), the routines contain no branching or
subroutine calls (they're leaf functions).
SSE4-optimised takes advantage of the ROUNDSD instruction to shave off a
few cycles of processing during the domain reduction. AVX-optimised
takes SSE4, translates it to AVX and also parallelises a fair bit of it,
since the AVX instruction set is much more flexible in where results are
written (I did try using horizontal adding at one point (SSE3 / AVX),
but these instructions are rather slow and the end result was a
performance loss). The slight slowdown seems to be due to unrelated
state switching more than anything, but is hard to benchmark. I did try
to introduce this parallelisation to the SSE versions, but trying to
manipulate the registers into an ideal form added too much overhead.
Finally, FMA finds some instances in the AVX implementation where a
fused multiply-add can be used, reducing overall processing time by
about 10%.
At the same time, I've also written a faster version that calculates up
to only the 1/9! term, which introduces some imprecision, but it is
still good for up to 5 or 6 decimal places.
Single-precision fast sin function assembly language candidates:
Arg. Generic function Deprecated x87 operation
SSE2 optimised SSE4 optimised AVX
optimised FMA optimised
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
0 0.0000000000 4.32 ns 0.0000000000 29.88 ns
0.0000000000 12.82 ns 0.0000000000 11.69 ns 0.0000000000
12.18 ns 0.0000000000 10.21 ns
pi/4 0.7071067691 49.90 ns 0.7071067691 57.59 ns
0.7071067691 12.62 ns 0.7071067691 11.60 ns 0.7071067691
11.99 ns 0.7071067691 10.40 ns
1 0.8414709568 49.61 ns 0.8414709568 57.89 ns
0.8414710164 13.11 ns 0.8414710164 11.89 ns 0.8414710164
12.50 ns 0.8414710164 11.11 ns
pi/2 1.0000000000 266.11 ns 1.0000000000 50.00 ns
1.0000035760 12.50 ns 1.0000035760 12.60 ns 1.0000035760
12.18 ns 1.0000035760 10.40 ns
3pi/4 0.7071067691 52.91 ns 0.7071067691 50.29 ns
0.7071067691 12.82 ns 0.7071067691 12.50 ns 0.7071067691
12.28 ns 0.7071067691 13.01 ns
3 0.1411200017 48.10 ns 0.1411200017 43.41 ns
0.1411200017 12.70 ns 0.1411200017 11.69 ns 0.1411200017
12.60 ns 0.1411200017 10.60 ns
pi -0.0000000874 267.41 ns -0.0000000874 42.31 ns
-0.0000000874 12.99 ns -0.0000000874 12.50 ns -0.0000000874
13.01 ns -0.0000000874 10.89 ns
5pi/4 -0.7071067095 46.22 ns -0.7071067095 50.59 ns
-0.7071067095 13.21 ns -0.7071067095 11.69 ns -0.7071067095
12.30 ns -0.7071067095 10.30 ns
3pi/2 -1.0000000000 270.90 ns -1.0000000000 49.19 ns
-1.0000035760 12.70 ns -1.0000035760 12.11 ns -1.0000035760
14.09 ns -1.0000035760 11.62 ns
7pi/4 -0.7071068883 48.88 ns -0.7071068883 57.20 ns
-0.7071068883 12.50 ns -0.7071068883 12.60 ns -0.7071068883
12.21 ns -0.7071068883 10.30 ns
2pi 0.0000001748 270.51 ns 0.0000001748 42.80 ns
0.0000001748 12.70 ns 0.0000001748 11.91 ns 0.0000001748
14.18 ns 0.0000001748 10.30 ns
-pi/4 -0.7071067691 50.00 ns -0.7071067691 57.59 ns
-0.7071067691 12.82 ns -0.7071067691 11.89 ns -0.7071067691
12.11 ns -0.7071067691 10.79 ns
-pi/2 -1.0000000000 272.19 ns -1.0000000000 49.61 ns
-1.0000035760 15.89 ns -1.0000035760 11.82 ns -1.0000035760
12.60 ns -1.0000035760 10.40 ns
-3pi/4 0.7071067691 48.00 ns 0.7071067691 52.10 ns
0.7071067691 12.60 ns 0.7071067691 11.89 ns 0.7071067691
12.60 ns 0.7071067691 10.72 ns
-pi 0.0000000874 272.49 ns 0.0000000874 42.50 ns
0.0000000874 13.50 ns 0.0000000874 11.69 ns 0.0000000874
12.21 ns 0.0000000874 10.50 ns
-0 -0.0000000000 3.91 ns -0.0000000000 29.30 ns
-0.0000000000 13.31 ns -0.0000000000 12.60 ns -0.0000000000
12.18 ns -0.0000000000 10.21 ns
The fast versions also trim off the sanity check at the start on the
assumption that the argument has an equal chance of being anywhere
between -π and π and not a non-sensical NaN.
If there's support, I would like to consider implementing the 1/13!
version for a new internal "fpc_sin_single" routine in the future
(Double types will still use 'fpc_sin_real), and I will develop a
similar routine on AArch64 in the future. My fast and slightly
imprecise 1/9! version will probably be part of a third-party library.
Future work will focus on writing an equivalent implementation for cos
and also SinCos, since the implementations that don't take advantage of
parallelisation (currently SSE2 and SSE4) with the upper half of the XMM
registers can easily be extended to compute cos in these lanes simply by
adding π/2 to the initial argument.
For tan, I'm not sure yet what the best approach is, whether to just
straight-up calculate the MacLaurin expansion (which is slightly more
complicated) and engage in a tighter domain reduction (tan is periodic
about -π/2 and π/2 rather than -π and π) or simply take the result of
SinCos and divide the result of the lower lane by the result of the
upper lane (tan x = sin x / cos x). Metric testing will tell. atan2
will be another one to consider as well.
Kit
--
This email has been checked for viruses by Avast antivirus software.
www.avast.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.freepascal.org/pipermail/fpc-devel/attachments/20240629/068eda2d/attachment-0001.htm>
More information about the fpc-devel
mailing list