[fpc-devel] Faster trig functions

J. Gareth Moreton gareth at moreton-family.com
Sat Jun 29 03:05:51 CEST 2024


Hi everyone,

So for a few weeks now, I've been playing with the idea of speeding up 
the trig functions in Free Pascal, or at least the trig functions for 
Single-precision which find uses in graphical applications where deep 
precision is not absolutely necessary, but speed is paramount.  
Nevertheless, I sought to preserve the precision when compared with the 
current generic algorithm.

Note that this only works for Single-precision floating-point, as my 
assembly routines make use of Double-precision for intermediate 
calculations but cannot eliminate rounding errors due to the operations 
utilised.

These are my current metrics for the sin function:

Single-precision sin function assembly language candidates:

Single-precision sin function assembly language candidates:

  Arg.       Generic function        Deprecated x87 operation         
SSE2 optimised              SSE4 optimised               AVX 
optimised               FMA optimised
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    0    0.0000000000    3.91 ns     0.0000000000    27.20 ns 
0.0000000000    2.91 ns     0.0000000000    2.29 ns 0.0000000000    2.49 
ns     0.0000000000    2.71 ns
   pi/4  0.7071067691    49.19 ns    0.7071067691    55.40 ns 
0.7071067691    15.31 ns    0.7071067691    15.11 ns 0.7071067691    
14.60 ns    0.7071067691    12.60 ns
    1    0.8414709568    49.29 ns    0.8414709568    55.81 ns 
0.8414709568    15.09 ns    0.8414709568    15.41 ns 0.8414709568    
14.40 ns    0.8414709568    13.09 ns
   pi/2  1.0000000000    266.82 ns   1.0000000000    49.00 ns 
1.0000000000    15.28 ns    1.0000000000    14.92 ns 1.0000000000    
14.70 ns    1.0000000000    12.70 ns
  3pi/4  0.7071067691    48.19 ns    0.7071067691    49.00 ns 
0.7071067691    15.21 ns    0.7071067691    14.99 ns 0.7071067691    
14.40 ns    0.7071067691    12.11 ns
    3    0.1411200017    48.80 ns    0.1411200017    41.38 ns 
0.1411200017    15.41 ns    0.1411200017    14.31 ns 0.1411200017    
15.41 ns    0.1411200017    12.60 ns
   pi    -0.0000000874   266.50 ns   -0.0000000874   40.99 ns 
-0.0000000874   15.19 ns    -0.0000000874   14.31 ns -0.0000000874   
14.31 ns    -0.0000000874   12.60 ns
  5pi/4  -0.7071067095   46.70 ns    -0.7071067095   48.80 ns 
-0.7071067095   15.80 ns    -0.7071067095   15.70 ns -0.7071067095   
14.89 ns    -0.7071067095   12.11 ns
  3pi/2  -1.0000000000   268.90 ns   -1.0000000000   49.00 ns 
-1.0000000000   15.31 ns    -1.0000000000   14.99 ns -1.0000000000   
15.01 ns    -1.0000000000   12.08 ns
  7pi/4  -0.7071068883   49.19 ns    -0.7071068883   54.81 ns 
-0.7071068883   15.89 ns    -0.7071068883   14.31 ns -0.7071068883   
14.50 ns    -0.7071068883   12.99 ns
  2pi    0.0000001748    266.31 ns   0.0000001748    40.60 ns 
0.0000001748    15.41 ns    0.0000001748    14.40 ns 0.0000001748    
14.40 ns    0.0000001748    14.28 ns
  -pi/4  -0.7071067691   50.00 ns    -0.7071067691   56.62 ns 
-0.7071067691   15.28 ns    -0.7071067691   15.01 ns -0.7071067691   
14.89 ns    -0.7071067691   12.30 ns
  -pi/2  -1.0000000000   273.80 ns   -1.0000000000   50.20 ns 
-1.0000000000   15.99 ns    -1.0000000000   14.40 ns -1.0000000000   
14.60 ns    -1.0000000000   12.82 ns
-3pi/4  0.7071067691    48.29 ns    0.7071067691    50.49 ns 
0.7071067691    15.82 ns    0.7071067691    14.48 ns 0.7071067691    
14.50 ns    0.7071067691    14.31 ns
  -pi    0.0000000874    274.19 ns   0.0000000874    41.50 ns 
0.0000000874    15.01 ns    0.0000000874    14.60 ns 0.0000000874    
14.38 ns    0.0000000874    12.21 ns
   -0    -0.0000000000   4.30 ns     -0.0000000000   28.10 ns 
-0.0000000000   2.81 ns     -0.0000000000   2.49 ns -0.0000000000   2.42 
ns     -0.0000000000   2.88 ns

(The fact that sin 2π ≠ 0 is due to the inability to store 2π itself 
with satisfactory precision, so it enters the function with a value that 
won't reduce to exactly zero modulo 2π)

Even against the deprecated FSIN function, my algorithm is about three 
times faster.  My algorithm calculates sin using a factorised MacLaurin 
expansion up to the 1/13! term (enough to calculate sin π for 
Single-precision) as well as performing some domain reduction so the 
argument is kept small.

Essentially, sin x = x - x^3 /3! + x^5 /5! - x^7 /7! + x^9 /9! - x^11 
/11! + x^13 /13!, which is factorised into x(1 - x^2 (1/3! - x^2 (1/5! - 
x^2 (1/7! - x^2 (1/9! - x^2 (1/11! - x^2 /13!)))))), and from here, x^2 
can be precomputed and reused, while the reciprocal factorials are just 
constants.

SSE2-optimised uses only instructions that exist on processors that 
support SS2 and nothing more (the minimum requirement for 64-bit 
Windows), but utilises some clever branchless bit manipulation on the 
domain reduction step.  All in all, other than an initial sanity check 
(exits immediately, passing the argument to the result if said argument 
is zero, negative zero or a NaN), the routines contain no branching or 
subroutine calls (they're leaf functions).

SSE4-optimised takes advantage of the ROUNDSD instruction to shave off a 
few cycles of processing during the domain reduction. AVX-optimised 
takes SSE4, translates it to AVX and also parallelises a fair bit of it, 
since the AVX instruction set is much more flexible in where results are 
written (I did try using horizontal adding at one point (SSE3 / AVX), 
but these instructions are rather slow and the end result was a 
performance loss).  The slight slowdown seems to be due to unrelated 
state switching more than anything, but is hard to benchmark.  I did try 
to introduce this parallelisation to the SSE versions, but trying to 
manipulate the registers into an ideal form added too much overhead.  
Finally, FMA finds some instances in the AVX implementation where a 
fused multiply-add can be used, reducing overall processing time by 
about 10%.

At the same time, I've also written a faster version that calculates up 
to only the 1/9! term, which introduces some imprecision, but it is 
still good for up to 5 or 6 decimal places.

Single-precision fast sin function assembly language candidates:

  Arg.       Generic function        Deprecated x87 operation         
SSE2 optimised              SSE4 optimised               AVX 
optimised               FMA optimised
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    0    0.0000000000    4.32 ns 0.0000000000    29.88 ns    
0.0000000000    12.82 ns 0.0000000000    11.69 ns    0.0000000000    
12.18 ns 0.0000000000    10.21 ns
   pi/4  0.7071067691    49.90 ns    0.7071067691    57.59 ns 
0.7071067691    12.62 ns    0.7071067691    11.60 ns 0.7071067691    
11.99 ns    0.7071067691    10.40 ns
    1    0.8414709568    49.61 ns    0.8414709568    57.89 ns 
0.8414710164    13.11 ns    0.8414710164    11.89 ns 0.8414710164    
12.50 ns    0.8414710164    11.11 ns
   pi/2  1.0000000000    266.11 ns   1.0000000000    50.00 ns 
1.0000035760    12.50 ns    1.0000035760    12.60 ns 1.0000035760    
12.18 ns    1.0000035760    10.40 ns
  3pi/4  0.7071067691    52.91 ns    0.7071067691    50.29 ns 
0.7071067691    12.82 ns    0.7071067691    12.50 ns 0.7071067691    
12.28 ns    0.7071067691    13.01 ns
    3    0.1411200017    48.10 ns    0.1411200017    43.41 ns 
0.1411200017    12.70 ns    0.1411200017    11.69 ns 0.1411200017    
12.60 ns    0.1411200017    10.60 ns
   pi    -0.0000000874   267.41 ns   -0.0000000874   42.31 ns 
-0.0000000874   12.99 ns    -0.0000000874   12.50 ns -0.0000000874   
13.01 ns    -0.0000000874   10.89 ns
  5pi/4  -0.7071067095   46.22 ns    -0.7071067095   50.59 ns 
-0.7071067095   13.21 ns    -0.7071067095   11.69 ns -0.7071067095   
12.30 ns    -0.7071067095   10.30 ns
  3pi/2  -1.0000000000   270.90 ns   -1.0000000000   49.19 ns 
-1.0000035760   12.70 ns    -1.0000035760   12.11 ns -1.0000035760   
14.09 ns    -1.0000035760   11.62 ns
  7pi/4  -0.7071068883   48.88 ns    -0.7071068883   57.20 ns 
-0.7071068883   12.50 ns    -0.7071068883   12.60 ns -0.7071068883   
12.21 ns    -0.7071068883   10.30 ns
  2pi    0.0000001748    270.51 ns   0.0000001748    42.80 ns 
0.0000001748    12.70 ns    0.0000001748    11.91 ns 0.0000001748    
14.18 ns    0.0000001748    10.30 ns
  -pi/4  -0.7071067691   50.00 ns    -0.7071067691   57.59 ns 
-0.7071067691   12.82 ns    -0.7071067691   11.89 ns -0.7071067691   
12.11 ns    -0.7071067691   10.79 ns
  -pi/2  -1.0000000000   272.19 ns   -1.0000000000   49.61 ns 
-1.0000035760   15.89 ns    -1.0000035760   11.82 ns -1.0000035760   
12.60 ns    -1.0000035760   10.40 ns
-3pi/4  0.7071067691    48.00 ns    0.7071067691    52.10 ns 
0.7071067691    12.60 ns    0.7071067691    11.89 ns 0.7071067691    
12.60 ns    0.7071067691    10.72 ns
  -pi    0.0000000874    272.49 ns   0.0000000874    42.50 ns 
0.0000000874    13.50 ns    0.0000000874    11.69 ns 0.0000000874    
12.21 ns    0.0000000874    10.50 ns
   -0    -0.0000000000   3.91 ns     -0.0000000000   29.30 ns 
-0.0000000000   13.31 ns    -0.0000000000   12.60 ns -0.0000000000   
12.18 ns    -0.0000000000   10.21 ns

The fast versions also trim off the sanity check at the start on the 
assumption that the argument has an equal chance of being anywhere 
between -π and π and not a non-sensical NaN.

If there's support, I would like to consider implementing the 1/13! 
version for a new internal "fpc_sin_single" routine in the future 
(Double types will still use 'fpc_sin_real), and I will develop a 
similar routine on AArch64 in the future.  My fast and slightly 
imprecise 1/9! version will probably be part of a third-party library.

Future work will focus on writing an equivalent implementation for cos 
and also SinCos, since the implementations that don't take advantage of 
parallelisation (currently SSE2 and SSE4) with the upper half of the XMM 
registers can easily be extended to compute cos in these lanes simply by 
adding π/2 to the initial argument.

For tan, I'm not sure yet what the best approach is, whether to just 
straight-up calculate the MacLaurin expansion (which is slightly more 
complicated) and engage in a tighter domain reduction (tan is periodic 
about -π/2 and π/2 rather than -π and π) or simply take the result of 
SinCos and divide the result of the lower lane by the result of the 
upper lane (tan x = sin x / cos x).  Metric testing will tell.  atan2 
will be another one to consider as well.

Kit


-- 
This email has been checked for viruses by Avast antivirus software.
www.avast.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.freepascal.org/pipermail/fpc-devel/attachments/20240629/068eda2d/attachment-0001.htm>


More information about the fpc-devel mailing list