<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Hi everyone,</p>
<p>So for a few weeks now, I've been playing with the idea of
speeding up the trig functions in Free Pascal, or at least the
trig functions for Single-precision which find uses in graphical
applications where deep precision is not absolutely necessary, but
speed is paramount. Nevertheless, I sought to preserve the
precision when compared with the current generic algorithm.</p>
<p>Note that this only works for Single-precision floating-point, as
my assembly routines make use of Double-precision for intermediate
calculations but cannot eliminate rounding errors due to the
operations utilised.</p>
<p>These are my current metrics for the sin function:</p>
<p>Single-precision sin function assembly language candidates:<br>
<br>
<font face="monospace">Single-precision sin function assembly
language candidates:<br>
<br>
Arg. Generic function Deprecated x87
operation SSE2 optimised SSE4
optimised AVX optimised FMA
optimised<br>
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------<br>
0 0.0000000000 3.91 ns 0.0000000000 27.20 ns
0.0000000000 2.91 ns 0.0000000000 2.29 ns
0.0000000000 2.49 ns 0.0000000000 2.71 ns<br>
pi/4 0.7071067691 49.19 ns 0.7071067691 55.40 ns
0.7071067691 15.31 ns 0.7071067691 15.11 ns
0.7071067691 14.60 ns 0.7071067691 12.60 ns<br>
1 0.8414709568 49.29 ns 0.8414709568 55.81 ns
0.8414709568 15.09 ns 0.8414709568 15.41 ns
0.8414709568 14.40 ns 0.8414709568 13.09 ns<br>
pi/2 1.0000000000 266.82 ns 1.0000000000 49.00 ns
1.0000000000 15.28 ns 1.0000000000 14.92 ns
1.0000000000 14.70 ns 1.0000000000 12.70 ns<br>
3pi/4 0.7071067691 48.19 ns 0.7071067691 49.00 ns
0.7071067691 15.21 ns 0.7071067691 14.99 ns
0.7071067691 14.40 ns 0.7071067691 12.11 ns<br>
3 0.1411200017 48.80 ns 0.1411200017 41.38 ns
0.1411200017 15.41 ns 0.1411200017 14.31 ns
0.1411200017 15.41 ns 0.1411200017 12.60 ns<br>
pi -0.0000000874 266.50 ns -0.0000000874 40.99 ns
-0.0000000874 15.19 ns -0.0000000874 14.31 ns
-0.0000000874 14.31 ns -0.0000000874 12.60 ns<br>
5pi/4 -0.7071067095 46.70 ns -0.7071067095 48.80 ns
-0.7071067095 15.80 ns -0.7071067095 15.70 ns
-0.7071067095 14.89 ns -0.7071067095 12.11 ns<br>
3pi/2 -1.0000000000 268.90 ns -1.0000000000 49.00 ns
-1.0000000000 15.31 ns -1.0000000000 14.99 ns
-1.0000000000 15.01 ns -1.0000000000 12.08 ns<br>
7pi/4 -0.7071068883 49.19 ns -0.7071068883 54.81 ns
-0.7071068883 15.89 ns -0.7071068883 14.31 ns
-0.7071068883 14.50 ns -0.7071068883 12.99 ns<br>
2pi 0.0000001748 266.31 ns 0.0000001748 40.60 ns
0.0000001748 15.41 ns 0.0000001748 14.40 ns
0.0000001748 14.40 ns 0.0000001748 14.28 ns<br>
-pi/4 -0.7071067691 50.00 ns -0.7071067691 56.62 ns
-0.7071067691 15.28 ns -0.7071067691 15.01 ns
-0.7071067691 14.89 ns -0.7071067691 12.30 ns<br>
-pi/2 -1.0000000000 273.80 ns -1.0000000000 50.20 ns
-1.0000000000 15.99 ns -1.0000000000 14.40 ns
-1.0000000000 14.60 ns -1.0000000000 12.82 ns<br>
-3pi/4 0.7071067691 48.29 ns 0.7071067691 50.49 ns
0.7071067691 15.82 ns 0.7071067691 14.48 ns
0.7071067691 14.50 ns 0.7071067691 14.31 ns<br>
-pi 0.0000000874 274.19 ns 0.0000000874 41.50 ns
0.0000000874 15.01 ns 0.0000000874 14.60 ns
0.0000000874 14.38 ns 0.0000000874 12.21 ns<br>
-0 -0.0000000000 4.30 ns -0.0000000000 28.10 ns
-0.0000000000 2.81 ns -0.0000000000 2.49 ns
-0.0000000000 2.42 ns -0.0000000000 2.88 ns<br>
</font></p>
<p>(The fact that sin 2π ≠ 0 is due to the inability to store 2π
itself with satisfactory precision, so it enters the function with
a value that won't reduce to exactly zero modulo 2π)<br>
</p>
<p>Even against the deprecated FSIN function, my algorithm is about
three times faster. My algorithm calculates sin using a
factorised MacLaurin expansion up to the 1/13! term (enough to
calculate sin π for Single-precision) as well as performing some
domain reduction so the argument is kept small.</p>
<p>Essentially, sin x = x - x<sup>3</sup>/3! + x<sup>5</sup>/5! - x<sup>7</sup>/7!
+ x<sup>9</sup>/9! - x<sup>11</sup>/11! + x<sup>13</sup>/13!,
which is factorised into x(1 - x<sup>2</sup>(1/3! - x<sup>2</sup>(1/5!
- x<sup>2</sup>(1/7! - x<sup>2</sup>(1/9! - x<sup>2</sup>(1/11! -
x<sup>2</sup>/13!)))))), and from here, x<sup>2</sup> can be
precomputed and reused, while the reciprocal factorials are just
constants.<br>
</p>
<p>SSE2-optimised uses only instructions that exist on processors
that support SS2 and nothing more (the minimum requirement for
64-bit Windows), but utilises some clever branchless bit
manipulation on the domain reduction step. All in all, other than
an initial sanity check (exits immediately, passing the argument
to the result if said argument is zero, negative zero or a NaN),
the routines contain no branching or subroutine calls (they're
leaf functions).<br>
</p>
<p>SSE4-optimised takes advantage of the ROUNDSD instruction to
shave off a few cycles of processing during the domain reduction.
AVX-optimised takes SSE4, translates it to AVX and also
parallelises a fair bit of it, since the AVX instruction set is
much more flexible in where results are written (I did try using
horizontal adding at one point (SSE3 / AVX), but these
instructions are rather slow and the end result was a performance
loss). The slight slowdown seems to be due to unrelated state
switching more than anything, but is hard to benchmark. I did try
to introduce this parallelisation to the SSE versions, but trying
to manipulate the registers into an ideal form added too much
overhead. Finally, FMA finds some instances in the AVX
implementation where a fused multiply-add can be used, reducing
overall processing time by about 10%.<br>
</p>
<p>At the same time, I've also written a faster version that
calculates up to only the 1/9! term, which introduces some
imprecision, but it is still good for up to 5 or 6 decimal places.<br>
</p>
<p><font face="monospace">Single-precision fast sin function
assembly language candidates:<br>
<br>
Arg. Generic function Deprecated x87
operation SSE2 optimised SSE4
optimised AVX optimised FMA
optimised<br>
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------<br>
</font><font face="monospace"> 0 0.0000000000 4.32 ns
0.0000000000 29.88 ns 0.0000000000 12.82 ns
0.0000000000 11.69 ns 0.0000000000 12.18 ns
0.0000000000 10.21 ns<br>
pi/4 0.7071067691 49.90 ns 0.7071067691 57.59 ns
0.7071067691 12.62 ns 0.7071067691 11.60 ns
0.7071067691 11.99 ns 0.7071067691 10.40 ns<br>
1 0.8414709568 49.61 ns 0.8414709568 57.89 ns
0.8414710164 13.11 ns 0.8414710164 11.89 ns
0.8414710164 12.50 ns 0.8414710164 11.11 ns<br>
pi/2 1.0000000000 266.11 ns 1.0000000000 50.00 ns
1.0000035760 12.50 ns 1.0000035760 12.60 ns
1.0000035760 12.18 ns 1.0000035760 10.40 ns<br>
3pi/4 0.7071067691 52.91 ns 0.7071067691 50.29 ns
0.7071067691 12.82 ns 0.7071067691 12.50 ns
0.7071067691 12.28 ns 0.7071067691 13.01 ns<br>
3 0.1411200017 48.10 ns 0.1411200017 43.41 ns
0.1411200017 12.70 ns 0.1411200017 11.69 ns
0.1411200017 12.60 ns 0.1411200017 10.60 ns<br>
pi -0.0000000874 267.41 ns -0.0000000874 42.31 ns
-0.0000000874 12.99 ns -0.0000000874 12.50 ns
-0.0000000874 13.01 ns -0.0000000874 10.89 ns<br>
5pi/4 -0.7071067095 46.22 ns -0.7071067095 50.59 ns
-0.7071067095 13.21 ns -0.7071067095 11.69 ns
-0.7071067095 12.30 ns -0.7071067095 10.30 ns<br>
3pi/2 -1.0000000000 270.90 ns -1.0000000000 49.19 ns
-1.0000035760 12.70 ns -1.0000035760 12.11 ns
-1.0000035760 14.09 ns -1.0000035760 11.62 ns<br>
7pi/4 -0.7071068883 48.88 ns -0.7071068883 57.20 ns
-0.7071068883 12.50 ns -0.7071068883 12.60 ns
-0.7071068883 12.21 ns -0.7071068883 10.30 ns<br>
2pi 0.0000001748 270.51 ns 0.0000001748 42.80 ns
0.0000001748 12.70 ns 0.0000001748 11.91 ns
0.0000001748 14.18 ns 0.0000001748 10.30 ns<br>
-pi/4 -0.7071067691 50.00 ns -0.7071067691 57.59 ns
-0.7071067691 12.82 ns -0.7071067691 11.89 ns
-0.7071067691 12.11 ns -0.7071067691 10.79 ns<br>
-pi/2 -1.0000000000 272.19 ns -1.0000000000 49.61 ns
-1.0000035760 15.89 ns -1.0000035760 11.82 ns
-1.0000035760 12.60 ns -1.0000035760 10.40 ns<br>
-3pi/4 0.7071067691 48.00 ns 0.7071067691 52.10 ns
0.7071067691 12.60 ns 0.7071067691 11.89 ns
0.7071067691 12.60 ns 0.7071067691 10.72 ns<br>
-pi 0.0000000874 272.49 ns 0.0000000874 42.50 ns
0.0000000874 13.50 ns 0.0000000874 11.69 ns
0.0000000874 12.21 ns 0.0000000874 10.50 ns<br>
-0 -0.0000000000 3.91 ns -0.0000000000 29.30 ns
-0.0000000000 13.31 ns -0.0000000000 12.60 ns
-0.0000000000 12.18 ns -0.0000000000 10.21 ns<br>
</font></p>
<p>The fast versions also trim off the sanity check at the start on
the assumption that the argument has an equal chance of being
anywhere between -π and π and not a non-sensical NaN.</p>
<p>If there's support, I would like to consider implementing the
1/13! version for a new internal "fpc_sin_single" routine in the
future (Double types will still use 'fpc_sin_real), and I will
develop a similar routine on AArch64 in the future. My fast and
slightly imprecise 1/9! version will probably be part of a
third-party library.</p>
<p>Future work will focus on writing an equivalent implementation
for cos and also SinCos, since the implementations that don't take
advantage of parallelisation (currently SSE2 and SSE4) with the
upper half of the XMM registers can easily be extended to compute
cos in these lanes simply by adding π/2 to the initial argument.</p>
<p>For tan, I'm not sure yet what the best approach is, whether to
just straight-up calculate the MacLaurin expansion (which is
slightly more complicated) and engage in a tighter domain
reduction (tan is periodic about -π/2 and π/2 rather than -π and
π) or simply take the result of SinCos and divide the result of
the lower lane by the result of the upper lane (tan x = sin x /
cos x). Metric testing will tell. atan2 will be another one to
consider as well.<br>
</p>
<p>Kit<br>
</p>
<div id="DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2"><br /><table style="border-top: 1px solid #D3D4DE;"><tr><td style="width: 55px; padding-top: 13px;"><a href="https://www.avast.com/sig-email?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=emailclient" target="_blank"><img src="https://s-install.avcdn.net/ipm/preview/icons/icon-envelope-tick-round-orange-animated-no-repeat-v1.gif" alt="" width="46" height="29" style="width: 46px; height: 29px;"/></a></td><td style="width: 470px; padding-top: 12px; color: #41424e; font-size: 13px; font-family: Arial, Helvetica, sans-serif; line-height: 18px;">Virus-free.<a href="https://www.avast.com/sig-email?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=emailclient" target="_blank" style="color: #4453ea;">www.avast.com</a></td></tr></table><a href="#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2" width="1" height="1"> </a></div></body>
</html>