<!DOCTYPE html>

<html>

  <head>


    <meta http-equiv="content-type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>Hi everyone,</p>

    <p>So for a few weeks now, I've been playing with the idea of

      speeding up the trig functions in Free Pascal, or at least the

      trig functions for Single-precision which find uses in graphical

      applications where deep precision is not absolutely necessary, but

      speed is paramount.  Nevertheless, I sought to preserve the

      precision when compared with the current generic algorithm.</p>

    <p>Note that this only works for Single-precision floating-point, as

      my assembly routines make use of Double-precision for intermediate

      calculations but cannot eliminate rounding errors due to the

      operations utilised.</p>

    <p>These are my current metrics for the sin function:</p>

    <p>Single-precision sin function assembly language candidates:<br>

      <br>

      <font face="monospace">Single-precision sin function assembly

        language candidates:<br>

        <br>

         Arg.       Generic function        Deprecated x87

        operation         SSE2 optimised              SSE4

        optimised               AVX optimised               FMA

        optimised<br>

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------<br>

           0    0.0000000000    3.91 ns     0.0000000000    27.20 ns   

        0.0000000000    2.91 ns     0.0000000000    2.29 ns    

        0.0000000000    2.49 ns     0.0000000000    2.71 ns<br>

          pi/4  0.7071067691    49.19 ns    0.7071067691    55.40 ns   

        0.7071067691    15.31 ns    0.7071067691    15.11 ns   

        0.7071067691    14.60 ns    0.7071067691    12.60 ns<br>

           1    0.8414709568    49.29 ns    0.8414709568    55.81 ns   

        0.8414709568    15.09 ns    0.8414709568    15.41 ns   

        0.8414709568    14.40 ns    0.8414709568    13.09 ns<br>

          pi/2  1.0000000000    266.82 ns   1.0000000000    49.00 ns   

        1.0000000000    15.28 ns    1.0000000000    14.92 ns   

        1.0000000000    14.70 ns    1.0000000000    12.70 ns<br>

         3pi/4  0.7071067691    48.19 ns    0.7071067691    49.00 ns   

        0.7071067691    15.21 ns    0.7071067691    14.99 ns   

        0.7071067691    14.40 ns    0.7071067691    12.11 ns<br>

           3    0.1411200017    48.80 ns    0.1411200017    41.38 ns   

        0.1411200017    15.41 ns    0.1411200017    14.31 ns   

        0.1411200017    15.41 ns    0.1411200017    12.60 ns<br>

          pi    -0.0000000874   266.50 ns   -0.0000000874   40.99 ns   

        -0.0000000874   15.19 ns    -0.0000000874   14.31 ns   

        -0.0000000874   14.31 ns    -0.0000000874   12.60 ns<br>

         5pi/4  -0.7071067095   46.70 ns    -0.7071067095   48.80 ns   

        -0.7071067095   15.80 ns    -0.7071067095   15.70 ns   

        -0.7071067095   14.89 ns    -0.7071067095   12.11 ns<br>

         3pi/2  -1.0000000000   268.90 ns   -1.0000000000   49.00 ns   

        -1.0000000000   15.31 ns    -1.0000000000   14.99 ns   

        -1.0000000000   15.01 ns    -1.0000000000   12.08 ns<br>

         7pi/4  -0.7071068883   49.19 ns    -0.7071068883   54.81 ns   

        -0.7071068883   15.89 ns    -0.7071068883   14.31 ns   

        -0.7071068883   14.50 ns    -0.7071068883   12.99 ns<br>

         2pi    0.0000001748    266.31 ns   0.0000001748    40.60 ns   

        0.0000001748    15.41 ns    0.0000001748    14.40 ns   

        0.0000001748    14.40 ns    0.0000001748    14.28 ns<br>

         -pi/4  -0.7071067691   50.00 ns    -0.7071067691   56.62 ns   

        -0.7071067691   15.28 ns    -0.7071067691   15.01 ns   

        -0.7071067691   14.89 ns    -0.7071067691   12.30 ns<br>

         -pi/2  -1.0000000000   273.80 ns   -1.0000000000   50.20 ns   

        -1.0000000000   15.99 ns    -1.0000000000   14.40 ns   

        -1.0000000000   14.60 ns    -1.0000000000   12.82 ns<br>

        -3pi/4  0.7071067691    48.29 ns    0.7071067691    50.49 ns   

        0.7071067691    15.82 ns    0.7071067691    14.48 ns   

        0.7071067691    14.50 ns    0.7071067691    14.31 ns<br>

         -pi    0.0000000874    274.19 ns   0.0000000874    41.50 ns   

        0.0000000874    15.01 ns    0.0000000874    14.60 ns   

        0.0000000874    14.38 ns    0.0000000874    12.21 ns<br>

          -0    -0.0000000000   4.30 ns     -0.0000000000   28.10 ns   

        -0.0000000000   2.81 ns     -0.0000000000   2.49 ns    

        -0.0000000000   2.42 ns     -0.0000000000   2.88 ns<br>

      </font></p>

    <p>(The fact that sin 2π ≠ 0 is due to the inability to store 2π

      itself with satisfactory precision, so it enters the function with

      a value that won't reduce to exactly zero modulo 2π)<br>

    </p>

    <p>Even against the deprecated FSIN function, my algorithm is about

      three times faster.  My algorithm calculates sin using a

      factorised MacLaurin expansion up to the 1/13! term (enough to

      calculate sin π for Single-precision) as well as performing some

      domain reduction so the argument is kept small.</p>

    <p>Essentially, sin x = x - x<sup>3</sup>/3! + x<sup>5</sup>/5! - x<sup>7</sup>/7!

      + x<sup>9</sup>/9! - x<sup>11</sup>/11! + x<sup>13</sup>/13!,

      which is factorised into x(1 - x<sup>2</sup>(1/3! - x<sup>2</sup>(1/5!

      - x<sup>2</sup>(1/7! - x<sup>2</sup>(1/9! - x<sup>2</sup>(1/11! -

      x<sup>2</sup>/13!)))))), and from here, x<sup>2</sup> can be

      precomputed and reused, while the reciprocal factorials are just

      constants.<br>

    </p>

    <p>SSE2-optimised uses only instructions that exist on processors

      that support SS2 and nothing more (the minimum requirement for

      64-bit Windows), but utilises some clever branchless bit

      manipulation on the domain reduction step.  All in all, other than

      an initial sanity check (exits immediately, passing the argument

      to the result if said argument is zero, negative zero or a NaN),

      the routines contain no branching or subroutine calls (they're

      leaf functions).<br>

    </p>

    <p>SSE4-optimised takes advantage of the ROUNDSD instruction to

      shave off a few cycles of processing during the domain reduction. 

      AVX-optimised takes SSE4, translates it to AVX and also

      parallelises a fair bit of it, since the AVX instruction set is

      much more flexible in where results are written (I did try using

      horizontal adding at one point (SSE3 / AVX), but these

      instructions are rather slow and the end result was a performance

      loss).  The slight slowdown seems to be due to unrelated state

      switching more than anything, but is hard to benchmark.  I did try

      to introduce this parallelisation to the SSE versions, but trying

      to manipulate the registers into an ideal form added too much

      overhead.  Finally, FMA finds some instances in the AVX

      implementation where a fused multiply-add can be used, reducing

      overall processing time by about 10%.<br>

    </p>

    <p>At the same time, I've also written a faster version that

      calculates up to only the 1/9! term, which introduces some

      imprecision, but it is still good for up to 5 or 6 decimal places.<br>

    </p>

    <p><font face="monospace">Single-precision fast sin function

        assembly language candidates:<br>

        <br>

         Arg.       Generic function        Deprecated x87

        operation         SSE2 optimised              SSE4

        optimised               AVX optimised               FMA

        optimised<br>

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------<br>

      </font><font face="monospace">   0    0.0000000000    4.32 ns    

        0.0000000000    29.88 ns    0.0000000000    12.82 ns   

        0.0000000000    11.69 ns    0.0000000000    12.18 ns   

        0.0000000000    10.21 ns<br>

          pi/4  0.7071067691    49.90 ns    0.7071067691    57.59 ns   

        0.7071067691    12.62 ns    0.7071067691    11.60 ns   

        0.7071067691    11.99 ns    0.7071067691    10.40 ns<br>

           1    0.8414709568    49.61 ns    0.8414709568    57.89 ns   

        0.8414710164    13.11 ns    0.8414710164    11.89 ns   

        0.8414710164    12.50 ns    0.8414710164    11.11 ns<br>

          pi/2  1.0000000000    266.11 ns   1.0000000000    50.00 ns   

        1.0000035760    12.50 ns    1.0000035760    12.60 ns   

        1.0000035760    12.18 ns    1.0000035760    10.40 ns<br>

         3pi/4  0.7071067691    52.91 ns    0.7071067691    50.29 ns   

        0.7071067691    12.82 ns    0.7071067691    12.50 ns   

        0.7071067691    12.28 ns    0.7071067691    13.01 ns<br>

           3    0.1411200017    48.10 ns    0.1411200017    43.41 ns   

        0.1411200017    12.70 ns    0.1411200017    11.69 ns   

        0.1411200017    12.60 ns    0.1411200017    10.60 ns<br>

          pi    -0.0000000874   267.41 ns   -0.0000000874   42.31 ns   

        -0.0000000874   12.99 ns    -0.0000000874   12.50 ns   

        -0.0000000874   13.01 ns    -0.0000000874   10.89 ns<br>

         5pi/4  -0.7071067095   46.22 ns    -0.7071067095   50.59 ns   

        -0.7071067095   13.21 ns    -0.7071067095   11.69 ns   

        -0.7071067095   12.30 ns    -0.7071067095   10.30 ns<br>

         3pi/2  -1.0000000000   270.90 ns   -1.0000000000   49.19 ns   

        -1.0000035760   12.70 ns    -1.0000035760   12.11 ns   

        -1.0000035760   14.09 ns    -1.0000035760   11.62 ns<br>

         7pi/4  -0.7071068883   48.88 ns    -0.7071068883   57.20 ns   

        -0.7071068883   12.50 ns    -0.7071068883   12.60 ns   

        -0.7071068883   12.21 ns    -0.7071068883   10.30 ns<br>

         2pi    0.0000001748    270.51 ns   0.0000001748    42.80 ns   

        0.0000001748    12.70 ns    0.0000001748    11.91 ns   

        0.0000001748    14.18 ns    0.0000001748    10.30 ns<br>

         -pi/4  -0.7071067691   50.00 ns    -0.7071067691   57.59 ns   

        -0.7071067691   12.82 ns    -0.7071067691   11.89 ns   

        -0.7071067691   12.11 ns    -0.7071067691   10.79 ns<br>

         -pi/2  -1.0000000000   272.19 ns   -1.0000000000   49.61 ns   

        -1.0000035760   15.89 ns    -1.0000035760   11.82 ns   

        -1.0000035760   12.60 ns    -1.0000035760   10.40 ns<br>

        -3pi/4  0.7071067691    48.00 ns    0.7071067691    52.10 ns   

        0.7071067691    12.60 ns    0.7071067691    11.89 ns   

        0.7071067691    12.60 ns    0.7071067691    10.72 ns<br>

         -pi    0.0000000874    272.49 ns   0.0000000874    42.50 ns   

        0.0000000874    13.50 ns    0.0000000874    11.69 ns   

        0.0000000874    12.21 ns    0.0000000874    10.50 ns<br>

          -0    -0.0000000000   3.91 ns     -0.0000000000   29.30 ns   

        -0.0000000000   13.31 ns    -0.0000000000   12.60 ns   

        -0.0000000000   12.18 ns    -0.0000000000   10.21 ns<br>

      </font></p>

    <p>The fast versions also trim off the sanity check at the start on

      the assumption that the argument has an equal chance of being

      anywhere between -π and π and not a non-sensical NaN.</p>

    <p>If there's support, I would like to consider implementing the

      1/13! version for a new internal "fpc_sin_single" routine in the

      future (Double types will still use 'fpc_sin_real), and I will

      develop a similar routine on AArch64 in the future.  My fast and

      slightly imprecise 1/9! version will probably be part of a

      third-party library.</p>

    <p>Future work will focus on writing an equivalent implementation

      for cos and also SinCos, since the implementations that don't take

      advantage of parallelisation (currently SSE2 and SSE4) with the

      upper half of the XMM registers can easily be extended to compute

      cos in these lanes simply by adding π/2 to the initial argument.</p>

    <p>For tan, I'm not sure yet what the best approach is, whether to

      just straight-up calculate the MacLaurin expansion (which is

      slightly more complicated) and engage in a tighter domain

      reduction (tan is periodic about -π/2 and π/2 rather than -π and

      π) or simply take the result of SinCos and divide the result of

      the lower lane by the result of the upper lane (tan x = sin x /

      cos x).  Metric testing will tell.  atan2 will be another one to

      consider as well.<br>

    </p>

    <p>Kit<br>

    </p>

  <div id="DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2"><br /><table style="border-top: 1px solid #D3D4DE;"><tr><td style="width: 55px; padding-top: 13px;"><a href="https://www.avast.com/sig-email?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=emailclient" target="_blank"><img src="https://s-install.avcdn.net/ipm/preview/icons/icon-envelope-tick-round-orange-animated-no-repeat-v1.gif" alt="" width="46" height="29" style="width: 46px; height: 29px;"/></a></td><td style="width: 470px; padding-top: 12px; color: #41424e; font-size: 13px; font-family: Arial, Helvetica, sans-serif; line-height: 18px;">Virus-free.<a href="https://www.avast.com/sig-email?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=emailclient" target="_blank" style="color: #4453ea;">www.avast.com</a></td></tr></table><a href="#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2" width="1" height="1"> </a></div></body>

</html>