[fpc-devel] Detecting SSE and AVX compiler options

Sun Feb 3 21:26:58 CET 2019

 I would like to improve more of the mathematical functions, but unless
some of them are promoted to internal functions, having micro-optimisations
in the code feels very bloated and will be a maintenance nightmare due to
the amount of interdependency. For example, the floor function can easily
be inlined under SSE 4.1 and AVX because it collapses into just a couple of
lines of assembly, but a general-purpose Pascal version doesn't have that
luxury.

 Still, I would like to put a proof of concept together for the x86_64
version of floor and floor64, possibly one using a platform-specific
implementation and another using a specific node optimisation.  If either
one is too messy, the patch can be rejected.

 One thing that I should ask though... if a unit like Math is compiled with
-fAVX, then another project that uses it is built without any special
floating-point types, is Math recompiled or will it use the code already
built, thereby possibly putting AVX code into a non-AVX project?

 Gareth aka. Kit

 On Sun 03/02/19 16:27, "J. Gareth Moreton" gareth at moreton-family.com
sent:
  It's certainly possible, but feels a little finicky, since floor64 is
not an internal function unlike, say, the trigonometric functions.  It
will also break if the original code is changed.  It feels like a kludge,
especially if another programmer down the line tries to rewrite the
function and is suddenly confused when the execution speed turns out slower
because the node pattern is no longer identical.

  The intention though was to put the improved code, with pre-processor
directives to detect the FPU switches, in the platform-specific include
file and wrap the original procedure in a "{$ifndef FPC_MATH_HAS_FLOOR64}",
similar to how other functions in the Math unit are programmed (e.g.
DivMod).
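
 The wrapping scheme described above can be sketched in a single file.
This is only an illustration, not the actual Math unit layout:
DEMO_USE_PLATFORM_FLOOR is a hypothetical symbol standing in for whatever
test ends up detecting the FPU switches, and the "platform" body is a plain
Pascal placeholder rather than SSE 4.1 or AVX code.

```pascal
program FloorOverrideSketch;
{$mode objfpc}

{ Hypothetical switch; in a real patch this would be replaced by a test
  on whatever symbols the compiler provides for the selected FPU. }
{$define DEMO_USE_PLATFORM_FLOOR}

{ --- part that would live in the platform-specific include file --- }
{$ifdef DEMO_USE_PLATFORM_FLOOR}
  {$define FPC_MATH_HAS_FLOOR64}
function floor64(x: Double): Int64;
begin
  { Placeholder body; a real version would use platform instructions. }
  Result := Trunc(x);
  if x < Result then
    Dec(Result);
end;
{$endif}

{ --- generic fallback, compiled only if no override was provided --- }
{$ifndef FPC_MATH_HAS_FLOOR64}
function floor64(x: Double): Int64;
begin
  Result := Trunc(x) - Ord(Frac(x) < 0);
end;
{$endif}

begin
  WriteLn(floor64(-2.5));  { prints -3 with either implementation }
  WriteLn(floor64(2.5));   { prints 2 with either implementation }
end.
```

 Removing the {$define DEMO_USE_PLATFORM_FLOOR} line makes the generic
version compile instead, which is the behaviour the guard is meant to give.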

 To reassure, I'm aware that "float" is normally "extended" outside of
x86_64, and I would keep my changes constrained to that platform.

 Regarding Trunc, I'm aware that it's just "cvttsd2si %xmm0,%rax", but
being assembly language, it's currently impossible to inline. Admittedly,
this is something I would like to develop and implement at some point: the
ability to inline at least simple assembler routines, where temporary
registers can be replaced with virtual registers and the compiler can
detect registers that map onto parameters and return values. It would be
very platform-specific, but since "inline" is simply ignored when it can't
be used, it wouldn't produce an error.
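
 For illustration, a stand-alone assembler routine built around that
single instruction might look like the sketch below. The name TruncSSE2 is
hypothetical, and the code assumes x86_64, where the first floating-point
argument arrives in xmm0 and an Int64 result is returned in rax. On other
targets it simply falls back to Trunc.

```pascal
program TruncAsmSketch;
{$mode objfpc}

{$ifdef CPUX86_64}
{$asmmode att}
{ Hypothetical helper: truncates towards zero with one SSE2 instruction.
  Being a pure assembler routine, every use is a real call. }
function TruncSSE2(x: Double): Int64; assembler; nostackframe;
asm
  cvttsd2si %xmm0, %rax
end;
{$else}
{ Fallback for other targets so the sketch still compiles. }
function TruncSSE2(x: Double): Int64;
begin
  Result := Trunc(x);
end;
{$endif}

begin
  WriteLn(TruncSSE2(2.7));
  WriteLn(TruncSSE2(-2.7));
end.
```

 Because the body is opaque to the compiler, marking such a routine
"inline" has no effect today, which is the limitation discussed above.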

 Gareth aka. Kit
 P.S. The documentation specifically states that the Floor function rounds
towards negative infinity, unlike Trunc, which rounds towards zero.
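
 The difference between the two rounding directions only shows up for
negative non-integers, as this small sketch suggests. Floor64Sketch is a
hypothetical name, and its body assumes the generic "truncate, then
subtract one if the fractional part is negative" form rather than quoting
the Math unit verbatim.

```pascal
program FloorVsTrunc;
{$mode objfpc}

{ Hypothetical stand-in for a generic floor: Trunc rounds towards zero,
  so subtract 1 whenever the fractional part is negative. }
function Floor64Sketch(x: Double): Int64;
begin
  Result := Trunc(x) - Ord(Frac(x) < 0);
end;

begin
  { Each line prints Trunc first, then the floor sketch. }
  WriteLn(Trunc(2.7), ' ', Floor64Sketch(2.7));    { 2 2 }
  WriteLn(Trunc(-2.7), ' ', Floor64Sketch(-2.7));  { -2 -3 }
  WriteLn(Trunc(-3.0), ' ', Floor64Sketch(-3.0));  { -3 -3 }
end.
```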

 On Sun 03/02/19 13:11, Florian Klämpfl florian at freepascal.org sent:
 On 03.02.19 at 06:26, J. Gareth Moreton wrote:
 > Hi everyone, 
 > 
 > So I'm looking to improve some of the mathematical routines.  However,
 > not all of them are internal functions; some are stored in the Math
 > unit instead.  Some of them are written in assembly language but use the
 > old floating-point stack, or use a slow hack when there's a good
 > alternative available in SSE 4.1, for example, and I would like to see
 > about rewriting some of these functions for x86_64.  However, while I can
 > safely assume the presence of SSE2 on this architecture, what's the best
 > way to detect if "-iCOREAVX" etc. are specified?  Also, if "-iCOREAVX" is
 > specified, does it automatically set "-fAVX" as well?  I'd rather make
 > sure I'm not making incorrect assumptions before I start writing assembly
 > language routines.
 > 
 > As an example of a function that can benefit from a speed-up under 
 > x86_64... the floor() and floor64() functions: 
 > 
 > function floor64(x: float): Int64;
 >   begin
 >     Result:=Trunc(x)-ord(Frac(x)<0);
 >   end;
 > 
 > For time-critical code, this is not ideal because, besides being a
 > function itself, it calls Trunc and Frac, has a subtraction, and another
 > implicit subtraction and assignment due to the condition.  Under SSE4.1,
 > this could be optimised to something like the following:

 Better to make it inline, detect the node pattern and then generate the
 right instructions depending on the FPU switches. While this is still a
 "micro" optimization, it gives the maximum benefit and does not clutter
 RTL units with assembler, and user code using similar sequences benefits
 from it as well.
 _______________________________________________
 fpc-devel maillist - fpc-devel at lists.freepascal.org
 http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel