<HTML>

<div>I would like to improve more of the mathematical functions, but unless some of them are promoted to internal functions, micro-optimisations in the code feel bloated and will be a maintenance nightmare because of the amount of interdependency. For example, the floor function can easily be inlined under SSE 4.1 and AVX because it collapses into just a couple of lines of assembly, but a general-purpose Pascal version doesn't have that luxury.<br>

</div><div><br>

</div><div>Still, I would like to put a proof of concept together for the x86_64 version of floor and floor64, possibly one using a platform-specific implementation and another using a specific node optimisation.  If either one is too messy, the patch can be rejected.</div><div><br>

</div><div>One thing I should ask, though: if a unit like Math is compiled with -CfAVX, and another project that uses it is then built without any special floating-point type, is Math recompiled, or will the project use the code already built, thereby possibly putting AVX code into a non-AVX project?<br>

</div><br>

Gareth aka. Kit<br>

 <br>

<br>

<span style="font-weight: bold;">On Sun 03/02/19 16:27 , "J. Gareth Moreton" gareth@moreton-family.com sent:<br>

</span><blockquote style="BORDER-LEFT: #F5F5F5 2px solid; MARGIN-LEFT: 5px; MARGIN-RIGHT:0px; PADDING-LEFT: 5px; PADDING-RIGHT: 0px"> 

It's certainly possible, but feels a little finicky, since floor64 is not an internal function, unlike, say, the trigonometric functions.  It will also break if the original code is changed.  It feels like a kludge, especially if another programmer down the line rewrites the function and is then confused when it runs slower because the node pattern is no longer identical.<br>


<br>

 
<div> The intention though was to put the improved code, with pre-processor directives to detect the FPU switches, in the platform-specific include file and wrap the original procedure in a "{$ifndef FPC_MATH_HAS_FLOOR64}", similar to how other functions in the Math unit are programmed (e.g. DivMod).<br>

 
<br>

 
To reassure, I'm aware that "float" is normally "extended" outside of x86_64, and I would keep my changes constrained to that platform.<br>

 
</div><div><br>

 
</div><div>Regarding Trunc, I'm aware that it's just "cvttsd2si %xmm0,%rax", but being assembly language, it's currently impossible to inline. Admittedly, this is something I would like to develop and implement at some point: the ability to inline at least simple assembler routines, where temporary registers can be replaced with virtual registers and the compiler can detect which registers map onto parameters and return values. It would be very platform-specific, but since "inline" is simply ignored when it can't be used, it wouldn't cause an error.<br>


<br>

 
Gareth aka. Kit</div><div><br>

 
</div><div>P.S. The documentation specifically states that the Floor function rounds towards negative infinity, unlike Trunc, which rounds towards zero.<br>

 
</div><div><br>

 
</div><br>
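<div>A minimal sketch of the Floor/Trunc distinction, assuming the Math unit's existing <code>Floor64</code> and the system <code>Trunc</code>; the program name and test values are illustrative only:</div>
<pre><code>program FloorVsTrunc;
{$mode objfpc}
uses
  Math;
begin
  WriteLn(Trunc(-2.5));    // rounds towards zero: prints -2
  WriteLn(Floor64(-2.5));  // rounds towards negative infinity: prints -3
  WriteLn(Trunc(2.5));     // prints 2
  WriteLn(Floor64(2.5));   // prints 2; the two agree for non-negative input
  WriteLn(Floor64(-3.0));  // prints -3; whole numbers are unchanged
end.
</code></pre>
<div>The two functions differ only for negative inputs with a fractional part, which is the case the "ord(Frac(x)&lt;0)" correction in the generic floor64 handles.</div>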

 
<span style="font-weight: bold;">On Sun 03/02/19 13:11 , Florian Klämpfl florian@freepascal.org sent:<br>

 
</span><blockquote style="BORDER-LEFT: #F5F5F5 2px solid; MARGIN-LEFT: 5px; MARGIN-RIGHT: 0px; PADDING-LEFT: 5px; PADDING-RIGHT: 0px">Am 03.02.19 um 06:26 schrieb J. Gareth Moreton: 

<br>

 
<span style="color: rgb(102, 102, 102);">> Hi everyone, 

</span><br>

 
<span style="color: rgb(102, 102, 102);">> 

</span><br>

 
<span style="color: rgb(102, 102, 102);">> So I'm looking to improve some of the mathematical routines.  However, 

</span><br>

 
<span style="color: rgb(102, 102, 102);">> not all of them are internal functions and are stored in the Math 

</span><br>

 
<span style="color: rgb(102, 102, 102);">> unit.  Some of them are written in assembly language but use the old 

</span><br>

 
<span style="color: rgb(102, 102, 102);">> floating-point stack, or use a slow hack when there's a good alternative 

</span><br>

 
<span style="color: rgb(102, 102, 102);">> available in SSE 4.1, for example, and I would like to see about 

</span><br>

 
<span style="color: rgb(102, 102, 102);">> rewriting some of these functions for x86_64.  However, while I can 

</span><br>

 
<span style="color: rgb(102, 102, 102);">> safely assume the presence of SSE2 on this architecture, what's the best 

</span><br>

 
<span style="color: rgb(102, 102, 102);">> way to detect if "-CpCOREAVX" etc. are specified?  Also, if "-CpCOREAVX" is set, 

</span><br>

 
<span style="color: rgb(102, 102, 102);">> does it automatically set "-CfAVX" as well?  I'd rather make sure I'm not 

</span><br>

 
<span style="color: rgb(102, 102, 102);">> making incorrect assumptions before I start writing assembly language 

</span><br>

 
<span style="color: rgb(102, 102, 102);">> routines. 

</span><br>

 
<span style="color: rgb(102, 102, 102);">> 

</span><br>

 
<span style="color: rgb(102, 102, 102);">> As an example of a function that can benefit from a speed-up under 

</span><br>

 
<span style="color: rgb(102, 102, 102);">> x86_64... the floor() and floor64() functions: 

</span><br>

 
<span style="color: rgb(102, 102, 102);">> 

</span><br>

 
<span style="color: rgb(102, 102, 102);">> function floor64(x: float): Int64; 

</span><br>

 
<span style="color: rgb(102, 102, 102);">>   begin 

</span><br>

 
<span style="color: rgb(102, 102, 102);">>     Result:=Trunc(x)-ord(Frac(x)<0); 

</span><br>

 
<span style="color: rgb(102, 102, 102);">>   end; 

</span><br>

 
<span style="color: rgb(102, 102, 102);">> 

</span><br>

 
<span style="color: rgb(102, 102, 102);">> For time-critical code, this is not ideal because, besides being a 

</span><br>

 
<span style="color: rgb(102, 102, 102);">> function itself, it calls Trunc, Frac, has a subtraction, and another 

</span><br>

 
<span style="color: rgb(102, 102, 102);">> implicit subtraction and assignment due to the condition.  Under SSE4.1, 

</span><br>

 
<span style="color: rgb(102, 102, 102);">> this could be optimised to something like the following: 

</span><br>

 
<br>

 
Better make it inline, detect the node pattern and then generate the 

<br>

 
right instructions depending on the fpu switches. While this is still a 

<br>

 
"micro" optimization, it has its maximum benefit and does not clutter 

<br>

 
RTL units with assembler, and user code using similar sequences benefits 

<br>

 
from it as well. 

<br>

 
_______________________________________________ 

<br>

 
fpc-devel maillist - <a href="mailto:fpc-devel@lists.freepascal.org">fpc-devel@lists.freepascal.org</a> 

<br>

 
<a target="_blank" href="http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel">http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel</a> 

<br>

 
<br>

 
<br>

 
</blockquote> 




<br>


</blockquote></HTML>