[fpc-devel] @Gareth - Feedback on optimizations

Sun Jan 9 20:25:08 CET 2022

On 09/01/2022 19:04, Florian Klämpfl via fpc-devel wrote:
>> On 09.01.2022 at 18:59, Martin Frb via fpc-devel <fpc-devel at lists.freepascal.org> wrote:
>>
>> Just thought this may be interesting.
>> Though the results are for "eyeballing" at best -- see the video linked in Jonas' recent mail:
>> https://lists.freepascal.org/pipermail/fpc-devel/2022-January/044336.html
>>
>> I ran my FpDebug test case with 3.2.3 and 3.3.1 (both from early Dec)
>>
>> The test runs about 15 secs (evaluating approx 70k watches).
>> The test was done with a modified FpDebug for single threaded execution (threaded results are similar, but have more variance)
>>
>> Tests done with
>>
>> FPC 3.2.3 -O2  :  14.1 - 14.3 sec
>> FPC 3.3.1 -O2  :  13.3 - 13.5 sec
>>
>> FPC 3.2.3 -O-1 -gw3  :  15.0 - 15.2 sec
>> FPC 3.3.1 -O-1 -gw3   :  14.3 - 14.5 sec
>>
>> So on my PC, with this particular compilation setting the speedup (intentional plus side-effects) is 4 to 5 percent.
> What about -O3?
>
Oops, the above was misleading.

The -O2 / -O-1 settings applied only to the compiler and rtl.

FpDebug, the testcase, and LazUtils were built with -O4 in all cases.

Running with FPC/rtl built with -O4:

FPC 3.2.3 -O4 : 13.8 - 14.0 sec
FPC 3.3.1 -O4 : 13.0 - 13.2 sec

Compiling the TEST (and FpDebug/LazUtils) with -O3 instead:

TEST -O3 / FPC 3.2.3 -O2 : 13.9 - 14.0 sec
TEST -O3 / FPC 3.3.1 -O2 : 13.1 - 13.2 sec

TEST -O3 / FPC 3.2.3 -O-1 -gw3 : 14.9 - 15.2 sec
TEST -O3 / FPC 3.3.1 -O-1 -gw3 : 14.1 - 14.3 sec

Interestingly, "TEST -O3 / FPC -O2" is a bit faster than 
"TEST -O4 / FPC -O2" with the same fpc in the original post (the very 
first test in the original mail).
So either there are side effects, or something is overdone....
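
For reference, the relative speedup implied by the timing ranges above can be worked out roughly as follows. This is only a sketch: the numbers are the ranges quoted in this mail, the helper name `speedup_bounds` is made up for illustration, and given the run-to-run variance the results are rough bounds rather than measurements.

```python
# Timing ranges in seconds, copied from the results in this mail: (fastest, slowest).
RESULTS = {
    "FPC -O4": {"3.2.3": (13.8, 14.0), "3.3.1": (13.0, 13.2)},
    "TEST -O3 / FPC -O2": {"3.2.3": (13.9, 14.0), "3.3.1": (13.1, 13.2)},
    "TEST -O3 / FPC -O-1 -gw3": {"3.2.3": (14.9, 15.2), "3.3.1": (14.1, 14.3)},
}


def speedup_bounds(old, new):
    """Return (pessimistic, optimistic) speedup of `new` over `old` in percent.

    Pessimistic pairs the fastest old run with the slowest new run;
    optimistic pairs the slowest old run with the fastest new run.
    """
    old_fast, old_slow = old
    new_fast, new_slow = new
    pessimistic = (old_fast - new_slow) / old_fast * 100.0
    optimistic = (old_slow - new_fast) / old_slow * 100.0
    return pessimistic, optimistic


def report():
    for label, runs in RESULTS.items():
        low, high = speedup_bounds(runs["3.2.3"], runs["3.3.1"])
        print(f"{label}: {low:.1f}% to {high:.1f}% faster with 3.3.1")


if __name__ == "__main__":
    report()
```

With ranges this narrow, the spread between the pessimistic and optimistic bound is about as large as the effect itself, which is why the figures are only good for eyeballing.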

---------------------------
I'll see to it that I commit the testcase with an option to run it 
without threading, so that others can run it too.
Maybe (if there is interest, and I find a few minutes) I'll add a 
benchmark wrapper.
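
A benchmark wrapper could be as simple as the sketch below. It is not the wrapper mentioned above (which does not exist yet), just an illustration in Python: the function names `time_runs` and `summarize` are hypothetical, and the command to run is a placeholder to be replaced by the real test executable.

```python
import subprocess
import sys
import time


def time_runs(cmd, runs=5):
    """Run `cmd` `runs` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True aborts on a failing run, so failures are never reported as timings.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (fastest, slowest), matching the 'min - max sec' style used in this mail."""
    return min(durations), max(durations)


if __name__ == "__main__":
    # Placeholder: pass the real test executable and its arguments on the command line.
    cmd = sys.argv[1:] or [sys.executable, "-c", "pass"]
    fastest, slowest = summarize(time_runs(cmd))
    print(f"{fastest:.1f} - {slowest:.1f} sec")
```

Reporting the range rather than a single average keeps the variance visible, which matters when the differences being compared are under a second.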

I have not (yet) profiled the testcase.
The time would go into
- loading the exe (it is already pre-compiled), starting it in the 
debugger, and running to a breakpoint
   (the testcase contains an fpdebug debugger that loads and debugs a 
real exe)
- pre-parsing the debug info
- looking up info for each watch
- formatting the watch as string
- in the test, compare or regex-match the string
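
Short of a real profiler run, a coarse breakdown over the steps listed above could be collected with per-phase timers. The following is a sketch only, written in Python rather than in the Pascal of the actual testcase; the names `phase`, `totals` and `shares` are made up for illustration and nothing like this exists in the test today.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per phase name.
totals = defaultdict(float)


@contextmanager
def phase(name):
    """Add the time spent inside the `with` block to the total for `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[name] += time.perf_counter() - start


def shares():
    """Return each phase's share of the total measured time, in percent."""
    total = sum(totals.values())
    if total == 0:
        return {}
    return {name: spent / total * 100.0 for name, spent in totals.items()}


if __name__ == "__main__":
    # Stand-in workloads; the real phases would wrap the corresponding test code.
    with phase("format watch as string"):
        "".join(str(i) for i in range(100000))
    with phase("compare or regex-match"):
        sum(1 for i in range(100000) if i % 7 == 0)
    for name, percent in shares().items():
        print(f"{name}: {percent:.0f}%")
```

Timers like this add overhead of their own when wrapped around something called 70k times, so the shares would only be a first hint about where to point a proper profiler.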

The first 2 steps do not depend on the number of watches. The majority 
of the time goes (should go) into the last 3 steps.
I could imagine, though as I said I have not profiled it yet, that 
string processing takes a noticeable share.
If so, that might be part of why the rtl optimization level has such