[fpc-devel] profiling under windows
Jonas Maebe
jonas.maebe at elis.ugent.be
Fri Nov 20 19:21:32 CET 2009
On 20 Nov 2009, at 19:03, Sergei Gorelkin wrote:
> I did, but using Linux+valgrind rather than cygwin+gprof. IMHO valgrind (in its callgrind flavour) outputs more useful profile information.
> Some time ago I was able to optimize away about 20% of executed CPU instructions in the compiler, which however didn't decrease its execution time by any noticeable amount. So, going for another 20% will be a much more complicated task, beware.
Most of the time spent in the compiler is waiting for memory. Below are some numbers I collected with Shark (a sampling-based profiler for Mac OS X) when compiling the compiler with itself and with DWARF debug info (DWARF adds a lot of individual data elements to the assembler output) on Mac OS X/i386:
Before r14137:
6.1% ppn19sl AGGAS_TGNUASSEMBLER_$__WRITETREE$TASMLIST
5.7% ppn19sl SYSTEM_SYSGETMEM_FIXED$LONGWORD$$POINTER
3.5% libSystem.B.dylib __bzero
2.7% ppn19sl CCLASSES_TFPHASHLIST_$__INTERNALFIND$LONGWORD$SHORTSTRING$LONGINT$$LONGINT
2.6% ppn19sl SYSTEM_TOBJECT_$__CLEANUPINSTANCE
2.2% libSystem.B.dylib __memcpy
2.1% ppn19sl SYSTEM_SYSFREEMEM_FIXED$PFREELISTS$PMEMCHUNK_FIXED$$LONGWORD
1.7% ppn19sl fpc_shortstr_to_shortstr
1.7% ppn19sl SYSTEM_SYSFREEMEM$POINTER$$LONGWORD
1.5% ppn19sl CCLASSES_TLINKEDLIST_$__CLEAR
After r14137:
6.4% ppn19sl SYSTEM_SYSGETMEM_FIXED$LONGWORD$$POINTER
4.9% ppn19sl AGGAS_TGNUASSEMBLER_$__WRITETREE$TASMLIST
3.3% libSystem.B.dylib __bzero
2.7% ppn19sl CCLASSES_TFPHASHLIST_$__INTERNALFIND$LONGWORD$SHORTSTRING$LONGINT$$LONGINT
2.6% ppn19sl SYSTEM_TOBJECT_$__CLEANUPINSTANCE
2.3% libSystem.B.dylib __memcpy
2.0% ppn19sl SYSTEM_SYSFREEMEM_FIXED$PFREELISTS$PMEMCHUNK_FIXED$$LONGWORD
1.9% ppn19sl fpc_shortstr_to_shortstr
1.7% ppn19sl CCLASSES_TLINKEDLIST_$__CLEAR
1.6% ppn19sl SYSTEM_SYSFREEMEM$POINTER$$LONGWORD
The only thing that changed in r14137 was adding a prefetch statement to tgnuassembler.writetree (on i386 you have to compile with -Cppentium4 or higher for the prefetch statement to do anything, though). As you can see, that function's share of the samples dropped by 1.2 percentage points, from 6.1% to 4.9% (and the remaining samples no longer mostly occur right after the instruction that loads the assembler instruction type field of the new instruction for the huge case statement).
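To illustrate the idea (this is not the actual r14137 change, which is not shown in this message), here is a minimal sketch in C of prefetching the next element while walking a linked list and dispatching on a type field. It uses the GCC/Clang builtin `__builtin_prefetch` in place of the FPC prefetch statement; the `node` struct, the `insn_kind` enum and `process_list` are hypothetical names made up for the example.

```c
#include <stddef.h>

/* Hypothetical stand-in for an assembler-list entry; not the FPC type. */
enum insn_kind { KIND_LABEL, KIND_INSTRUCTION, KIND_DATA };

struct node {
    struct node *next;
    enum insn_kind kind;   /* the field the big case statement dispatches on */
    int payload;
};

/* Walk the list and dispatch on each node's kind. The successor is
   prefetched before the current node is processed, so that its cache
   line is more likely to be present when the loop reaches it. */
long process_list(const struct node *head)
{
    long total = 0;
    for (const struct node *cur = head; cur != NULL; cur = cur->next) {
        if (cur->next != NULL)
            __builtin_prefetch(cur->next, 0, 1);  /* 0 = read, 1 = low temporal locality */

        switch (cur->kind) {
        case KIND_LABEL:       total += 1;                break;
        case KIND_INSTRUCTION: total += cur->payload;     break;
        case KIND_DATA:        total += 2 * cur->payload; break;
        }
    }
    return total;
}
```

Whether this helps depends on how much work is done per node: the prefetch only pays off if processing the current node takes long enough to cover part of the memory latency of the next one.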
I've also tried to optimize sysgetmem_fixed with some prefetch statements (the numbers above already include those: I only committed them in r14197, but I have had them applied locally for quite a while), but it still takes up quite a bit of time. The prefetches didn't help that much there; they helped more in freemem (I believe it went from 3-4% to its current 1.6-1.7%).
Jonas