[fpc-devel] profiling under windows

Fri Nov 20 19:21:32 CET 2009

On 20 Nov 2009, at 19:03, Sergei Gorelkin wrote:

> I did, but using Linux+valgrind rather than cygwin+gprof. IMHO valgrind (in its callgrind flavour) outputs more useful profile information.
> Some time ago I was able to optimize away about 20% of executed CPU instructions in the compiler, which however didn't decrease its execution time by any noticeable amount. So, going for another 20% will be a much more complicated task, beware.

Most of the time spent in the compiler is spent waiting for memory. Below are some numbers I collected with Shark (a sampling-based profiler for Mac OS X) when compiling the compiler with itself and with DWARF debug info (DWARF adds a lot of individual data elements to the assembler output) on Mac OS X/i386:

Before r14137:
6.1%	ppn19sl	AGGAS_TGNUASSEMBLER_$__WRITETREE$TASMLIST
5.7%	ppn19sl	SYSTEM_SYSGETMEM_FIXED$LONGWORD$$POINTER
3.5%	libSystem.B.dylib	__bzero
2.7%	ppn19sl	CCLASSES_TFPHASHLIST_$__INTERNALFIND$LONGWORD$SHORTSTRING$LONGINT$$LONGINT
2.6%	ppn19sl	SYSTEM_TOBJECT_$__CLEANUPINSTANCE
2.2%	libSystem.B.dylib	__memcpy
2.1%	ppn19sl	SYSTEM_SYSFREEMEM_FIXED$PFREELISTS$PMEMCHUNK_FIXED$$LONGWORD
1.7%	ppn19sl	fpc_shortstr_to_shortstr
1.7%	ppn19sl	SYSTEM_SYSFREEMEM$POINTER$$LONGWORD
1.5%	ppn19sl	CCLASSES_TLINKEDLIST_$__CLEAR

After r14137:
6.4%	ppn19sl	SYSTEM_SYSGETMEM_FIXED$LONGWORD$$POINTER
4.9%	ppn19sl	AGGAS_TGNUASSEMBLER_$__WRITETREE$TASMLIST
3.3%	libSystem.B.dylib	__bzero
2.7%	ppn19sl	CCLASSES_TFPHASHLIST_$__INTERNALFIND$LONGWORD$SHORTSTRING$LONGINT$$LONGINT
2.6%	ppn19sl	SYSTEM_TOBJECT_$__CLEANUPINSTANCE
2.3%	libSystem.B.dylib	__memcpy
2.0%	ppn19sl	SYSTEM_SYSFREEMEM_FIXED$PFREELISTS$PMEMCHUNK_FIXED$$LONGWORD
1.9%	ppn19sl	fpc_shortstr_to_shortstr
1.7%	ppn19sl	CCLASSES_TLINKEDLIST_$__CLEAR
1.6%	ppn19sl	SYSTEM_SYSFREEMEM$POINTER$$LONGWORD

The only thing that changed in r14137 was the addition of a prefetch statement to tgnuassembler.writetree (on i386 you have to compile with -Cppentium4 or higher for the prefetch statement to do anything, though). As you can see, that function's share of the samples dropped by 1.2 percentage points, from 6.1% to 4.9% (and the remaining samples no longer mostly occur right after the instruction that loads the assembler instruction type field of the new instruction for the huge case statement).
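
To illustrate the general idea (this is not the actual compiler code, and it is in C rather than Pascal), here is a minimal sketch of prefetching the next element while walking a linked list and dispatching on each element's type. The node type, its fields and walk_list are hypothetical names made up for the example; __builtin_prefetch is the GCC/Clang intrinsic, used here as a stand-in for the prefetch statement mentioned above.

```c
#include <stddef.h>

/* Hypothetical node type standing in for an entry of an assembler list. */
typedef enum { NODE_INSTRUCTION, NODE_LABEL, NODE_DATA } node_kind;

typedef struct node {
    node_kind kind;      /* the field the dispatch below has to load first */
    int payload;
    struct node *next;
} node;

/* Walk the list and dispatch on each node's kind. While the current node
   is being processed, ask the CPU to start loading the next one, so that
   its kind field is more likely to be in cache when the loop gets there. */
long walk_list(const node *head)
{
    long total = 0;
    for (const node *cur = head; cur != NULL; cur = cur->next) {
        if (cur->next != NULL)
            __builtin_prefetch(cur->next, 0, 3); /* read access; only a hint */

        switch (cur->kind) {
        case NODE_INSTRUCTION: total += cur->payload;     break;
        case NODE_LABEL:       total += 1;                break;
        case NODE_DATA:        total += 2 * cur->payload; break;
        }
    }
    return total;
}
```

The prefetch is only a hint and does not change the result. Whether it helps depends on how much work is done per node and on whether the target CPU has a prefetch instruction at all, so it has to be measured, as with the numbers above.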

I've also tried to optimize sysgetmem_fixed with some prefetch statements (the numbers above already include those: although I only committed them in r14197, I had already been running with them applied locally for quite a while), but it still takes up quite a bit of time. The prefetches didn't help that much there; they helped more in freemem (I believe it went from 3-4% to its current 1.6-1.7%).

Jonas