[fpc-devel] Class field reordering
fpc at erfurth.eu
Mon Jul 16 14:32:59 CEST 2012
On 16.07.12 09:22, Skybuck Flying wrote:
> I also wonder how much of an optimization it actually is ? Maybe
> 0.000001% more performance ?
Cache-related optimizations are VERY hard to measure and depend on the
overall context and the architecture used. But as the L1 cache is one of
the most performance-critical parts of today's CPUs, the gains from
working with cache-friendly structures should not be underestimated.
There are a couple of things that need to be taken into account.
1.) Cacheline Utilization: Packing multiple smaller items together into
single (machine) words allows for better utilization of precious cache
space. As the L1 D-cache is usually only 16-32 KB these days, every byte
counts, because saving a single byte can make the difference between
using one or two cache lines, which in turn will save memory bandwidth
and save you from a cache-miss related stall down the line.
2.) Cacheline Streaming: unless your memory bus is as wide as your
cacheline, it takes multiple cycles to fetch a whole cacheline. The
pipeline has to stall until the data in question arrives. If the
relevant data sits at the end of a cacheline, you will have to wait for
the whole transaction to complete. So it makes sense to order fields by
order of access, so that your first miss will hopefully lead to only the
minimal stall time.
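To make point 1 a bit more concrete, here is a minimal sketch in C (not
FPC code, and not what the compiler does internally). The struct names,
the field names and the helper lines_needed() are made up for
illustration. It only shows that sorting fields by size removes
alignment padding, so the same data occupies fewer bytes and therefore
fewer cache lines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record in declaration order: every 1-byte flag is
   followed by padding so that the next 8-byte counter stays aligned. */
struct Declared {
    uint8_t  flag_a;
    uint64_t count_a;
    uint8_t  flag_b;
    uint64_t count_b;
    uint8_t  flag_c;
};

/* The same fields sorted by descending size: the three flags now share
   one word, and only tail padding remains. */
struct Reordered {
    uint64_t count_a;
    uint64_t count_b;
    uint8_t  flag_a;
    uint8_t  flag_b;
    uint8_t  flag_c;
};

/* Number of cache lines covered by `bytes` bytes of data, assuming the
   data starts on a cache-line boundary. */
size_t lines_needed(size_t bytes, size_t line_size)
{
    assert(line_size > 0);
    return (bytes + line_size - 1) / line_size;
}
```

On a typical 64-bit ABI I would expect sizeof(struct Declared) to be 40
and sizeof(struct Reordered) to be 24, but the exact numbers depend on
the target's alignment rules. For an array of such records,
lines_needed(n * sizeof(struct Reordered), 64) then gives the number of
64-byte lines touched when walking the array.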
While modern CPUs can circumvent or hide some of the cache-miss latency
with the help of prefetching, out-of-order execution and
Hyper-Threading, a miss will still lead to a performance penalty. For
CPUs without these features (like most ARM cores) this penalty can
become substantial, leading to 100 or more stall cycles per cache miss.
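Coming back to point 2, the effect of field order on a line fill can be
sketched like this in C. This is a simplified model under stated
assumptions: the object starts on a cache-line boundary, the line is 64
bytes, and the line is filled front to back in bus-width chunks. Real
hardware may deliver the requested word first, in which case the
difference shrinks. The struct names and transfer_index() are made up
for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size in bytes */

/* Hypothetical layout with the rarely used data in front: the
   frequently read ref_count ends up near the end of the line. */
struct ColdFirst {
    char name[56];   /* rarely read */
    int  ref_count;  /* read on almost every access */
};

/* The same fields with the frequently read one moved to the front. */
struct HotFirst {
    int  ref_count;
    char name[56];
};

/* Index (starting at 0) of the bus transfer that delivers the byte at
   `offset`, assuming a front-to-back fill in bus_width-byte chunks. */
size_t transfer_index(size_t offset, size_t bus_width)
{
    assert(bus_width > 0);
    return (offset % CACHE_LINE) / bus_width;
}
```

With an 8-byte bus, transfer_index(offsetof(struct ColdFirst, ref_count), 8)
is the last of the eight transfers, while the same call for
struct HotFirst gives the first one.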