[fpc-devel] Modernising Pascal

Jan Ruzicka jan.ruzicka at comcast.net
Sat Feb 26 04:10:58 CET 2005


Enough, guys.

Each camp can make its own distinct implementation.
Use this forum to discuss an interface.

Let the results speak for themselves.
Let's discuss test code.
Let's discuss benchmark code.

Instead of discussing a bunch of what-ifs,
let's see how the implementation does.
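
For instance, a starting point could be as trivial as this rough sketch
(the block size and count here are arbitrary placeholders, not a
proposed standard):

program allocbench;
{$mode objfpc}
{ Times a burst of small heap allocations and frees. }
uses SysUtils, DateUtils;
const
  NumAllocs = 1000000;
var
  ptrs: array[0..NumAllocs - 1] of Pointer;
  i: Integer;
  t0: TDateTime;
begin
  t0 := Now;
  for i := 0 to NumAllocs - 1 do
    GetMem(ptrs[i], 32);          { allocate a small block }
  for i := 0 to NumAllocs - 1 do
    FreeMem(ptrs[i]);             { and release it again }
  WriteLn('alloc+free of ', NumAllocs, ' blocks: ',
          MilliSecondsBetween(Now, t0), ' ms');
end.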

Is there some wiki[1] where we can collect the ideas mentioned so far?

Jan

[1] (like www.twiki.org or http://c2.com/cgi/wiki)

On Feb 25, 2005, at 13:11, Sebastian Kaliszewski wrote:

> On Friday 25 February 2005 17:29, Eric Grange wrote:
>>> 1. This is not "entirely true" (the cost is at best logarithmic in
>>> the number of objects, or your allocator has terrible fragmentation)
>>
>> Not really, you can use a memory map and achieve much lower
>> fragmentation than classic memory managers
>
> 1. A memory map brings no guarantees of constant time.
> 2. The same applies to GC managers. BTW, a GC can use the very same
> allocation/deallocation algorithms.
>
>> (cf. "FastMM" submissions
>> in the FastCode project in the b.p.d.basm borland ng).
>> In practice, your memory usage will be a lot lower than that
>> of a GC.
>
> You're forgetting internal fragmentation. A typical allocator hands out
> blocks in power-of-2 sizes (4, 8, 16, 32, ... up to the page size or a
> small multiple of the page size); assuming a flat size distribution you
> get ~33% fragmentation.
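>
> To spell the arithmetic out: a request uniformly distributed in
> (2^(k-1), 2^k] averages 0.75 * 2^k bytes of payload but is served a
> 2^k-byte block, so 0.25 * 2^k is wasted on average -- 25% of the
> allocated memory, i.e. 0.25/0.75 = ~33% overhead on top of the data
> actually used.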
>
>>
>>> 2. You have to initialise your allocated data anyway.
>>
>> That's not a practical issue because:
>> - initialization may be a task of the user's code, thus doesn't
>>    fall to the memory manager
>
> But it must happen anyway.
>
>> - you can use virtual memory facilities and have the initialization
>>    to zeroes happen automagically, no need to trash the CPU cache
>>    by explicitly filling with zeroes.
>
> No way. It does not. At best it's delayed up to the point when the
> memory actually gets used rather than merely allocated. But then the OS
> has to clear the page.
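>
> This is easy to observe (a rough sketch -- the block size is an
> arbitrary choice): allocate a big block and touch every page twice.
> The first pass pays for the mapping and clearing, the second doesn't:
>
> program zerotouch;
> {$mode objfpc}
> uses SysUtils, DateUtils;
> const
>   Sz = 256 * 1024 * 1024;       { 256 MB }
> var
>   p: PChar;
>   i: PtrUInt;
>   t0: TDateTime;
> begin
>   t0 := Now;
>   GetMem(p, Sz);                { the allocation itself returns quickly }
>   WriteLn('GetMem:       ', MilliSecondsBetween(Now, t0), ' ms');
>   t0 := Now;
>   i := 0;
>   while i < Sz do               { first touch: OS maps and clears pages }
>   begin
>     p[i] := #1;
>     Inc(i, 4096);
>   end;
>   WriteLn('first touch:  ', MilliSecondsBetween(Now, t0), ' ms');
>   t0 := Now;
>   i := 0;
>   while i < Sz do               { second touch: pages already mapped }
>   begin
>     p[i] := #2;
>     Inc(i, 4096);
>   end;
>   WriteLn('second touch: ', MilliSecondsBetween(Now, t0), ' ms');
>   FreeMem(p);
> end.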
>
>>
>>> 3. In real life often you have virtual memory system underneath and
>>> there is no guarantee whatosoever.
>>
>> The management costs of the VM system are typically negligible
>
> Negligible but not constant (they're mostly linear -- memory must be
> cleared before a process may use it, due to basic security
> requirements).
>
>> if you
>> don't enter the realm of disk swapping, at which point the performance
>> of GC schemes completely collapses during garbage collections
>
> Modern generational GCs are less susceptible to such problems.
>
>> and
>> compactions,
>
> Not every GC is compacting GC.
>
>> while a classic allocator can live perfectly happily with
>> part of its memory swapped out.
>>
>>> GC allocation (in the case of a compacting collector) is trivially
>>> constant time,
>>
>> Allocation is always constant time whatever the allocator, if you do it
>> right,
>
> Nope. See above & below.
> And even if you artificially restrict yourself to a small span of small
> object sizes, you still have to check quite a bunch of conditions, and
> conditional branches are not free. It's always an order of magnitude or
> two slower than adding a constant to a heap pointer.
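>
> That fast path is essentially just this (a minimal sketch of a
> compacting GC's allocator -- a real one would trigger a collection
> instead of failing):
>
> var
>   HeapPtr, HeapEnd: PByte;
>
> function BumpAlloc(Size: PtrUInt): Pointer;
> begin
>   { one compare and one add -- nothing to search, nothing to split }
>   if PtrUInt(HeapPtr) + Size > PtrUInt(HeapEnd) then
>   begin
>     Result := nil;              { here a real GC would collect/compact }
>     Exit;
>   end;
>   Result := HeapPtr;
>   Inc(HeapPtr, Size);
> end;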
>
>
>> that's not a comparison point, however, GC linear allocation isn't
>> CPU-cache friendly, meaning that if you get that memory fast, the
>> first actions you'll do on that memory will typically be an order of
>> magnitude slower than if you had been returned cached memory
>> (particularly noticeable for allocations of a few hundred kB of
>> temporaries, string concatenations, growing arrays, etc.).
>
> 1. Ever heard of memory prefetch? You can do it explicitly, but most
> current CPU microarchitectures do it implicitly for simple access
> patterns -- and a linear walk is one of the simplest; see the sketch
> after this list.
> 2. Sorry, but if you allocate a few hundred kB or more at once it won't
> be cached anyway, regardless of the allocator.
> 3. Temporaries should come from the stack anyway.
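>
> A quick way to watch the hardware prefetcher at work (a rough sketch --
> the array size is an arbitrary choice, just well beyond any cache):
> walk the same array linearly and through a shuffled index and compare:
>
> program walkbench;
> {$mode objfpc}
> uses SysUtils, DateUtils;
> const
>   N = 1 shl 22;                 { 4M LongWords, ~16 MB }
> var
>   data, idx: array of LongWord;
>   i, j, tmp: LongWord;
>   sum: QWord;
>   t0: TDateTime;
> begin
>   SetLength(data, N);
>   SetLength(idx, N);
>   for i := 0 to N - 1 do
>     idx[i] := i;
>   for i := N - 1 downto 1 do    { Fisher-Yates shuffle of the indices }
>   begin
>     j := Random(i + 1);
>     tmp := idx[i]; idx[i] := idx[j]; idx[j] := tmp;
>   end;
>   sum := 0;
>   t0 := Now;
>   for i := 0 to N - 1 do        { linear walk: prefetcher-friendly }
>     Inc(sum, data[i]);
>   WriteLn('linear: ', MilliSecondsBetween(Now, t0), ' ms');
>   t0 := Now;
>   for i := 0 to N - 1 do        { scattered walk: mostly cache misses }
>     Inc(sum, data[idx[i]]);
>   WriteLn('random: ', MilliSecondsBetween(Now, t0), ' ms (sum=', sum, ')');
> end.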
>
>
>> Another drawback of linear allocation is interleaving, which often
>> reduces the cache efficiency by placing heterogeneous data together
>> (depends on the application of course, but it can hit you rather
>> severely on some composite data structures).
>
> Sorry, but this goes the opposite way. Randomly scattered allocations
> are worse than using a large compact block. Those who have to optimise
> their data for cache use know that, and allocate large memory pools and
> then arrange the data the way they want. Cache colouring and the like
> are easier when you do not have to deal with stuff scattered
> everywhere.
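>
> For instance (a toy sketch -- sizes and types are arbitrary): carving
> linked-list nodes out of one pool keeps them contiguous, so walking the
> list stays cache- and prefetcher-friendly:
>
> program poolnodes;
> {$mode objfpc}
> type
>   PNode = ^TNode;
>   TNode = record
>     Value: LongInt;
>     Next: PNode;
>   end;
> var
>   pool: array of TNode;         { one compact block for all nodes }
>   i: Integer;
> begin
>   SetLength(pool, 100000);
>   for i := 0 to High(pool) - 1 do
>     pool[i].Next := @pool[i + 1];  { neighbours sit next to each other }
>   pool[High(pool)].Next := nil;
>   WriteLn('head at ', PtrUInt(@pool[0]));
>   { contrast with New(node) per element, which scatters the nodes }
> end.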
>
> This is one of the *advantages* of compacting allocators: they allocate
> with no internal fragmentation, and after compacting the frequently
> used parts are close together; with just a somewhat smart design of the
> compaction algorithm, things used together are also placed together.
> This is certainly good for the cache, not bad (it's obvious when you
> look at a typical CPU cache algorithm -- it minimises the number of
> cache-line conflicts).
>
>
>>
>>> all the cost is in compacting which is linear.
>>
>> Compacting might be linear, but it's cache unfriendly, which on a
>> modern CPU means an order of magnitude slower memory operations, and
>> that happens practically on the whole allocated memory.
>
> Prefetch is your friend. Also, a GC can be made so that not every
> collection has to hit the entire memory. And collection is often
> incremental (interleaved with user code execution).
>
>
>> If some of it was
>> swapped out, you can then expect hell (I've seen it happen), and as
>> compaction is a necessity of linear allocation, you can never be 100%
>> sure to avoid it with a linearly allocating GC.
>
>
> Well, I once had to wait ~half a minute for a single few-hundred-MB
> allocation to finish (that was on a 4 GB RAM, 4-CPU Tru64 box, with
> code compiled by one of the best compilers around performance-wise,
> DEC CXX 6.x). That was explicit allocation of course (so much for
> allocation in constant time). So yes, shit happens.
> The other thing is that especially the early Java implementations had
> primitive & poor GCs. An incremental & generational collector won't try
> to access the entire heap almost at once.
>
>> Finally garbage collection isn't a linear phase, and it's also
>> intrinsically not cache friendly (lots of random accesses all over the
>> allocated memory, most of which are going to be cache misses).
>>
>> Nowadays, cache inefficiency can result in a lot of hidden perf hits,
>> with uncached accesses being 10 or more times slower (from CPU ratios,
>> RAM latencies, controller/chipset latencies, that all add to the CPU's
>> own L1/L2 latencies).
>
> Memory is 100-300 times slower, latency-wise, than L1 cache.
>
>> 10 CPU cycles doesn't sound like much, but that's enough time for a
>> dozen+ instructions under adequate circumstances,
>
> Today's out-of-order CPUs can find a lot of useful work during such
> relatively short periods of time (10 cycles is very little) -- btw,
> even L2 cache hits take longer (of the PC processors only the Pentium M
> has a 10-cycle L2 latency; the others are worse).
>
>
>> have it happen too
>> often, and that's enough to make the difference between smooth and
>> sluggish.
>
>
> You didn't touch the other side. Non-compacting allocation is either
> slow or has significant internal fragmentation; heap-allocating tiny
> objects in particular is very wasteful. You'll run out of the small L1
> cache quickly while large portions of your cache lines contain just
> unused holes.
>
> rgds
> -- 
> Sebastian Kaliszewski
>
> _______________________________________________
> fpc-devel maillist  -  fpc-devel at lists.freepascal.org
> http://lists.freepascal.org/mailman/listinfo/fpc-devel
>




