[fpc-devel] Pointer cache for fast class/pointer access.

Fri Jan 20 11:18:42 CET 2012

-----Original Message----- 
From: Florian Klämpfl
Sent: Wednesday, January 04, 2012 22:20
To: FPC developers' list
Subject: Re: [fpc-devel] Pointer cache for fast class/pointer access.

Am 04.01.2012 19:24, schrieb Hans-Peter Diettrich:
> Skybuck Flying schrieb:
>>
>>
>> -----Original Message----- From: Hans-Peter Diettrich
>> Sent: Tuesday, January 03, 2012 14:56
>> To: FPC developers' list
>> Subject: Re: [fpc-devel] Pointer cache for fast class/pointer access.
>>
>> Skybuck Flying schrieb:
>>
>>> vValue := mDepth1.mDepth2.mDepth3.mValue;
>>
>> "
>> You can implement such a funny hierarchy in any language. So what?
>> "
>>
>> For the performance of the line of code above it matters if TObject or
>> Object is used.
>
> Ah, now I understand what you mean. Nonetheless I don't see sane reasons
> for creating such (deeply nested) structures in real life code - records
> would serve the same (demonstrated) purpose.

"
If one really suffers from cache misses, then using prefetch(...) might
be a good idea. And if data is heavily used, the CPU keeps it in the
caches anyways and then CPU is very good in this. I doubt any serious
perforance where data locked into L1 cache could improve program
performance. Experience showed that any static predictions like this are
much worse than that what the CPU can do at run time.
"

If the cpu is accessing highly random data, or data just once then the cache 
has no effect and would probably be used more effectively by storing data in 
it which the programmer knows for sure will be used again in the future 
execution of the program.

Some possible problems/challenges with prefetching:

1. First of all it probably introduces more execution delay because the 
prefetch instructions will need to be executed themselfes.
2. Second of all introducing prefetch instructions into the instruction 
stream needs highly accurate timing, for example if the compiler were to 
introduce these prefetch instructions then the compiler would need to be 
aware of the delay/time taken by every possible instruction and sum this all 
up to come to some kind of conclusion where to place the prefetch 
instruction, this is also depend on other literally data structures like 
L1/L2/L3 cache latency and finally main memory latency. All of this is 
processor-specific and memory-chip-specific.
At best some average/general mean could be used but would not be optimal, 
also if hardware timings change radically in the future then these prefetch 
instructions would be badly timed. One possible idea would be to introduce a 
compiler option so some kind of delay in nanoseconds (or even better 
picoseconds in case memory chip latency falls below 1 nanosecond, 
alternatively a 64 bit floating point number in seconds could also be used 
perhaps this would give even more range, but could also introduce 
imprecision/doubts/side effects(?) this could also mean the compiler needs 
floating point support, a bit high requirement it seems to maybe using a 32 
bit or 64 bit integer would be easier) could be specified or so and the 
compiler would take that into account.
Another more radical approach would be to insert the prefetch instructions 
into the computer program at runtime, but this would require self modifieing 
code and is probably way to futuristic/complex to program at this moment.

Finally if you do believe prefetching is a good alternative to a pointer 
cache or simply want to try out the effect of prefetching class pointers 
then I challenge you to implement it in a experimental branch of the free 
pascal compiler, so the prefetch idea can be tried out and tested !

One last note about the pointer cache idea, one possible way of introducing 
this idea would be to add some bit to the addressing modes/parameters of 
instructions.
For each parameter it could be specified if it's a pointer to be stored in 
the pointer cache or not. This way the extra bits to encode this information 
is limited to the instructions and doesn't require extra data bits in memory 
like the original pointer idea where the high bit of a pointer was used. 
That original idea is probably a bit too risky/shacky, but this new 
implementation idea is probably possible to some extent. However since x86 
is somewhat a fixed/set in stone instruction set I doubt if it will be 
possible, but perhaps in the future when a 128 bit instruction set is 
introduced there might be some opportunity to introduce these bits into the 
new part of the 128 bit instruction set.

Alternatively it would be somewhat nice if the processor was capable of 
detecting if an instruction parameter is a pointer yes or no, but this then 
still leaves the problem: to cache or not to cache ? This information should 
probably come from the programmer itself, hence these additional parameter 
bits.
However perhaps somebody can come up with a nice way/algorithm for a 
processor to determine if a parameter is pointing toward a class. I do fear 
that such an algorithm will probably depend on a "object address table" 
which would contain memory addresses of the object which are pointed towards 
by the classes. And then the processor would compare the instruction address 
with the object address table and when it matches retrieves the pointer from 
the data cache. But since the object address table is already stored in 
processor memory the pointer cache is double and becomes irrelevant.

Thus another more simply idea would be the following:

Instead of a pointer cache, the processor has an "object address table" 
which can be filled by the program, for example by an api or some special 
instructions.

Each time the processor executes an instruction which needs to retrieve some 
data items from memory the "object address table" is consulted. If the 
address is already in the table with the additional content of the address 
then the content of the address is fetched from the "object address table". 
Which can probably simply be renamed to "Pointer table".

This more or less already represents what the data cache is doing. Except 
there is one difference. The programmer is more in control of the pointer 
table or let's simply call it a "pointer cache" and can fill it with data 
manually. The processor itself has no control over the pointer cache. The 
processor will not trash the pointer cache... instead the processor will 
completely rely on the program/programmer to fill it with the necessary data 
for fast operation. It should also be possible to change the pointer cache 
at runtime of an application/program for adjustment.

This is a pretty cool idea and could probably be extended to more than just 
"pointers".

It could also be called a "user data cache" or "custom data cache".

So the processor's L1 cache and/or L2 cache and/or L3 cache could be split 
up into an "automatic cache" and a "manual cache".

I would also like to note that nvidia's GPUs have a similar concept called 
"shared cache" and allows the programmer control over the cache.
A feature which is actively being used by programmers with varying degrees 
of success !

Bye,
  Skybuck.