[fpc-devel] Pointer cache for fast class/pointer access.

Skybuck Flying skybuck2000 at hotmail.com
Fri Jan 20 11:18:42 CET 2012

-----Original Message----- 
From: Florian Klämpfl
Sent: Wednesday, January 04, 2012 22:20
To: FPC developers' list
Subject: Re: [fpc-devel] Pointer cache for fast class/pointer access.

On 04.01.2012 19:24, Hans-Peter Diettrich wrote:
> Skybuck Flying wrote:
>> -----Original Message----- From: Hans-Peter Diettrich
>> Sent: Tuesday, January 03, 2012 14:56
>> To: FPC developers' list
>> Subject: Re: [fpc-devel] Pointer cache for fast class/pointer access.
>> Skybuck Flying wrote:
>>> vValue := mDepth1.mDepth2.mDepth3.mValue;
>> "
>> You can implement such a funny hierarchy in any language. So what?
>> "
>> For the performance of the line of code above it matters if TObject or
>> Object is used.
> Ah, now I understand what you mean. Nonetheless I don't see sane reasons
> for creating such (deeply nested) structures in real life code - records
> would serve the same (demonstrated) purpose.
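The performance point in the quoted line can be illustrated with a small 
sketch in C (the names Depth1/Rec1 etc. are made up for illustration): with 
class-style heap allocation every "." in the chain is a pointer dereference 
and thus a potential cache miss, while record-style nesting flattens 
everything into one contiguous block read at a constant offset.

```c
/* Class-like: each level is a separately heap-allocated object, so
   d1->d2->d3->value chases three pointers, each a potential cache miss. */
typedef struct Depth3 { int value; } Depth3;
typedef struct Depth2 { Depth3 *d3; } Depth2;
typedef struct Depth1 { Depth2 *d2; } Depth1;

/* Record-like: the nesting is flattened into one contiguous block, so
   r->d2.d3.value is a single load at a compile-time constant offset. */
typedef struct { struct { struct { int value; } d3; } d2; } Rec1;

int read_class_style(const Depth1 *d1) { return d1->d2->d3->value; }
int read_record_style(const Rec1 *r)   { return r->d2.d3.value; }
```

Both functions compute the same thing; the difference is purely in how many 
dependent memory loads the CPU must perform.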

If one really suffers from cache misses, then using prefetch(...) might
be a good idea. And if data is heavily used, the CPU keeps it in the
caches anyway, and the CPU is very good at this. I doubt there is any
serious scenario where locking data into the L1 cache could improve
program performance. Experience has shown that static predictions like
this are much worse than what the CPU can do at run time.

If the CPU is accessing highly random data, or accesses data only once, then 
the cache has no effect; it would probably be used more effectively by 
storing data in it which the programmer knows for certain will be used again 
later in the program's execution.

Some possible problems/challenges with prefetching:

1. First of all, it probably introduces additional execution delay, because 
the prefetch instructions themselves need to be executed.
2. Second of all, inserting prefetch instructions into the instruction 
stream requires highly accurate timing. If the compiler were to insert 
these prefetch instructions, it would need to know the delay/time taken by 
every possible instruction and sum all of this up to come to some kind of 
conclusion about where to place the prefetch instruction. The placement 
also depends on hardware characteristics such as L1/L2/L3 cache latency 
and, finally, main memory latency. All of this is processor-specific and 
memory-chip-specific.
At best some average/general value could be used, but that would not be 
optimal, and if hardware timings change radically in the future these 
prefetch instructions would be badly timed. One possible idea would be a 
compiler option specifying the expected memory latency, in nanoseconds (or 
even better picoseconds, in case memory chip latency falls below 1 
nanosecond; alternatively a 64 bit floating point number in seconds could 
be used, which would give even more range but could also introduce 
imprecision/doubts/side effects(?), and it would require floating point 
support in the compiler, which seems a rather high requirement, so a 32 bit 
or 64 bit integer would probably be easier), and the compiler would take 
that delay into account.
Another, more radical approach would be to insert the prefetch instructions 
into the program at runtime, but this would require self-modifying code and 
is probably far too futuristic/complex to program at this moment.
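To make the timing problem concrete, here is a minimal sketch in C using 
GCC/Clang's __builtin_prefetch (the function name sum_with_prefetch and the 
constant PREFETCH_AHEAD are made up for illustration). The prefetch 
distance is exactly the kind of hardware-dependent value described above: 
too small and the data still misses, too large and the prefetched line may 
be evicted again before it is used.

```c
#include <stddef.h>

/* How many elements ahead to prefetch; the "right" value depends on
   memory latency and the cost of one loop iteration, so it is
   processor- and memory-chip-specific, as discussed above. */
#define PREFETCH_AHEAD 16

long sum_with_prefetch(const long *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            /* args: address, 0 = prefetch for read, 1 = low temporal locality */
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 1);
        total += data[i];
    }
    return total;
}
```

On a hardware prefetcher-friendly linear scan like this one, the explicit 
prefetch often gains nothing; the technique matters more for irregular 
access patterns such as pointer chasing.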

Finally, if you do believe prefetching is a good alternative to a pointer 
cache, or simply want to try out the effect of prefetching class pointers, 
then I challenge you to implement it in an experimental branch of the Free 
Pascal compiler, so the prefetch idea can be tried out and tested!

One last note about the pointer cache idea: one possible way of introducing 
it would be to add some bits to the addressing modes/parameters of 
instructions. For each parameter it could then be specified whether it is a 
pointer to be stored in the pointer cache or not. This way the extra bits 
needed to encode this information are limited to the instructions and don't 
require extra data bits in memory, unlike the original pointer idea where 
the high bit of a pointer was used. That original idea is probably a bit too 
risky/shaky, but this new implementation idea should be possible to some 
extent. However, since x86 is a more or less fixed/set-in-stone instruction 
set, I doubt it will be possible there; perhaps in the future, when a 128 
bit instruction set is introduced, there might be some opportunity to put 
these bits into the new part of that instruction set.

Alternatively it would be somewhat nice if the processor were capable of 
detecting whether an instruction parameter is a pointer or not, but that 
still leaves the problem: to cache or not to cache? This information should 
probably come from the programmer, hence the additional parameter bits 
described above. Perhaps somebody can come up with a nice way/algorithm for 
a processor to determine whether a parameter points to a class instance. I 
do fear that such an algorithm would depend on an "object address table" 
containing the memory addresses of the objects pointed to by the class 
references. The processor would then compare the instruction's address 
operand with the object address table and, on a match, retrieve the pointer 
from the data cache. But since the object address table is already stored 
in processor memory, the pointer cache would be a duplicate and become 
irrelevant.

Thus another, simpler idea would be the following:

Instead of a pointer cache, the processor has an "object address table" 
which can be filled by the program, for example via an API or some special 
instruction.

Each time the processor executes an instruction which needs to retrieve a 
data item from memory, the "object address table" is consulted. If the 
address is already in the table, together with the content at that address, 
then the content is fetched from the "object address table", which could 
probably simply be renamed "pointer table".

This more or less represents what the data cache already does, except for 
one difference: the programmer is more in control of the pointer table (or 
let's simply call it a "pointer cache") and can fill it with data manually. 
The processor itself has no control over the pointer cache and will not 
trash it... instead the processor relies completely on the 
program/programmer to fill it with the data necessary for fast operation. 
It should also be possible to change the pointer cache while the 
application/program is running, for adjustment.
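As a thought experiment, the behaviour described above can be modelled in 
software. The following C sketch (all names are made up for illustration) 
implements a tiny direct-mapped "pointer table" that only the program 
fills; lookups never evict anything, mirroring the idea that the processor 
would not trash the manual cache on its own.

```c
#include <stdint.h>
#include <stddef.h>

#define TABLE_SLOTS 64

typedef struct {
    uintptr_t addr;   /* which address this slot caches (0 = empty) */
    uintptr_t value;  /* cached content of that address */
} PtrEntry;

static PtrEntry ptr_table[TABLE_SLOTS];

static size_t slot_for(uintptr_t addr)
{
    return (addr / sizeof(uintptr_t)) % TABLE_SLOTS;
}

/* The program fills the table explicitly ("manual cache"). */
void ptr_table_fill(const void *addr, uintptr_t value)
{
    size_t s = slot_for((uintptr_t)addr);
    ptr_table[s].addr  = (uintptr_t)addr;
    ptr_table[s].value = value;
}

/* Returns 1 and writes *out on a hit, 0 on a miss; on a miss the
   caller falls back to an ordinary memory load. */
int ptr_table_lookup(const void *addr, uintptr_t *out)
{
    size_t s = slot_for((uintptr_t)addr);
    if (ptr_table[s].addr == (uintptr_t)addr) {
        *out = ptr_table[s].value;
        return 1;
    }
    return 0;
}
```

In real hardware such a structure would of course live in on-chip SRAM; 
the sketch only shows the fill/lookup contract between program and table.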

This is a pretty cool idea and could probably be extended to more than just 
pointers.

It could also be called a "user data cache" or "custom data cache".

So the processor's L1 cache and/or L2 cache and/or L3 cache could be split 
up into an "automatic cache" and a "manual cache".

I would also like to note that NVIDIA's GPUs have a similar concept, called 
"shared memory", which gives the programmer control over part of the 
on-chip memory. It is a feature actively used by programmers, with varying 
degrees of success!


More information about the fpc-devel mailing list