[fpc-devel] Patch, font rendering on Arm-Linux devices.

Thu Feb 28 10:21:31 CET 2008

Micha Nelissen wrote:
> In addition to what the others said, think of it like your 32 bit 
> processor suddenly being a 8 bit processor: it has to manually load 4 
> times 8 bit, arrange them into a 32 bit value, and only then use it. 
> With non packed, it can use the value directly.
On x86 the compiler does not need to generate any additional code, as 
the processor _can_ do misaligned accesses (there are other processors 
that can't and need more code).

If it accesses a misaligned 32 bit value, it does two accesses (not 4): 
e.g. one for 8 bits and one for 24 bits (when reading, each of the two 
accesses fetches a full 32 bits anyway).
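
To make the "two accesses" arithmetic concrete, here is a small sketch 
in C. It only models which aligned 32 bit words a value overlaps; it 
says nothing about what a particular chip really does internally. The 
names words_touched and bits_in_first_word are made up for this 
illustration.

```c
#include <assert.h>
#include <stdint.h>

#define WORD_BYTES 4u /* one aligned 32 bit word */

/* Number of aligned 32 bit words covered by `size` bytes at `addr`. */
unsigned words_touched(uintptr_t addr, unsigned size)
{
    assert(size > 0);
    uintptr_t first = addr / WORD_BYTES;
    uintptr_t last = (addr + size - 1) / WORD_BYTES;
    return (unsigned)(last - first + 1);
}

/* Bits of a 32 bit value at `addr` that lie in the first aligned word;
   the remaining (32 - result) bits come from the following word. */
unsigned bits_in_first_word(uintptr_t addr)
{
    return 8u * (WORD_BYTES - (unsigned)(addr % WORD_BYTES));
}
```

For a 32 bit value at address 3, words_touched(3, 4) gives 2 and 
bits_in_first_word(3) gives 8, so the other 24 bits come from the next 
word. At an aligned address such as 0 a single access is enough.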

But all of this happens inside the core of the chip and is thus _very_ 
fast: the chip contains a (1st level) cache, which is connected to the 
second level cache (also within the chip) by a data path of 128 bits or 
more.

Transferring data to/from the 1st level cache imposes a lot more delay 
than the misaligned access. Thus, if there are many instances of a 
record variable that are used for calculation, it might be much faster 
to use the packed version, since the smaller records cause fewer such 
transfers. If there are only a few, the unpacked version should usually 
be faster.
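
A rough way to see the size effect, sketched in C rather than Pascal. 
The record layout (one byte plus one 32 bit field), the names Unpacked, 
Packed and cache_lines_needed, and the 64 byte cache line are all 
assumptions made for this illustration; #pragma pack is a compiler 
extension (GCC, Clang and MSVC accept it), not standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Natural alignment: usually 3 padding bytes after `tag`,
   so typically 8 bytes per record. */
struct Unpacked { uint8_t tag; uint32_t value; };

/* Packed: no padding, 5 bytes per record, `value` is misaligned. */
#pragma pack(push, 1)
struct Packed { uint8_t tag; uint32_t value; };
#pragma pack(pop)

/* Cache lines needed for `count` contiguous records of `elem_size`
   bytes, assuming the array starts on a line boundary. */
size_t cache_lines_needed(size_t elem_size, size_t count, size_t line_size)
{
    assert(line_size > 0);
    return (elem_size * count + line_size - 1) / line_size;
}
```

With 1000 records and the assumed 64 byte lines, 5 byte records need 79 
lines and 8 byte records need 125. Whether that saving outweighs the 
cost of the misaligned accesses has to be measured on the actual target.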

-Michael