[fpc-devel] Memory consumed by strings

listmember listmember at letterboxes.org
Sun Nov 23 14:45:26 CET 2008


On 2008-11-23 14:34, Mattias Gaertner wrote:
> On Sun, 23 Nov 2008 14:11:50 +0200
> listmember <listmember at letterboxes.org> wrote:

>> That leaves me wondering how much we lose performance-wise in
>> endlessly decompressing UTF-8 data instead of using, say, UCS-4
>> strings.
>
> I'm wondering what you mean by 'endlessly decompressing UTF-8
> data'.

I am referring to going to the nth character in a string. With UTF-8 it
is no longer simple arithmetic plus an index operation. You have to start
from zero and iterate until you reach your character -- at every step
calculating whether the current code point is 1, 2, 3 or 4 bytes long.
Doing this is, in effect, decompression.

> You have to make a compromise between memory, ease of use and
> compatibility. There is no solution without drawbacks.
>
> If you want to process large 8-bit text files, then UTF-8 is better.
> If you want to paint glyphs, then normalized UTF-32 is better.
> If you want Unicode with some memory overhead, reasonably easy usage,
> and compiler support for some compatibility, then UTF-16 is better.

Do we have to think in terms of encodings (which are, in effect, ways of
compressing text) when what we actually mean is 1-byte, 2-byte and
4-byte-per-char strings?
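
For comparison, a sketch of the fixed-width view using FPC's UCS4String
(an array of UCS4Char; UnicodeStringToUCS4String is in the System unit):
the nth code point becomes a plain array index, at the cost of four
bytes per character:

program Ucs4IndexDemo;
{$mode objfpc}{$H+}

var
  U: UCS4String;   // dynamic array of UCS4Char, 4 bytes per code point
begin
  U := UnicodeStringToUCS4String('abcdef');
  // Constant-time access: no scanning, just an array index.
  WriteLn(U[3]);   // 100, the ordinal value of 'd'
end.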


