[fpc-pascal] Unicode file routines proposal

Tue Jul 1 09:56:29 CEST 2008

On Tue, 01 Jul 2008 09:35:35 +0200
Luca Olivetti <luca at ventoso.org> wrote:

> Marco van de Voort wrote:
> >>> They have a UTF-16/UCS-2 internal representation, same as MSEgui
> >>> which works very well and is fast and handy BTW.
> >> And len, slicing, etc. work as expected.
> >> Note that if you need characters beyond $ffff you have to compile
> >> it with wide unicode support, and in that case every character
> >> will use 4 bytes.
> >>
> > That's IMHO a faulty system. It requires you to choose between an
> > incomplete solution or making strings a horrible memory hog.
> 
> OTOH using variable length characters will make string operations 
> expensive (since you can't just multiply the index by 2 or 4 but you 
> have to examine the string from the beginning, and the length in
> bytes isn't the same as the length in characters).

It's amazing that this argument comes up again and again. But I know
hardly any code that needs this index-to-char mapping, and the code
that does need it is seldom time-critical.
(I must admit, I feared the same some years ago. But the extra cost is
practically a myth.)
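
To illustrate, here is a minimal sketch in Python (the byte-level logic
would be the same for a Pascal string holding UTF-8). The helper name
codepoint_offset and the sample string are assumptions made up for this
example. It shows the O(n) scan that an index-to-char lookup needs, and
a substring search that gets by without any such lookup.

```python
def codepoint_offset(data: bytes, index: int) -> int:
    """Return the byte offset of the code point at position `index`.

    Hypothetical helper: walks the UTF-8 bytes from the start, so the
    cost is O(n). Assumes `data` is valid UTF-8.
    """
    if index < 0:
        raise IndexError("negative index")
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes have the form 0b10xxxxxx; any other byte
        # starts a new code point.
        if (byte & 0xC0) != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError("code point index out of range")


# "a", n-tilde, "b", euro sign, "c": 1 + 2 + 1 + 3 + 1 = 8 bytes.
text = "a\u00f1b\u20acc".encode("utf-8")

# The rare case: map a character index to a byte offset by scanning.
euro_offset = codepoint_offset(text, 3)

# The common case: a plain byte search. UTF-8 is self-synchronizing, so
# a valid UTF-8 needle cannot match in the middle of a character.
pos = text.find("\u20ac".encode("utf-8"))

# Slicing at a byte offset found this way keeps the string valid.
head = text[:pos].decode("utf-8")
```

Both routes arrive at the same byte offset, but only the first one had
to count characters. Searching, splitting and concatenating can all work
on byte offsets directly.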

> > But maybe that doesn't
> > matter for mere scripting languages (though I wonder then why they
> > didn't choose UTF-32 directly)
> > 
> > Surrogates are not nice, but they were invented for a reason.
> 
> Well, yes, they're a trade-off between performance and memory 
> consumption, but I fear we're losing one of the advantages that
> pascal has over C: fast and simple string handling.

Most code only needs the number of bytes, and in Pascal that still
costs O(1).
In fact, if a UTF8String or UTF16String were added, I would say it
would be a waste of memory to store an extra PtrInt for the number of
characters.
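
A small sketch of the difference, again in Python and again with a
made-up helper name (count_codepoints). The byte length is stored with
the data and read in O(1), while the character count can be computed on
demand by a single pass over the bytes, so nothing extra needs to be
kept per string.

```python
def count_codepoints(data: bytes) -> int:
    """Count code points in valid UTF-8 by skipping continuation bytes.

    Hypothetical helper: O(n) in the byte length, computed only when a
    caller actually asks for it.
    """
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


# "g", "r", u-umlaut, sharp s, "e": 1 + 1 + 2 + 2 + 1 = 7 bytes.
greeting = "gr\u00fc\u00dfe".encode("utf-8")

byte_length = len(greeting)               # O(1): stored with the data
char_count = count_codepoints(greeting)   # O(n): derived when needed
```

Here the string occupies 7 bytes and holds 5 characters. A buffer copy
or a file write only ever needs the first number.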

Mattias