[fpc-pascal] Unicode file routines proposal

Martin Schreiber fpmse at bluewin.ch
Tue Jul 1 10:23:32 CEST 2008


On Tuesday 01 July 2008 09.56:29 Mattias Gaertner wrote:
> On Tue, 01 Jul 2008 09:35:35 +0200
>
> Luca Olivetti <luca at ventoso.org> wrote:
> > OTOH using variable length characters will make string operations
> > expensive (since you can't just multiply the index by 2 or 4 but you
> > have to examine the string from the beginning, and the length in
> > bytes isn't the same as the length in characters).
>
> It's amazing that this argument comes up again and again. But I know
> hardly any code that needs this index to char mapping. And the code
> that needs it is seldom time critical.
> (I must admit, I feared the same some years ago. But the extra cost is
> practically a myth.)
>
A good example of code that does need it is text layout calculation, where it
is necessary to iterate over characters (glyphs) over and over again. MSEgui
uses widestrings directly; fpGUI converts to widestrings before processing (or
does it use the slow UTF-8 routines?). I once switched MSEgui to UTF-8 because
of the widestring problems in FPC; one or two months later, when I implemented
complex layout calculation with tabulators and justified text, I switched back
to widestrings...
This applies to a GUI framework; an RTL may well have other priorities.
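
To make the index-to-character cost concrete, here is a minimal sketch. It is
written in Python rather than Pascal purely for illustration, and the function
names are hypothetical, not taken from MSEgui, fpGUI or the FPC RTL. Finding
the n-th character in UTF-8 means scanning from the start of the string, while
in UTF-16 with base-plane characters only it is a single multiplication.

```python
def utf8_char_offset(data: bytes, char_index: int) -> int:
    """Return the byte offset of the char_index-th code point in UTF-8 data.

    This is O(n): every byte up to the target has to be inspected.
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Bytes of the form 10xxxxxx are continuation bytes; any other
        # byte starts a new code point.
        if (byte & 0xC0) != 0x80:
            if seen == char_index:
                return offset
            seen += 1
    raise IndexError("character index out of range")


def utf16_bmp_offset(char_index: int) -> int:
    """Byte offset in UTF-16, ASSUMING every character is in the base plane.

    This is O(1): two bytes per character, so the index is just doubled.
    """
    return char_index * 2


text = "Grüße, Welt"
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

w_utf8 = utf8_char_offset(utf8, 7)   # scan needed: "ü" and "ß" take 2 bytes each
w_utf16 = utf16_bmp_offset(7)        # no scan needed
```

A layout routine that walks forward through the string once does not pay this
scan for every character; the cost shows up when the code jumps to arbitrary
character positions again and again.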

>
> Most code only needs the number of bytes. And this still costs O(1)
> in Pascal.
> In fact if a UTF8String or UTF16String would be added, then I would
> say, it would be a waste of memory to store an extra PtrInt for the
> number of characters.
>
Agreed.
I think the best compromise for a GUI framework is reference-counted
widestrings, where normally physical index = code point index. If one needs
characters that are not in the base plane, one must use surrogate pairs and
more complicated, slower processing. I assume this will seldom be needed.
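
The "more complicated and slower processing" for surrogate pairs can be
sketched as follows. Again this is Python used only for illustration, with a
hypothetical function name; it is not how any particular framework does it. As
long as no surrogates occur, each 16-bit unit is one code point; a high
surrogate followed by a low surrogate is combined into one code point above
U+FFFF.

```python
def iter_code_points(units):
    """Yield code points from a sequence of UTF-16 code units (integers)."""
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Surrogate pair: 10 bits from each half, offset by 0x10000.
            low = units[i + 1]
            yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            i += 2
        else:
            # Base-plane character. An unpaired surrogate is passed
            # through unchanged here (an assumption of this sketch).
            yield unit
            i += 1


# "A", U+1D11E (musical symbol G clef, outside the base plane), "B"
units = [0x0041, 0xD834, 0xDD1E, 0x0042]
code_points = list(iter_code_points(units))
```

With such text the physical index and the code point index no longer agree:
four 16-bit units above hold only three characters.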

Martin


