[fpc-pascal] Unicode file routines proposal

Tue Jul 1 10:35:00 CEST 2008

On Tue, 1 Jul 2008 10:23:32 +0200
Martin Schreiber <fpmse at bluewin.ch> wrote:

> On Tuesday 01 July 2008 09.56:29 Mattias Gaertner wrote:
> > On Tue, 01 Jul 2008 09:35:35 +0200
> >
> > Luca Olivetti <luca at ventoso.org> wrote:
> > > OTOH using variable length characters will make string operations
> > > expensive (since you can't just multiply the index by 2 or 4 but
> > > you have to examine the string from the beginning, and the length
> > > in bytes isn't the same as the length in characters).
> >
> > It's amazing that this argument comes up again and again. But I know
> > hardly any code that needs this index-to-char mapping. And the code
> > that needs it is seldom time-critical.
> > (I must admit, I feared the same some years ago. But the extra cost
> > is practically a myth.)
> >
> A good example is text layout calculation where it is necessary to
> iterate over characters (glyphs) over and over again. 

Text layout nowadays needs to consider font widths and Unicode specials.
Iterating from character to character should be hardly measurable
compared to this. For example, SynEdit does not yet care much about font
widths and Unicode specials, and the UTF-8 stepping is negligible.
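
To illustrate what "UTF-8 stepping" amounts to, here is a minimal sketch
(in Python for brevity, not FPC code). The names utf8_char_len and
utf8_code_points are made up for this example and do not come from any
library; the sketch also assumes well-formed UTF-8 input. The point is
that the lead byte alone tells you how far to advance, so moving to the
next character is a couple of comparisons and an addition.

```python
def utf8_char_len(lead: int) -> int:
    """Return the length in bytes of a UTF-8 sequence, given its lead byte."""
    if lead < 0x80:            # 0xxxxxxx: single-byte (ASCII)
        return 1
    if lead >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if lead >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if lead >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError("not a UTF-8 lead byte")


def utf8_code_points(data: bytes):
    """Yield (byte_offset, character) pairs, stepping by lead byte.

    Assumes 'data' is well-formed UTF-8 (hypothetical helper, no validation).
    """
    i = 0
    while i < len(data):
        n = utf8_char_len(data[i])
        yield i, data[i:i + n].decode("utf-8")
        i += n


# One character each of 1, 2, 3 and 4 bytes.
sample = "a\u00e9\u20ac\U0001F600".encode("utf-8")
steps = list(utf8_code_points(sample))
for offset, ch in steps:
    print(offset, repr(ch))
```

The byte length (len(sample)) is available directly, while the number
of characters only falls out of the loop; this is the trade-off being
discussed.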

> MSEgui uses
> widestrings directly, fpGUI converts to widestrings before processing
> (or do they use the slow UTF-8 routines?). I once switched MSEgui to
> UTF-8 because of the widestring problems in FPC; one or two months
> later, when I implemented complex layout calculation with tabulators
> and justified text, I switched back to widestrings...
> This applies to a GUI framework; an RTL may have other
> priorities.
> 
> >
> > Most code only needs the number of bytes. And this still costs
> > O(1) in Pascal.
> > In fact if a UTF8String or UTF16String would be added, then I would
> > say, it would be a waste of memory to store an extra PtrInt for the
> > number of characters.
> >
> Agreed.
> I think the best compromise for a GUI framework is reference-counted
> widestrings where normally physical index = code point index. If one
> needs characters which are not in the base plane, one must use
> surrogate pairs and more complicated and slower processing. I assume
> this will be seldom used.

It depends on whether your code solves a special problem or you are
writing a library that should work for everyone. The RTL and FCL should
work for everyone, so they must support full UTF-16, including surrogate
pairs, and cannot use a limited widestring.
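
A small sketch of why "physical index = code point index" breaks down
outside the base plane (again Python for brevity, not FPC code; the
helper names utf16_units and utf16_code_points are invented for this
example, and the input is assumed to be well-formed UTF-16LE). A
character above U+FFFF takes two 16-bit code units, so the unit count
and the code point count diverge.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to store 'text' as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def utf16_code_points(data: bytes) -> int:
    """Count code points in UTF-16LE data.

    Every code point contributes exactly one unit that is NOT a low
    surrogate (0xDC00..0xDFFF), so counting those units is enough.
    Assumes well-formed input (hypothetical helper, no validation).
    """
    count = 0
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "little")
        if not 0xDC00 <= unit <= 0xDFFF:
            count += 1
    return count


bmp = "Gr\u00fc\u00dfe"        # base plane only
astral = "a\U0001D11Eb"        # U+1D11E lies outside the base plane

for text in (bmp, astral):
    data = text.encode("utf-16-le")
    print(utf16_units(text), utf16_code_points(data))
```

For the base-plane string both numbers agree; for the second string the
character at unit index 2 is the low half of a surrogate pair, not "b".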

Mattias