[fpc-devel] Re: enumerators

Marco van de Voort marcov at stack.nl
Tue Nov 16 10:20:35 CET 2010


In our previous episode, Hans-Peter Diettrich said:
> > First you would have to come up with a workable model for s[x] being
> > utf32chars in general that doesn't suffer from O(N^2) performance
> > degradation (read/write)
> 
> Right, UTF-32 or UCS2 were much more useful in computations.

I said s[x] _returning_ UTF-32. I didn't say S was an array of UTF-32 chars.

And no, IMHO UCS-2 as the array structure is unwise, since it covers only a
subset of Unicode, and UTF-32 is too wasteful of memory.
 
> > And for it to be useful, it must be workable for more than only the most
> > basic loops, but also for e.g.
> > 
> > if length(s)>1 then
> >   for i:=1 to length(s)-1 do
> >     s[i]:=s[i+1];
> > 
> > and similar loops that might read/write more than once, and use
> > calculated expressions as the parameter to []
> 
> Okay, essentially you outlined why UTF is not a useful encoding for 
> computation at all.

No, I didn't. I explained why expressing the bulk of your string routines
with array indexing is not a good idea in the first place. But that might be
considered a legacy problem.

> Above loop body results in O(N^2) for every loop type, be counted or
> iterated, on data structures with non-uniform elements.

Yes, but the realisation should be that holding on to array indexing is what
makes it expensive. The problem could be greatly reduced by removing the
array-indexing skeleton from all routines where it is not necessary.
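As a sketch of why (hypothetical helper name, assuming UTF-8 in an AnsiString
and well-formed input; this is not FPC RTL code): walking the string with a
running byte index touches each byte once, whereas a codepoint-indexed s[i]
accessor would have to re-scan from the start on every call, which is exactly
what produces the O(N^2) behaviour.

```pascal
{ Sketch only; Utf8CharLen is an illustrative name, not an FPC API.
  Returns the byte length of the UTF-8 character starting at index I. }
function Utf8CharLen(const S: AnsiString; I: SizeInt): SizeInt; inline;
begin
  case Ord(S[I]) of
    $00..$7F: Result := 1;  { ASCII lead byte }
    $C0..$DF: Result := 2;
    $E0..$EF: Result := 3;
  else        Result := 4;
  end;
end;

{ Counts codepoints in one O(N) pass over the bytes. }
function CountChars(const S: AnsiString): SizeInt;
var
  I: SizeInt;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    Inc(I, Utf8CharLen(S, I)); { advance one character, O(1) per step }
    Inc(Result);
  end;
end;
```

The same loop written with a hypothetical codepoint-indexed s[i] would decode
from the start of the string on each access, turning the pass into O(N^2).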
 
> UTF encodings have their primary use in data storage and exchange with 
> external API's.

And in memory.

> A useful data type and implementation would use a SxCS 
> encoding internally, and accept or supply UTF-8 or UTF-16 strings only 
> when explicitly asked for. All meaningful UTF/MBCS implementations 
> already come with iterators, 

Depends on what you mean by iterators, but yes: a few simple inline routines
will do, one that advances to the next char and one that loads a char.
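A minimal sketch of such a routine, assuming UTF-8 in an AnsiString and
well-formed input (the name and shape are illustrative, not the FPC RTL; here
"load" and "advance" are combined into one helper):

```pascal
{ Load the codepoint starting at byte index I and advance I past it.
  Assumes well-formed UTF-8; no error handling. Sketch only. }
function Utf8Next(const S: AnsiString; var I: SizeInt): Cardinal;
var
  B: Byte;
  Len, K: SizeInt;
begin
  B := Ord(S[I]);
  if B < $80 then      begin Result := B;         Len := 1; end
  else if B < $E0 then begin Result := B and $1F; Len := 2; end
  else if B < $F0 then begin Result := B and $0F; Len := 3; end
  else                 begin Result := B and $07; Len := 4; end;
  for K := 1 to Len - 1 do
    Result := (Result shl 6) or (Ord(S[I + K]) and $3F); { continuation bytes }
  Inc(I, Len);
end;
```

With that, `while I <= Length(S) do C := Utf8Next(S, I);` walks the whole
string in a single linear pass.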

> and only uneducated people would ever try to apply their SBCS-based
> algorithms and habits on MBCS encodings.

That's the main problem. That will have to change anyway. 

> Furthermore I think that in detail Unicode string handling should not be 
> based on single characters at all, but instead should use (sub)strings 
> all over, covering multibyte character representations, ligatures etc. 
> as well

This is dog slow. You can build such a library for special purposes, but for
most day-to-day use it is overkill.

The most common string operations that the average programmer performs are
searching for substrings and then splitting on them, something that can be
done perfectly well in UTF-8.
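This works because UTF-8 is self-synchronizing: a valid encoded sequence can
never appear in the middle of another, so plain byte-oriented Pos/Copy already
search and split correctly on any valid UTF-8 delimiter. A minimal sketch
(assuming the source file and terminal are UTF-8):

```pascal
program SplitDemo;
var
  S, LeftPart, RightPart: AnsiString;
  P: SizeInt;
begin
  S := 'naïve,café';                    { UTF-8 text, multibyte chars }
  P := Pos(',', S);                     { byte index of the delimiter }
  LeftPart  := Copy(S, 1, P - 1);       { 'naïve' }
  RightPart := Copy(S, P + 1, MaxInt);  { 'café' }
  WriteLn(LeftPart);
  WriteLn(RightPart);
end.
```

No character-level decoding is needed at any point; the multibyte characters
pass through the byte-oriented routines untouched.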

> Then the basic operations would be insertion and deletion of 
> substrings, in addition to substring extraction and concatenation.

Basic operations with a capital B, yes: string support like in a BASIC
interpreter.


