[fpc-devel] Re: enumerators

Hans-Peter Diettrich DrDiettrich1 at aol.com
Tue Nov 16 02:41:28 CET 2010

Marco van de Voort wrote:

> First you would have to come up with a workable model for s[x] being
> utf32chars in general that doesn't suffer from O(N^2) performance
> degradation (read/write)

Right, UTF-32 or UCS2 would be much more useful in computations.

> And for it to be useful, it must be workable for more than only the most
> basic loops, but also for e.g.
> if length(s)>1 then
>   for i:=1 to length(s)-1 do
>     s[i]:=s[i+1];
> and similar loops that might read/write more than once, and use
> calculated expressions as the parameter to []

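Okay, essentially you outlined why the variable-width UTF encodings are 
not useful for computation at all. The loop body above results in O(N^2) 
for every loop type, whether counted or iterated, on data structures with 
non-uniform elements, since each access by index, and each write that 
changes an element's width, costs O(N) by itself.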
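
To make the cost concrete, here is a minimal sketch that assumes 
well-formed UTF-8 held in an AnsiString. Utf8ByteOffset and the Steps 
counter are hypothetical names for illustration, not RTL routines. It 
only counts the bytes inspected to locate s[i+1] for every i of the 
quoted loop:

```pascal
program Utf8IndexCost;
{$mode objfpc}{$H+}

{ Hypothetical helper, not part of the RTL: returns the 1-based byte
  offset of the CharIndex-th code point of a well-formed UTF-8 string,
  or 0 if there is no such code point. Steps accumulates the number of
  bytes inspected. }
function Utf8ByteOffset(const S: AnsiString; CharIndex: Integer;
  var Steps: Int64): Integer;
var
  P, Seen: Integer;
begin
  Result := 0;
  Seen := 0;
  for P := 1 to Length(S) do
  begin
    Inc(Steps);
    { Continuation bytes look like 10xxxxxx; any other byte starts a
      new code point. }
    if (Ord(S[P]) and $C0) <> $80 then
    begin
      Inc(Seen);
      if Seen = CharIndex then
        Exit(P);
    end;
  end;
end;

var
  S: AnsiString;
  I, N: Integer;
  Steps: Int64;
begin
  { N copies of U+00E4 (bytes C3 A4): N code points in 2*N bytes. }
  N := 1000;
  SetLength(S, 2 * N);
  for I := 1 to N do
  begin
    S[2 * I - 1] := Chr($C3);
    S[2 * I] := Chr($A4);
  end;

  Steps := 0;
  for I := 1 to N - 1 do
    Utf8ByteOffset(S, I + 1, Steps);  { cost of locating s[i+1] only }

  WriteLn('code points: ', N, ', bytes inspected: ', Steps);
end.
```

Doubling N should roughly quadruple the reported count. The write to 
s[i] is not even modelled here; when the new code point has a different 
byte length than the old one, the tail of the string has to be moved as 
well.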

UTF encodings have their primary use in data storage and exchange with 
external APIs. A useful data type and implementation would use a 
fixed-width (SxCS) encoding internally, and accept or supply UTF-8 or 
UTF-16 strings only when explicitly asked for. All meaningful UTF/MBCS 
implementations already come with iterators, and only uneducated people 
would ever try to apply their SBCS-based algorithms and habits to MBCS 
encodings. At least they'd change their minds soon, after encountering 
the first bugs resulting from such an inappropriate approach.
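
A rough sketch of that idea, with conversion only at the boundaries: 
TCodePoints, DecodeUtf8 and EncodeUtf8 are hypothetical names, the input 
is assumed to be well-formed UTF-8, and no validation is attempted. In 
between, the quoted loop runs on a plain array with O(1) element access:

```pascal
program FixedWidthInside;
{$mode objfpc}{$H+}

type
  TCodePoints = array of LongWord;  { one element per code point }

{ Hypothetical helper: decodes well-formed UTF-8 without validation. }
function DecodeUtf8(const S: AnsiString): TCodePoints;
var
  P, N, Extra, K: Integer;
  B, CP: LongWord;
begin
  SetLength(Result, Length(S));  { upper bound, trimmed at the end }
  N := 0;
  P := 1;
  while P <= Length(S) do
  begin
    B := Ord(S[P]);
    { The lead byte tells how many continuation bytes follow. }
    if B < $80 then
    begin CP := B; Extra := 0; end
    else if B < $E0 then
    begin CP := B and $1F; Extra := 1; end
    else if B < $F0 then
    begin CP := B and $0F; Extra := 2; end
    else
    begin CP := B and $07; Extra := 3; end;
    for K := 1 to Extra do
      CP := (CP shl 6) or (Ord(S[P + K]) and $3F);
    Result[N] := CP;
    Inc(N);
    Inc(P, Extra + 1);
  end;
  SetLength(Result, N);
end;

{ Hypothetical helper: encodes code points back to UTF-8. }
function EncodeUtf8(const A: TCodePoints): AnsiString;
var
  I: Integer;
  CP: LongWord;
begin
  Result := '';
  for I := 0 to High(A) do
  begin
    CP := A[I];
    if CP < $80 then
      Result := Result + Chr(CP)
    else if CP < $800 then
      Result := Result + Chr($C0 or (CP shr 6)) +
        Chr($80 or (CP and $3F))
    else if CP < $10000 then
      Result := Result + Chr($E0 or (CP shr 12)) +
        Chr($80 or ((CP shr 6) and $3F)) + Chr($80 or (CP and $3F))
    else
      Result := Result + Chr($F0 or (CP shr 18)) +
        Chr($80 or ((CP shr 12) and $3F)) +
        Chr($80 or ((CP shr 6) and $3F)) + Chr($80 or (CP and $3F));
  end;
end;

var
  S, T: AnsiString;
  A: TCodePoints;
  I: Integer;
begin
  { 'a', U+00E4, 'b', U+20AC given as raw UTF-8 bytes. }
  S := 'a' + Chr($C3) + Chr($A4) + 'b' + Chr($E2) + Chr($82) + Chr($AC);

  A := DecodeUtf8(S);               { boundary: UTF-8 in }
  if Length(A) > 1 then
    for I := 0 to Length(A) - 2 do  { the quoted loop, 0-based }
      A[I] := A[I + 1];
  T := EncodeUtf8(A);               { boundary: UTF-8 out }

  WriteLn('code points: ', Length(A));
  WriteLn('bytes before: ', Length(S), ', bytes after: ', Length(T));
end.
```

The two conversions are O(N) each and happen once, so the loop in the 
middle stays linear no matter how the index expression is calculated.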

BTW, I just found similarly inappropriate handling of digraphs in the 
scanner, where checks for such character combinations occur in many 
places, with no guarantee that all cases are really covered.

Furthermore, I think that Unicode string handling should not be based on 
single characters at all, but should instead use (sub)strings 
throughout, which also covers multibyte character representations, 
ligatures etc. The basic operations would then be insertion and deletion 
of substrings, in addition to substring extraction and concatenation.
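
As an illustration only, the following sketch replaces a ligature 
without ever indexing a single character. ReplaceFirst is a hypothetical 
helper built from the standard Pos, Delete, Insert and Copy routines, 
and the strings are assumed to hold well-formed UTF-8:

```pascal
program SubstringOps;
{$mode objfpc}{$H+}

{ Hypothetical helper: replaces the first occurrence of OldSub in S by
  NewSub, using only substring search, deletion and insertion. }
function ReplaceFirst(const S, OldSub, NewSub: AnsiString): AnsiString;
var
  P: Integer;
begin
  Result := S;
  P := Pos(OldSub, Result);
  if P > 0 then
  begin
    Delete(Result, P, Length(OldSub));
    Insert(NewSub, Result, P);
  end;
end;

var
  Ligature, S, T: AnsiString;
begin
  { UTF-8 bytes of U+FB01 (LATIN SMALL LIGATURE FI). }
  Ligature := Chr($EF) + Chr($AC) + Chr($81);
  S := 'de' + Ligature + 'ne';

  T := ReplaceFirst(S, Ligature, 'fi');
  WriteLn(T);                             { prints: define }

  { Extraction and concatenation of substrings. }
  WriteLn(Copy(T, 1, 2) + Copy(T, 5, 2)); { prints: dene }
end.
```

Because a well-formed UTF-8 sequence never matches in the middle of 
another character, the search works on the encoded bytes directly, and 
the caller never needs to know how wide the ligature is.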
