[fpc-devel] Re: enumerators
Hans-Peter Diettrich
DrDiettrich1 at aol.com
Tue Nov 16 02:41:28 CET 2010
Marco van de Voort wrote:
> First you would have to come up with a workable model for s[x] being
> utf32chars in general that doesn't suffer from O(N^2) performance
> degradation (read/write)
Right, UTF-32 or UCS-2 would be much more useful for computations.
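To make the cost concrete: with a fixed-width encoding, s[x] is plain offset arithmetic, while with UTF-8 the byte offset of the x-th code point is unknown until all preceding bytes have been scanned. A minimal sketch (Python here purely for illustration, not FPC code; the function name is hypothetical):

```python
# Sketch: why s[i] on a variable-width encoding costs O(i).
# The byte position of code point i in UTF-8 can only be found by
# walking over the i preceding sequences; a fixed-width buffer
# (UCS-2/UTF-32) would compute the offset directly instead.

def utf8_codepoint_at(data: bytes, index: int) -> str:
    """Return the index-th code point by scanning from the start: O(index)."""
    pos = 0
    for _ in range(index):
        first = data[pos]
        if first < 0x80:
            pos += 1            # 1-byte sequence (ASCII)
        elif first < 0xE0:
            pos += 2            # 2-byte sequence
        elif first < 0xF0:
            pos += 3            # 3-byte sequence
        else:
            pos += 4            # 4-byte sequence
    first = data[pos]
    if first < 0x80:
        length = 1
    elif first < 0xE0:
        length = 2
    elif first < 0xF0:
        length = 3
    else:
        length = 4
    return data[pos:pos + length].decode("utf-8")
```

For example, utf8_codepoint_at("aéあ𝄞".encode("utf-8"), 2) must skip one 1-byte and one 2-byte sequence before it can even locate "あ".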
> And for it to be useful, it must be workable for more than only the most
> basic loops, but also for e.g.
>
> if length(s)>1 then
> for i:=1 to length(s)-1 do
> s[i]:=s[i+1];
>
> and similar loops that might read/write more than once, and use
> calculated expressions as the parameter to []
Okay, essentially you have outlined why UTF is not a useful encoding for
computation at all. The above loop body results in O(N^2) behavior for
every loop type, whether counted or iterated, on data structures with
non-uniform elements.
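The quadratic blow-up of the quoted shift loop can be counted directly. A sketch (illustrative Python, not FPC; it assumes a hypothetical string type where accessing code point i costs i+1 scan steps, as in a variable-width encoding):

```python
# Sketch: total scan work done by the quoted loop
#   for i := 1 to length(s)-1 do s[i] := s[i+1];
# when every index access must scan from the start of the string.

def shift_loop_scan_cost(n: int) -> int:
    """Scan steps for the shift loop over an n-element string with O(i) indexing."""
    cost = 0
    for i in range(1, n):       # Pascal's 1 .. n-1
        cost += i + 1           # read s[i+1]: scan i+1 elements
        cost += i               # write s[i]:  scan i elements
    return cost
```

The sum works out to n^2 - 1, so a string ten times longer costs roughly a hundred times more scanning, which is the O(N^2) degradation Marco describes.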
UTF encodings have their primary use in data storage and exchange with
external APIs. A useful data type and implementation would use a SxCS
(fixed-width) encoding internally, and accept or supply UTF-8 or UTF-16
strings only when explicitly asked for. All meaningful UTF/MBCS
implementations already come with iterators, and only uneducated people
would ever try to apply their SBCS-based algorithms and habits to MBCS
encodings. At least they would change their minds soon, after
encountering the first bugs resulting from such an inappropriate approach.
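For comparison, the iterator approach visits every code point of a UTF-8 string in one forward pass, O(N) in total, with no per-access scan. A sketch (again illustrative Python, not an existing FPC API):

```python
# Sketch: linear iteration over a UTF-8 byte string. Each step decodes
# one sequence and advances past it, so the whole traversal is O(N),
# unlike repeated s[i] indexing which rescans the prefix every time.

def iter_utf8(data: bytes):
    """Yield the code points of a UTF-8 byte string in one linear pass."""
    pos = 0
    while pos < len(data):
        first = data[pos]
        if first < 0x80:
            length = 1          # ASCII
        elif first < 0xE0:
            length = 2
        elif first < 0xF0:
            length = 3
        else:
            length = 4
        yield data[pos:pos + length].decode("utf-8")
        pos += length
```

This is the shape of iterator that MBCS-aware libraries provide, and it makes the counted-index habit unnecessary for plain traversal.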
BTW, I just found similarly inappropriate handling of digraphs in the
scanner, where checks for such character combinations occur in many
places, with no guarantee that all cases are really covered.
Furthermore I think that Unicode string handling should not be based on
single characters at all, but should instead use (sub)strings
throughout, covering multibyte character representations, ligatures etc.
as well. Then the basic operations would be insertion and deletion of
substrings, in addition to substring extraction and concatenation.
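A substring-oriented interface of that kind might look as follows (a sketch only; the class and method names are hypothetical, and Python's str stands in for whatever internal representation is chosen):

```python
# Sketch of a substring-based string interface: the four basic operations
# named above (insert, delete, extract, concatenate) all take or return
# whole substrings, so clients never address a single "character" and
# multibyte sequences or ligatures cannot be split by accident.

class SubstringBuffer:
    def __init__(self, text: str = ""):
        self._text = text

    def insert(self, pos: int, sub: str) -> None:
        """Insert substring `sub` before position `pos`."""
        self._text = self._text[:pos] + sub + self._text[pos:]

    def delete(self, pos: int, count: int) -> None:
        """Delete the substring of length `count` starting at `pos`."""
        self._text = self._text[:pos] + self._text[pos + count:]

    def extract(self, pos: int, count: int) -> str:
        """Return the substring of length `count` starting at `pos`."""
        return self._text[pos:pos + count]

    def concat(self, other: "SubstringBuffer") -> "SubstringBuffer":
        """Return a new buffer holding the concatenation."""
        return SubstringBuffer(self._text + other._text)

    def __str__(self) -> str:
        return self._text
```

Whether positions count code points, grapheme clusters, or something else is exactly the design decision such a type would have to pin down.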
DoDi