[fpc-devel] Unicode support (yet again)

Hans-Peter Diettrich DrDiettrich1 at aol.com
Sun Sep 18 21:25:25 CEST 2011


DaWorm wrote:
> On Sun, Sep 18, 2011 at 12:01 PM, Sven Barth
> <pascaldragon at googlemail.com> wrote:
>> On 18.09.2011 17:48, DaWorm wrote:
> 
> But isn't it O(n^2) only when actually using unicode strings?

All MBCS encodings, having no fixed character size, suffer from that 
problem: finding the i-th character requires scanning the string from 
its start.
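
For illustration, a minimal sketch (the helper name is mine, not FPC 
RTL code) of why that scan makes a single indexed access linear, and a 
loop over String[i] quadratic:

  function Utf8ByteIndexOfCodePoint(const S: AnsiString;
    CP: SizeInt): SizeInt;
  var
    B: SizeInt;
  begin
    B := 1;
    // Reaching code point CP means walking over every preceding
    // code point, so a single access costs O(CP).
    while (CP > 1) and (B <= Length(S)) do
    begin
      Inc(B);  // step over the lead byte
      while (B <= Length(S)) and ((Ord(S[B]) and $C0) = $80) do
        Inc(B);  // step over continuation bytes ($80..$BF)
      Dec(CP);
    end;
    Result := B;
  end;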

> Wouldn't you also be able to do something like String.Encoding := Ansi
> and then all String[i] accesses would be O(n) + x (where x is the
> overhead of run time checking that it is safe to just use a memory
> offset, presumably fairly short)? Of course it would be up to the user
> to choose to reencode some string he got from the RTL or FCL that way
> and understand the consequences.

Calling subroutines for indexed access, instead of direct array access, 
will add another factor (10..100?) to the cost of single character 
access, due to register save/restore and the optimizations that the 
call boundary disallows.
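
A rough sketch to observe that overhead (GetTickCount64 and 
StringOfChar come from SysUtils; the ratio varies with CPU and 
optimizer settings, and an aggressive optimizer may elide the unused 
reads entirely):

  function CharAt(const S: AnsiString; I: SizeInt): AnsiChar;
  begin
    Result := S[I];  // the same access, but behind a call boundary
  end;

  procedure Bench;
  var
    S: AnsiString;
    I: SizeInt;
    C: AnsiChar;
    T0: QWord;
  begin
    S := StringOfChar('x', 10000000);
    T0 := GetTickCount64;
    for I := 1 to Length(S) do
      C := S[I];             // direct indexing, easily optimized
    WriteLn('direct: ', GetTickCount64 - T0, ' ms');
    T0 := GetTickCount64;
    for I := 1 to Length(S) do
      C := CharAt(S, I);     // one call per character
    WriteLn('call:   ', GetTickCount64 - T0, ' ms');
  end;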

> What assumptions are the typical String[i] user going to make about
> what is returned?  There will be the types that are seeing if the
> fifth character is a 'C' or something like that, and for those there
> probably isn't too much that is going to go wrong, they might have to
> switch to "C" instead, or the compiler can make the 'C' literal a
> "unicode char which is really a string" conversion at compile time.
> There may be the ones that want to turn a 'C' into a 'c' by flipping
> the 6th bit, and that will indeed break, and in a Unicode world,
> perhaps that should break, forcing using LowerCase as needed.

Such a simple upper/lower conversion by bit flipping works only for 
ASCII letters, not for the full range of Ansi chars.
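
A sketch of the safe version: in Latin-1, e.g., 'ÿ' is byte $FF, and 
clearing bit $20 yields $DF, which is 'ß'; the real uppercase Ÿ 
(U+0178) doesn't even exist in that codepage.

  function AsciiUpCase(C: AnsiChar): AnsiChar;
  begin
    if C in ['a'..'z'] then
      Result := AnsiChar(Ord(C) and not $20)  // safe for ASCII only
    else
      Result := C;  // anything else needs codepage/locale tables
  end;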

>  And
> there are those (such as myself) who often use strings as buffers for
> things like serial comms.  That code will totally break if I were to
> try to use a unicode string buffer, but a simple addition of
> String.Encoding := ANSI or RawByteString or ShortString in the first
> line would fix that, or I could bite the bullet and recode that quick
> and dirty code the right way. 

Delphi introduced TBytes for non-character byte data.
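
A minimal sketch of such a buffer; ReadFromPort stands in for whatever 
the serial library actually provides:

  // TBytes is declared in SysUtils as "array of Byte".
  procedure ReceiveBlock;
  var
    Buf: TBytes;
    N: Integer;
  begin
    SetLength(Buf, 4096);
    N := ReadFromPort(@Buf[0], Length(Buf));  // hypothetical read
    SetLength(Buf, N);  // keep only the bytes actually received
  end;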

> My point is that trying to keep the bad
> habits of a single byte string world in a unicode world is
> counterproductive.  They aren't the same, and all attempts to make
> them the same just cause more problems than they solve.

That's why I still suggest using UTF-16 in user code. As long as the 
user skips over all chars he doesn't recognize, nothing can go wrong.
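
E.g. a sketch of such a scan: surrogate halves ($D800..$DFFF) can 
never equal an ASCII code unit, so leaving every unrecognized WideChar 
alone is safe.

  function CountBackslashes(const S: UnicodeString): Integer;
  var
    I: Integer;
  begin
    Result := 0;
    for I := 1 to Length(S) do
      if S[I] = '\' then  // all other code units are skipped untouched
        Inc(Result);
  end;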

> As for the RTL and FCL, presumably they wouldn't be doing any of this
> String[i] stuff in the first place, would they? So they aren't going to
> suffer that speed penalty.  Just because one type of code is slow,
> doesn't mean everything is slow.

It's absolutely safe, even with UTF-8 strings, to e.g. search for all 
'\' separators and to replace them in place with '/'. It's also safe 
to search for a set of (ASCII) separator chars and to split strings at 
these positions (e.g. CSV). Bytewise case-insensitive comparison also 
works for all encodings, at least for equality. Other comparisons are 
much slower, due to the required lookup of the sort-order values (maybe 
alphabetic, dictionary etc.), and that holds for every encoding. Even 
with ASCII there exists a choice of sorting 'a' like 'A', after 'A', or 
after 'Z'.
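
A sketch of the in-place replacement: it is safe because in UTF-8 
every byte of a multi-byte sequence has its high bit set, so a byte 
below $80 is always a complete ASCII character.

  procedure SlashifyUtf8(var S: AnsiString);  // S assumed to hold UTF-8
  var
    I: SizeInt;
  begin
    for I := 1 to Length(S) do
      if S[I] = '\' then
        S[I] := '/';
  end;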

DoDi



