[fpc-devel] String and UnicodeString and UTF8String
DrDiettrich1 at aol.com
Wed Jan 12 12:46:41 CET 2011
>> ...: the new ansistring type has a hidden "element size" field (in
>> addition to the reference count, length and codepage), and from what I
>> can see at page 10 of
>> Delphi 2009's unicodestring is simply an ansistring(1200).
> So it seems that if we have some "GenericString" with the properties
> "reference count", "size", "character width" and "codepage", then all
> other string types can be based on this string type. The other strings
> would only be "shortcuts" that internally use the same structure:
> AnsiString = GenericString(with actual system ANSI code page (0) ... or
> ... without any explicit codepage ($ffff))
> UTF8String = GenericString(with UTF-8 encoding)
> UnicodeString = GenericString(with UTF-16 encoding)
Nice from a management point of view, but it results in an ugly
implementation. Apart from the generic form of the (internal) subroutines,
we would still need explicit code for most variations. Also, translation
tables for *all* codepages would have to become part of every executable.
A truly polymorphic string class (or equivalent) would perform better,
and would allow only the codepages that are actually used to be linked
into an application. Such an implementation could add one more field, a
VMT pointer, to the string prefix, and UnicodeString could be implemented
by a simple type cast from any (generic) string reference to a class
reference.
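To make the polymorphic idea concrete, here is a rough sketch. It is in
Python rather than Pascal, purely for illustration, and it is not a
proposal for the RTL. The class names mirror the types in this thread,
but every method (from_text, to_text, length, convert) is a hypothetical
name. The codepage numbers 65001 and 1200 are the Windows identifiers for
UTF-8 and UTF-16LE; $ffff stands for "no explicit codepage" as above.

```python
class GenericString:
    """Hypothetical base class: raw storage plus per-encoding behaviour."""
    codepage = 0xFFFF      # no explicit codepage
    elem_size = 1          # bytes per code unit
    encoding = "latin-1"   # assumption: raw bytes map 1:1 to characters

    def __init__(self, data: bytes):
        self.data = data

    @classmethod
    def from_text(cls, text: str) -> "GenericString":
        return cls(text.encode(cls.encoding))

    def to_text(self) -> str:
        return self.data.decode(self.encoding)

    def length(self) -> int:
        # Length in code units, as a string prefix would store it.
        return len(self.data) // self.elem_size


class UTF8String(GenericString):
    codepage = 65001
    encoding = "utf-8"


class UnicodeString(GenericString):
    codepage = 1200
    elem_size = 2
    encoding = "utf-16-le"


def convert(s: GenericString, target: type) -> GenericString:
    # Dispatch goes through the classes themselves, so only encodings
    # whose classes are actually referenced need to be present.
    return target.from_text(s.to_text())


u = UTF8String.from_text("Gr\u00fc\u00dfe")   # "Grüße"
w = convert(u, UnicodeString)
print(u.codepage, u.length())   # 7 code units in UTF-8
print(w.codepage, w.length())   # 5 code units in UTF-16
```

The point of the sketch is only that the conversion code hangs off the
concrete type, so an application that never touches a given codepage
never needs its tables.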
> Where there is no agreement is on what the default string
> encoding should be (AnsiString($ffff) or UTF-8 or UTF-16 or UTF-32)
The default (internal) string type must be a UTF type, otherwise losses
are inevitable during (implicit) conversions. This means that an SBCS
AnsiString can never become the default encoding.
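A small illustration of that loss, sketched in Python for brevity. The
helper name round_trip and the choice of code page 1252 as the
single-byte example are my own assumptions; any SBCS shows the same
effect for characters outside its repertoire.

```python
# "Grüße, Ελλάδα": Latin letters that exist in cp1252, plus Greek ones
# that do not.
text = "Gr\u00fc\u00dfe, \u0395\u03bb\u03bb\u03ac\u03b4\u03b1"


def round_trip(s: str, encoding: str) -> str:
    # errors="replace" substitutes "?" for anything the target encoding
    # cannot represent, much like an implicit conversion would.
    return s.encode(encoding, errors="replace").decode(encoding)


via_cp1252 = round_trip(text, "cp1252")    # Greek letters are lost
via_utf8 = round_trip(text, "utf-8")       # lossless
via_utf16 = round_trip(text, "utf-16-le")  # lossless

print(via_cp1252)
print(via_utf8 == text, via_utf16 == text)
```

Once the text has passed through the single-byte type, the original
characters cannot be recovered, whichever UTF type it is converted to
afterwards.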
The default type could be made platform dependent, so that UTF-16 would
be used for Windows and UTF-8 for Linux platforms. But this will cause
problems with code that assumes exactly one of these encodings, and uses
indexed access to characters, when such code is recompiled for a
platform with a different default encoding. The introduction of another
type, OSString or TFileName, can eliminate many implicit conversions
when such strings are passed to subroutines, but OTOH it can slow down
all other operations on that string type.
I'd ban indexed access altogether in the future, unless the default
encoding is UTF-32; otherwise the user has to accept a possibly
significant slowdown of his code, which contradicts the *intended*
optimization by direct (indexed) access to the string content.
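To show why indexing is the sticking point, here is a sketch (Python
again, for illustration only; code_units and nth_char_utf16 are
hypothetical helper names). The same three characters occupy a different
number of code units in each UTF form, and only in UTF-32 does index N
address character N directly.

```python
# 'a', MUSICAL SYMBOL G CLEF (U+1D11E, outside the BMP), 'b'
s = "a\U0001D11Eb"


def code_units(text: str, encoding: str, width: int) -> list:
    """Split the encoded form into code units of `width` bytes each."""
    data = text.encode(encoding)
    return [data[i:i + width] for i in range(0, len(data), width)]


utf8_units = code_units(s, "utf-8", 1)       # 1 + 4 + 1 units
utf16_units = code_units(s, "utf-16-le", 2)  # 1 + 2 + 1 units
utf32_units = code_units(s, "utf-32-le", 4)  # 1 + 1 + 1 units


def is_high_surrogate(unit: bytes) -> bool:
    return 0xD800 <= int.from_bytes(unit, "little") <= 0xDBFF


def nth_char_utf16(units: list, n: int) -> str:
    """Return character n (0-based) by scanning from the start: O(n)."""
    i = 0
    for _ in range(n):
        i += 2 if is_high_surrogate(units[i]) else 1
    count = 2 if is_high_surrogate(units[i]) else 1
    return b"".join(units[i:i + count]).decode("utf-16-le")


print(len(utf8_units), len(utf16_units), len(utf32_units))
# Naive indexing: unit 2 in UTF-16 is a low surrogate, not 'b'.
print(hex(int.from_bytes(utf16_units[2], "little")))
# Correct access has to scan, which is the slowdown in question.
print(nth_char_utf16(utf16_units, 2))
```

Code that treats s[3] as "the third character" therefore gives different
answers depending on which encoding the default string type has on the
target platform.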
Delphi has eliminated that discussion by declaring the (default)
UnicodeString fixed to UTF-16, for all targets. The only remaining
question is whether this was the best choice.
> P.S. I still do not understand how things can work correctly if the LCL
> expects all AnsiStrings (String) to be UTF8Strings, but the RTL/FCL does
> not strictly follow this (at least on Windows)?
Right, UTF8String should really be distinct from AnsiString, so that
the compiler can insert any conversions that may be required.