[fpc-devel] String and UnicodeString and UTF8String

Wed Jan 12 12:46:41 CET 2011

LacaK wrote:

>> ...: the new ansistring type has a hidden "element size" field (in 
>> addition to the reference count, length and codepage), and from what I 
>> can see at page 10 of 
>> http://edn.embarcadero.com/article/images/38980/Delphi_and_Unicode.pdf, 
>> Delphi 2009's unicodestring is simply an ansistring(1200).
> So it seems that if we have a "GenericString" with the properties 
> "reference count", "size", "character width" and "codepage", then all 
> other string types can be based on it. The other strings would be mere 
> "shortcuts" that internally use the same structure:
> AnsiString = GenericString(with actual system ANSI code page (0) ... or 
> ... without any explicit codepage ($ffff))
> UTF8String = GenericString(with UTF-8 encoding)
> UnicodeString = GenericString(with UTF-16 encoding)

Nice from a management point of view, but it results in an ugly 
implementation. Apart from the generic form of the (internal) 
subroutines, we still need explicit code for most variations. Also, 
translation tables for *all* codepages must become part of every 
executable.

A truly polymorphic string class (or equivalent) would perform better, 
and would allow linking only the codepages actually used into an 
application. Such an implementation could add one more field, a VMT 
pointer, to the string prefix, and the UnicodeString could be 
implemented by a simple type cast from any (generic) string reference 
into a class reference.

> What is not agreed on is what the default string encoding should be 
> (AnsiString($ffff), UTF-8, UTF-16 or UTF-32).

The default (internal) string type must be a UTF type, else losses are 
inevitable during (implicit) conversions. This means that an SBCS 
AnsiString can never become the default encoding.
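
The kind of loss meant here can be shown with a small sketch. Python is 
used only for illustration; cp1252 stands in for an arbitrary 
single-byte ANSI codepage, and the helper name and sample text are 
assumptions, not part of any RTL:

```python
def round_trips(text: str, encoding: str) -> bool:
    """Return True if text survives an encode/decode cycle unchanged."""
    # Characters the target codepage cannot represent are replaced by '?',
    # which is roughly what a lossy implicit conversion does.
    encoded = text.encode(encoding, errors="replace")
    return encoded.decode(encoding) == text


# Mixes Western European, Central European and Greek letters.
sample = "Grüße, Łódź, Ω"

results = {
    enc: round_trips(sample, enc)
    for enc in ("cp1252", "utf-8", "utf-16-le", "utf-32-le")
}
print(results)
```

Only the single-byte codepage fails the round trip, because it has no 
slot for some of the characters; every UTF form preserves the text.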

The default type could be made platform dependent, so that UTF-16 would 
be used for Windows and UTF-8 for Linux platforms. But this will cause 
problems with code that assumes exactly one of these encodings, and uses 
indexed access to characters, when such code is recompiled for a 
platform with a different default encoding. The introduction of another 
type OSString or TFileName can eliminate many implicit conversions in 
passing such strings to subroutines, but OTOH can cause slowdown of all 
other operations with that string type.

I'd ban indexed access altogether in the future, unless the default 
encoding is UTF-32; otherwise the user has to accept a possible, more or 
less significant slowdown of his code, which stands in contrast to the 
*intended* optimization by direct (indexed) access to the string content.
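
To illustrate why indexing is only cheap in UTF-32, here is a rough 
sketch, again in Python and purely illustrative. The function names and 
the sample string are assumptions. It counts the code units per 
encoding and shows that finding the n-th character in UTF-8 requires a 
scan from the start of the string:

```python
def code_units(text: str, encoding: str, unit_size: int) -> int:
    """Number of code units text occupies in the given encoding."""
    return len(text.encode(encoding)) // unit_size


def nth_code_point_utf8(data: bytes, n: int) -> str:
    """Return the n-th (0-based) code point of UTF-8 data.

    The scan starts at byte 0, so the cost grows with n instead of
    being constant as with a fixed-width encoding.
    """
    index = -1
    start = None
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a new code point
            index += 1
            if index == n:
                start = pos
            elif index == n + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(n)
    return data[start:].decode("utf-8")


sample = "aé€😀z"  # 1-, 2-, 3- and 4-byte UTF-8 sequences
utf8_units = code_units(sample, "utf-8", 1)
utf16_units = code_units(sample, "utf-16-le", 2)
utf32_units = code_units(sample, "utf-32-le", 4)
print(utf8_units, utf16_units, utf32_units)
print(nth_code_point_utf8(sample.encode("utf-8"), 3))
```

The same five characters occupy a different number of code units in 
each encoding, so code that indexes by code unit behaves differently 
when the default encoding changes; only in UTF-32 does the index equal 
the character position.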

Delphi has eliminated that discussion by declaring the (default) 
UnicodeString fixed to UTF-16, for all targets. The only remaining 
question is whether this was the best choice.

> P.S. I still do not understand how things can work correctly if the LCL 
> expects all AnsiStrings (String) to be UTF8Strings, but the RTL/FCL does 
> not strictly follow this (at least on Windows)?

Right, UTF8String should be really distinct from AnsiString, so that 
any conversions that may be required can be inserted by the compiler.

DoDi