[fpc-devel] RTL Unicode support

Fri Aug 24 13:08:08 CEST 2012

Here's what I feel is necessary for dealing with Unicode in all encodings.

The Pos function demonstrates the problems with indexed string access. 
When automatic conversion of the string occurs, i.e. an AnsiString is 
converted into a UnicodeString (UTF-16), the returned index into the 
converted string cannot be used with the original string.
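
To illustrate the index problem, here is a minimal sketch in Python rather than FPC code. It assumes the AnsiString holds UTF-8 bytes and models the UnicodeString by counting UTF-16 code units; pos_bytes and pos_utf16 are hypothetical helpers, not RTL functions.

```python
def pos_bytes(substr: bytes, s: bytes) -> int:
    """1-based byte index of substr in s, 0 if absent (like Pos on an AnsiString)."""
    return s.find(substr) + 1


def pos_utf16(substr: str, s: str) -> int:
    """1-based index in UTF-16 code units, 0 if absent (like Pos on a UnicodeString)."""
    i = s.find(substr)
    if i < 0:
        return 0
    # Count the UTF-16 code units that precede the match.
    return len(s[:i].encode("utf-16-le")) // 2 + 1


text = "äb"                      # 'ä' is U+00E4
ansi = text.encode("utf-8")      # assumed AnsiString payload: C3 A4 62

index_in_ansi = pos_bytes(b"b", ansi)    # 3: 'ä' occupies two bytes
index_in_utf16 = pos_utf16("b", text)    # 2: 'ä' is a single code unit

# Applying the UTF-16 index to the original bytes hits the middle of 'ä'.
wrong_byte = ansi[index_in_utf16 - 1:index_in_utf16]   # b'\xa4', not b'b'
```

The two indexes differ as soon as a single non-ASCII character precedes the match, so an index obtained from the converted string points at the wrong place in the original.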

This means that Pos(Substr, S) must come in distinct overloaded 
functions for UnicodeString and RawByteString strings S. Only the 
Substr argument may be converted to the encoding of S; the encoding of S 
must be preserved in any case.

When the compiler cannot assure that S is unchanged, we'll end up with 3 
versions:
Pos(UnicodeString, UnicodeString)
Pos(RawByteString, RawByteString)
Pos(UnicodeString, RawByteString)

The latter version is tricky to implement; it may even require compiler 
magic. The problem is the conversion of the Substr into the encoding of 
S, for which a function like MakeCompatibleAnsiString(Substr, S) may be 
desirable, returning a RawByteString containing Substr in the encoding 
of S.
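
One possible shape of such a helper, sketched in Python under loud assumptions: an AnsiString is modelled as a byte payload plus a codec name (the AnsiStr type is hypothetical), and make_compatible_ansi_string and pos_mixed are illustrative names only, not existing RTL functions. If Substr cannot be represented in the encoding of S, it cannot occur in S, so the search reports 0.

```python
from typing import NamedTuple, Optional


class AnsiStr(NamedTuple):
    """Hypothetical model of an AnsiString: raw bytes plus their code page."""
    data: bytes
    codepage: str


def make_compatible_ansi_string(substr: str, s: AnsiStr) -> Optional[AnsiStr]:
    """Encode substr in the code page of s; None if it is not representable."""
    try:
        return AnsiStr(substr.encode(s.codepage), s.codepage)
    except UnicodeEncodeError:
        return None


def pos_mixed(substr: str, s: AnsiStr) -> int:
    """Model of Pos(UnicodeString, RawByteString): s is never converted."""
    needle = make_compatible_ansi_string(substr, s)
    if needle is None or not needle.data:
        return 0
    return s.data.find(needle.data) + 1   # 1-based byte index into s


latin = AnsiStr("naïve café".encode("cp1252"), "cp1252")
utf8 = AnsiStr("naïve café".encode("utf-8"), "utf-8")

pos_in_latin = pos_mixed("café", latin)   # 7: every character is one byte
pos_in_utf8 = pos_mixed("café", utf8)     # 8: 'ï' takes two bytes
```

The result is a byte index that stays valid for S itself. A plain byte search like this should only be trusted for single-byte code pages and UTF-8; other multi-byte code pages would probably need a character-aware search.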

The same considerations apply to StringReplace and all other functions 
with more than one string argument. Here the problems are less critical, 
because these functions can always return a valid UnicodeString result, 
which can be converted back into an AnsiString automatically, if 
required. Nonetheless, these conversions should be avoided whenever possible.

Obviously no such problems exist when the *default* string type is 
UnicodeString (UTF-16), since then only lossless conversions into UTF-16 
may be required. Otherwise, when one AnsiString encoding is converted 
into another one, this is done by a conversion into Unicode first, 
followed by another conversion into the final AnsiString.
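
The two-step conversion can be sketched as follows (Python, for illustration only; convert_ansi is a hypothetical name). For valid input the first step loses nothing, while the second step is lossy whenever the target code page lacks a character.

```python
def convert_ansi(data: bytes, src: str, dst: str) -> bytes:
    """Convert between two AnsiString encodings via Unicode."""
    unicode_text = data.decode(src)                     # step 1: into Unicode
    return unicode_text.encode(dst, errors="replace")   # step 2: '?' for unmappable chars


euro_cp1252 = "€".encode("cp1252")                            # b'\x80'
as_utf8 = convert_ansi(euro_cp1252, "cp1252", "utf-8")        # b'\xe2\x82\xac'
as_latin1 = convert_ansi(euro_cp1252, "cp1252", "latin-1")    # b'?', information lost
```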

How does this change when the default string type is UTF-8, instead of 
UTF-16?
In this case the compiler will automatically convert incompatible 
strings into UTF-8, so that the Pos problem persists, and the number of 
automatic Unicode/AnsiString conversions remains the same, only in the 
opposite direction.

Does this increase the number of recommended overloads?
Now we'll have to consider:
   f(String) = f(UTF8String)
   f(UnicodeString) = f(UTF16String)
   f(AnsiString(0)) if CP_ACP is not UTF-8
   f(RawByteString) for all non-UTF-8 AnsiStrings
IMO the last variation can still be omitted, due to the automatic 
conversion into UTF-8 (the default String type).

Delphi, however, supplies overloaded string handling for UTF-16 
(String=UnicodeString) and CP_ACP (AnsiString(0)), assuming that all 
(input) strings have already been converted into one of these encodings.
In FPC this would require another overload version, when CP_ACP is not 
UTF-8 and String is not UnicodeString.

My conclusion, so far:

FPC should either stay with String=AnsiString on every platform, because 
this requires only one set of AnsiString components, still covering full 
Unicode in UTF-8 encoding,

or, when Delphi-compatible UnicodeString (UTF-16) is introduced, this also 
should become the default String type on every platform, and for all 
components. Another pro WRT indexing: since UTF-16 strings never 
require an automatic encoding conversion, an index into such a string 
cannot become invalid.

The third alternative would be a RawByteString which could hold either Ansi 
(AnsiChar) or UTF-16 (WideChar) data, as originally intended by 
Embarcadero, but dropped later. This model *could* adapt itself to the 
concrete platform, if all strings tend to become the platform default 
encoding sooner or later. But I doubt that this will really eliminate 
excess conversions, and indexing such strings will become *very* unreliable.

DoDi