[fpc-devel] RTL Unicode support
Hans-Peter Diettrich
DrDiettrich1 at aol.com
Fri Aug 24 13:08:08 CEST 2012
Here's what I feel necessary for dealing with Unicode in all encodings.
The Pos function demonstrates the problems with indexed string access.
When automatic conversion of the string occurs, e.g. an AnsiString is
converted into a UnicodeString (UTF-16), the index returned for the
converted string cannot be used with the original string.
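
The index mismatch can be made concrete with a small sketch. It is
written in Python rather than Pascal, purely to show the arithmetic. The
helpers pos_bytes and pos_utf16 are hypothetical names that only mimic
the 1-based result of Pos, and the AnsiString is assumed to hold UTF-8
data (with a single-byte code page and BMP-only text the two indexes
would happen to coincide).

```python
def pos_bytes(sub: bytes, s: bytes) -> int:
    """1-based byte index of sub in s, 0 if absent (mimics Pos on an AnsiString)."""
    return s.find(sub) + 1


def pos_utf16(sub: str, s: str) -> int:
    """1-based UTF-16 code-unit index of sub in s, 0 if absent."""
    hay = s.encode("utf-16-le")
    needle = sub.encode("utf-16-le")
    start = 0
    while True:
        i = hay.find(needle, start)
        if i < 0:
            return 0
        if i % 2 == 0:  # only accept matches aligned on a code unit
            return i // 2 + 1
        start = i + 1


text = "Grüße, Welt"
ansi = text.encode("utf-8")  # assumption: the AnsiString holds UTF-8

idx_original = pos_bytes("Welt".encode("utf-8"), ansi)   # index into the bytes
idx_converted = pos_utf16("Welt", text)                  # index into UTF-16

print(idx_original, idx_converted)  # 10 8
# Applying the UTF-16 index to the original bytes lands two bytes too early:
print(ansi[idx_converted - 1:idx_converted + 3])  # b', We'
```

The two-byte sequences for "ü" and "ß" shift every later position, so
the index found in the converted string points in front of "Welt" in the
original.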
This means that Pos(Substr, S) must come in distinct overloaded
functions, for UnicodeString and RawByteString strings S. Only the
Substr argument may be converted to the encoding of S, the encoding of S
must be preserved in any case.
When the compiler cannot guarantee that S is left unconverted, we end up
with three versions:
Pos(UnicodeString, UnicodeString)
Pos(RawByteString, RawByteString)
Pos(UnicodeString, RawByteString)
The last version is tricky to implement; it may even require compiler
magic. The problem is the conversion of Substr into the encoding of
S, for which a function like MakeCompatibleAnsiString(Substr, S) may be
desirable, returning a RawByteString containing Substr in the encoding
of S.
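
A minimal sketch of that idea, again in Python only for illustration.
The function make_compatible_ansi_string stands in for the proposed
MakeCompatibleAnsiString. Its signature, the explicit codepage
parameter, and the choice to return 0 for a Substr that cannot be
expressed in the code page of S are all assumptions, not an existing
RTL API.

```python
from typing import Optional


def make_compatible_ansi_string(substr: str, codepage: str) -> Optional[bytes]:
    """Encode substr in the code page of S; None if it is not representable."""
    try:
        return substr.encode(codepage)
    except UnicodeEncodeError:
        return None


def pos(substr: str, s: bytes, codepage: str) -> int:
    """Pos(UnicodeString, RawByteString): 1-based byte index into S, 0 if absent.

    S itself is never converted, so the result stays valid for the caller's string.
    """
    needle = make_compatible_ansi_string(substr, codepage)
    if needle is None:
        return 0  # S's code page cannot express Substr, so it cannot occur in S
    return s.find(needle) + 1


s = "Grüße, Welt".encode("cp1252")  # hypothetical AnsiString(1252) content

print(pos("Welt", s, "cp1252"))  # 8
print(pos("ß", s, "cp1252"))     # 4
print(pos("Ω", s, "cp1252"))     # 0, not representable in cp1252
```

Note that a plain byte search is only safe for single-byte code pages
and UTF-8. In other multi-byte code pages a match could start in the
middle of a character, so a real implementation would need more care.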
The same considerations apply to StringReplace and all other functions
with more than one string argument. Here the problems are less critical,
because these functions can always return a valid UnicodeString result,
which can be converted back into an AnsiString automatically, if
required. Nonetheless, these conversions should be avoided whenever possible.
Obviously no such problems exist when the *default* string type is
UnicodeString (UTF-16), since then only lossless conversions into UTF-16
may be required. Otherwise, when one AnsiString encoding is converted
into another one, this is done by a conversion into Unicode first,
followed by another conversion into the final AnsiString.
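
The two-step conversion can be sketched as follows, in Python for
illustration only. The helper convert_ansi and the chosen code pages are
assumptions made for the example. It shows that the step into Unicode is
lossless, while the step into the target AnsiString encoding need not be.

```python
def convert_ansi(data: bytes, src_cp: str, dst_cp: str) -> bytes:
    """Convert between two code pages via Unicode as the intermediate form."""
    pivot = data.decode(src_cp)                    # step 1: source code page -> Unicode
    return pivot.encode(dst_cp, errors="replace")  # step 2: Unicode -> target code page


src = "café".encode("cp1252")

# Into UTF-16 and back again: nothing is lost.
roundtrip = src.decode("cp1252").encode("utf-16-le").decode("utf-16-le").encode("cp1252")
print(roundtrip == src)  # True

print(convert_ansi(src, "cp1252", "utf-8"))   # b'caf\xc3\xa9'
print(convert_ansi(src, "cp1252", "cp1251"))  # b'caf?', the target lacks the character
```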
How does this change when the default string type is UTF-8, instead of
UTF-16?
In this case the compiler will automatically convert incompatible
strings into UTF-8, so that the Pos problem persists, and the number of
automatic Unicode/AnsiString conversions remains the same, only in the
opposite direction.
Does this increase the number of recommended overloads?
Now we'll have to consider:
f(String) = f(UTF8String)
f(UnicodeString) = f(UTF16String)
f(AnsiString(0)) if CP_ACP is not UTF-8
f(RawByteString) for all non-UTF-8 AnsiStrings
IMO the last variation still can be omitted, due to the automatic
conversion into UTF-8 (the default String type).
Delphi, however, supplies overloaded string handling for UTF-16
(String=UnicodeString) and CP_ACP (AnsiString(0)), assuming that all
(input) strings have already been converted into one of these encodings.
In FPC this would require another overload version, when CP_ACP is not
UTF-8 and String is not UnicodeString.
My conclusion, so far:
FPC should either stay with String=AnsiString on every platform, because
this requires only one set of AnsiString components, still covering full
Unicode in UTF-8 encoding,
or, when a Delphi-compatible UnicodeString (UTF-16) is introduced, this
should also become the default String type on every platform, and for all
components. Another advantage with regard to indexing: since UTF-16
strings never require an automatic encoding conversion, an index into
such a string cannot become invalid.
The third alternative would be a RawByteString which could hold either Ansi
(AnsiChar) or UTF-16 (WideChar) data, as originally intended by
Embarcadero, but dropped later. This model *could* adapt itself to the
concrete platform, when all strings tend to become the platform default
encoding sooner or later. But I doubt that this will really eliminate
excess conversions, and indexing such strings will become *very* unreliable.
DoDi