[fpc-devel] Encoded AnsiString
Hans-Peter Diettrich
DrDiettrich1 at aol.com
Sun Dec 29 19:26:57 CET 2013
Jonas Maebe wrote:
> The code page of ansistrings concatenations is the code page of the
> result to which this concatenation is assigned/converted. For
> rawbytestring, this code page is CP_ACP per Delphi compatibility.
This does not match my experience with Delphi XE :-(
Can you give a Delphi example, so that I can verify this behaviour?
> I'm inclined to add a global boolean variable to the system unit that
> allows changing this behaviour so that it uses CP_UTF8 instead in
> such cases (defaulting to false, for Delphi compatibility). In
> practice, setting it to true shouldn't cause problems even with
> virtually all Delphi, as routines that work with rawbytestring should
> be able to handle any code page anyway.
The Result of an f(...): RawByteString should be a string in the
encoding that results from its construction.
My view on RawByteString:
1) This type serves as a collector for AnsiStrings of any encoding,
where otherwise a conversion into UTF-16 (string) or CP_ACP (AnsiString)
would be required.
2) Variables of type RawByteString are intended only as *local*
variables, inside subroutines dealing with RawByteStrings.
3) Functions accepting RawByteStrings can provide fast results when the
encoding of the string arguments is the same; otherwise they have to use
Unicode (UTF-8/16) for intermediate results.
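
Point 3 can be sketched roughly as follows. This is only an
illustration, assuming FPC 3.x in Delphi mode; ConcatRaw and JoinBytes
are hypothetical names, not RTL routines. The bytes are joined by hand
because the encoding of a plain `a + b` result is exactly what this
thread is about.

```pascal
program RawConcatSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

// Hypothetical helper: byte-wise concatenation, result tagged with cp.
function JoinBytes(const a, b: RawByteString; cp: Word): RawByteString;
begin
  SetLength(Result, Length(a) + Length(b));
  if Length(a) > 0 then
    Move(a[1], Result[1], Length(a));
  if Length(b) > 0 then
    Move(b[1], Result[Length(a) + 1], Length(b));
  SetCodePage(Result, cp, False); // relabel only, no conversion
end;

// Hypothetical RawByteString routine with a fast and a slow path.
function ConcatRaw(const a, b: RawByteString): RawByteString;
var
  ta, tb: RawByteString;
begin
  if StringCodePage(a) = StringCodePage(b) then
    // Fast path: same dynamic encoding, no conversion needed.
    Result := JoinBytes(a, b, StringCodePage(a))
  else
  begin
    // Slow path: bring both arguments to UTF-8 first.
    ta := a;
    SetCodePage(ta, CP_UTF8, True);
    tb := b;
    SetCodePage(tb, CP_UTF8, True);
    Result := JoinBytes(ta, tb, CP_UTF8);
  end;
end;

var
  u1, u2: UTF8String;
  r: RawByteString;
begin
  u1 := 'abc';
  u2 := 'def';
  r := ConcatRaw(u1, u2);
  WriteLn(r, ' has code page ', StringCodePage(r));
end.
```

Using UTF-8 on the slow path is the choice argued for in this mail; a
Delphi-style routine would go through UTF-16 at that point instead.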
Rationale/observations:
[1] Delphi: Only UTF-16 and CP_ACP are explicitly supported in
overloaded string handling functions. This would require converting all
string arguments other than AnsiString(0) into UTF-16. A RawByteString
overload (instead of AnsiString(0)) allows processing an AnsiString(x)
without UTF-16 conversion, when the function code and argument encodings
do not require such a conversion. Otherwise the RawByteString overloads
convert all strings into UTF-16 internally, and back again into a
RawByteString Result. Since UTF-8 is not a specifically supported
encoding, UTF-16 must be converted back to CP_ACP instead, with possible
losses.
In fact the AnsiString(0) overloads in AnsiStrings.pas are another
optimization that does not check the encoding of the string arguments;
any necessary conversions are assumed to have been performed beforehand.
This leads to errors when the declared (static) string type of a
parameter does not match its actual (dynamic) encoding. Such irregular
strings can be constructed by wrong/unexpected use of RawByteString.
Example (XE):
var a: AnsiString; u: UTF8String;
function cpy(s: RawByteString):RawByteString;
begin Result := s; end;
a := cpy(u); //now a has encoding UTF-8!
Here the XE compiler omits the conversion of the RawByteString result to
the declared encoding of the target. Dunno about newer versions.
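
To check what a given compiler does here, the dynamic code page can be
inspected with StringCodePage. A minimal sketch, assuming FPC 3.x in
Delphi mode (under Delphi a console application would be needed); the
program name is made up, and the output depends on the compiler and
platform, so none is shown:

```pascal
program CheckCodePage;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

function cpy(s: RawByteString): RawByteString;
begin
  Result := s;
end;

var
  a: AnsiString;
  u: UTF8String;
begin
  u := 'abc';
  a := cpy(u);
  // The declared type of a implies the default system code page ...
  WriteLn('default system code page: ', DefaultSystemCodePage);
  // ... while StringCodePage reports what the string actually carries.
  WriteLn('dynamic code page of a:   ', StringCodePage(a));
  if StringCodePage(a) = CP_UTF8 then
    WriteLn('a carries UTF-8 although it is declared as AnsiString');
end.
```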
[3] Delphi: since the only explicitly supported lossless encoding is
UTF-16, the arguments of RawByteString string handling functions must be
converted to UTF-16 when their encodings are mixed, and the result
finally back to AnsiString. Here a conversion to CP_ACP may occur,
because the further use of a RawByteString result is unknown. Delphi
does not provide UTF-8 overloads, so this encoding cannot be used when a
UnicodeString has to be converted into a RawByteString.
FPC: when UTF-8 is used inside RawByteString routines, instead of
UTF-16, the RawByteString result can have exactly this encoding, for
lossless handling in further calls, until the result finally is assigned
to a variable/parameter of a fixed encoding. In particular, no
conversion to CP_ACP is required when UTF-8 is supported by overloads,
or as a special case of RawByteString arguments.
So IMO there exists no *requirement*, that intermediate Unicode strings
have to be converted to CP_ACP as RawByteString Results. This is only an
unfortunate consequence of the crippled Delphi handling of encodings
(disregarding UTF-8), with possible conversion losses. When UTF-8 is
used for intermediate Unicode strings, the RawByteString results can
preserve lossless UTF-8 encoding.
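
As an illustration of the last point, a Unicode intermediate result can
be handed back as UTF-8 and decoded again without loss. This is a
sketch only, assuming FPC 3.x in Delphi mode; ToRaw is a hypothetical
helper, while UTF8Encode and UTF8Decode are the existing RTL routines:

```pascal
program Utf8Intermediate;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

// Hypothetical helper: return a Unicode intermediate result as UTF-8
// instead of converting it to CP_ACP.
function ToRaw(const us: UnicodeString): RawByteString;
begin
  Result := UTF8Encode(us);
end;

var
  us: UnicodeString;
  r: RawByteString;
begin
  us := 'price: '#$20AC; // the Euro sign is missing in many ANSI code pages
  r := ToRaw(us);
  WriteLn('code page of r: ', StringCodePage(r));
  if UTF8Decode(r) = us then
    WriteLn('round trip preserved the text');
end.
```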
DoDi