[fpc-devel] Encoded AnsiString

Sun Dec 29 19:26:57 CET 2013

Jonas Maebe wrote:

> The code page of ansistrings concatenations is the code page of the
> result to which this concatenation is assigned/converted. For
> rawbytestring, this code page is CP_ACP per Delphi compatibility.

This does not match my experience with Delphi XE :-(

Can you give a Delphi example, so that I can verify this behaviour?

> I'm inclined to add a global boolean variable to the system unit that
> allows changing this behaviour so that it uses CP_UTF8 instead in
> such cases (defaulting to false, for Delphi compatibility). In
> practice, setting it to true shouldn't cause problems even with
> virtually all Delphi, as routines that work with rawbytestring should
> be able to handle any code page anyway.

The Result of a function f(...): RawByteString should be a string with 
whatever encoding results from its construction.

My view on RawByteString:

1) This type serves as a collector for AnsiStrings of any encoding, in 
places where a conversion into UTF-16 (string) or CP_ACP (AnsiString) 
would otherwise be required.

2) Variables of type RawByteString are intended only as *local* 
variables, inside subroutines dealing with RawByteStrings.

3) Functions accepting RawByteStrings can deliver fast results when all 
string arguments have the same encoding; otherwise they have to use 
Unicode (UTF-8/16) for intermediate results.

Rationale/observations:

[1] Delphi: Only UTF-16 and CP_ACP are explicitly supported in 
overloaded string-handling functions. This would require converting all 
string arguments other than AnsiString(0) into UTF-16. A RawByteString 
overload (instead of AnsiString(0)) allows processing an AnsiString(x) 
without UTF-16 conversion, as long as the function code and the argument 
encodings do not require such a conversion. Otherwise the RawByteString 
overloads convert all strings into UTF-16 internally, and back again 
into a RawByteString Result. Since UTF-8 is not a specifically supported 
encoding, UTF-16 must be converted back to CP_ACP instead, with possible 
losses.

In fact the AnsiString(0) overloads in AnsiStrings.pas are another 
optimization: they do not check the encoding of the string arguments, 
and assume that any necessary conversions have been performed 
beforehand. This leads to errors when the declared (static) string type 
of a parameter does not match its actual (dynamic) encoding. Such 
irregular strings can be constructed by wrong/unexpected use of 
RawByteString. Example (XE):

var a: AnsiString; u: UTF8String;

function cpy(s: RawByteString): RawByteString;
begin Result := s; end;

a := cpy(u); // now a has encoding UTF-8!

Here the XE compiler omits the conversion of the RawByteString result to 
the declared encoding of the target. Dunno about newer versions.
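To check this on a given compiler version, the fragment above can be 
wrapped into a small console program that prints the dynamic code pages 
via StringCodePage. This is only a sketch: the program name and the 
ASCII test literal are my own choices, and the printed numbers depend on 
the compiler, its version and the system code page, so no expected 
output is given here.

```pascal
program RawByteResultDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Same function as in the fragment above: passes the string through
// without touching its dynamic encoding.
function cpy(s: RawByteString): RawByteString;
begin
  Result := s;
end;

var
  a: AnsiString;   // declared (static) encoding: CP_ACP
  u: UTF8String;   // declared (static) encoding: UTF-8
begin
  u := 'abc';      // plain ASCII, so the source file encoding does not matter
  a := cpy(u);

  // StringCodePage reports the actual (dynamic) code page of a string.
  WriteLn('u: ', StringCodePage(u));
  WriteLn('a: ', StringCodePage(a));
end.
```

If both lines show the same number although the system code page is not 
UTF-8, the compiler has omitted the conversion to the declared encoding 
of a.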

[3] Delphi: since the only explicitly supported lossless encoding is 
UTF-16, RawByteString string-handling functions called with arguments of 
mixed encodings must convert these arguments to UTF-16, and finally 
convert the result back to an AnsiString. Here a conversion to CP_ACP 
may occur, because the further use of a RawByteString result is unknown. 
Delphi does not provide UTF-8 overloads, so this encoding cannot be used 
when a UnicodeString has to be converted into a RawByteString.

FPC: when UTF-8 is used inside RawByteString routines, instead of 
UTF-16, the RawByteString result can have exactly this encoding, for 
lossless handling in further calls, until the result is finally assigned 
to a variable/parameter of a fixed encoding. In particular, no 
conversion to CP_ACP is required when UTF-8 is supported by overloads, 
or is handled as a special case of RawByteString arguments.
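A minimal sketch of what I mean, assuming the RTL routines 
StringCodePage and SetCodePage behave as documented. ConcatRaw is a 
hypothetical helper, not an existing RTL function: it copies the bytes 
directly when both arguments have the same dynamic encoding, and 
otherwise uses UTF-8 as the intermediate and result encoding. Strings 
marked CP_NONE are not handled.

```pascal
program MixedConcatDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper, for illustration only.
function ConcatRaw(const a, b: RawByteString): RawByteString;
var
  x, y: RawByteString;
begin
  // An empty operand carries no encoding of interest: return the other one.
  if Length(a) = 0 then begin Result := b; Exit; end;
  if Length(b) = 0 then begin Result := a; Exit; end;

  x := a;
  y := b;
  // Mixed encodings: convert both operands to UTF-8 (lossless)
  // instead of CP_ACP. SetCodePage works on private copies here.
  if StringCodePage(x) <> StringCodePage(y) then
  begin
    SetCodePage(x, CP_UTF8, True);
    SetCodePage(y, CP_UTF8, True);
  end;

  // Both operands now share one encoding: copy the bytes and
  // tag the result with that encoding, without any conversion.
  SetLength(Result, Length(x) + Length(y));
  Move(Pointer(x)^, Pointer(Result)^, Length(x));
  Move(Pointer(y)^, (PAnsiChar(Pointer(Result)) + Length(x))^, Length(y));
  SetCodePage(Result, StringCodePage(x), False);
end;

var
  u: UTF8String;
  a: AnsiString;
  r: RawByteString;
begin
  u := 'abc';
  a := 'def';
  r := ConcatRaw(u, a);
  WriteLn('u: ', StringCodePage(u), '  a: ', StringCodePage(a),
          '  result: ', StringCodePage(r));
end.
```

The result keeps its UTF-8 (or common) encoding until it is assigned to 
a variable of a fixed encoding; only there may a conversion take place.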

So IMO there is no *requirement* that intermediate Unicode strings have 
to be converted to CP_ACP when returned as RawByteString Results. This 
is merely an unfortunate consequence of Delphi's crippled handling of 
encodings (disregarding UTF-8), with possible conversion losses. When 
UTF-8 is used for intermediate Unicode strings, the RawByteString 
results can preserve the lossless UTF-8 encoding.

DoDi