[fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"

Wed Nov 26 18:37:04 CET 2014

Michael Schnell wrote:
> I fail to understand some of the text.
> 
> It seems to be unavoidable to use the name "ANSIString" even though I 
> always throw up when seeing a thing called "ANSI" containing Unicode 
> (e. g.   "UTF8String = type AnsiString(CP_UTF8)" ).
> 
> 
> Seemingly here the "bytes per character" setting implicitly is thought 
> of as a part of the "code-page" definition. Correct?

An AnsiString consists of AnsiChars. The *meaning* of these chars 
(bytes) depends on their encoding, regardless of whether the encoding 
in use is stored with the string or not.

It's essential to distinguish between low-level (physical) AnsiChar 
values, and *logical* characters possibly consisting of multiple AnsiChars.
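
To illustrate the distinction, here is a minimal sketch. It assumes a 
recent FPC with code page aware strings; the program name is made up, 
and the bytes are written one by one so that the result does not 
depend on the source file code page.

```pascal
program PhysicalVsLogical;
{$mode objfpc}{$H+}
var
  s: UTF8String;
begin
  // One logical character, U+00E4, stored as two physical AnsiChars.
  SetLength(s, 2);
  s[1] := #$C3;  // first UTF-8 byte of U+00E4
  s[2] := #$A4;  // second UTF-8 byte of U+00E4

  // Length counts AnsiChars (bytes), not logical characters.
  WriteLn('AnsiChars (bytes): ', Length(s));

  // After decoding to UnicodeString, Length counts UTF-16 code units.
  WriteLn('UTF-16 code units: ', Length(UTF8Decode(s)));
end.
```

The two numbers should differ, because Length on an AnsiString always 
counts the physical units.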

> In section "Dynamic code page":
> 
> "When assigning a string to a plain AnsiString (= AnsiString(CP_ACP)) or 
> ShortString, the string data will however be converted to 
> DefaultSystemCodePage. The dynamic code page of that AnsiString(CP_ACP) 
> will then be the current value of DefaultSystemCodePage (e.g. 1250 for 
> the Windows-1250 code page), even though its static code page is CP_ACP 
> (which is a constant <> 1250). This is one example of how the static 
> code page can differ from the dynamic code page. Subsequent sections 
> will describe more such scenarios."
> 
> 1) A short String does not have a Code page notification so for this 
> "static code page can differ from the dynamic code page" does not seem 
> to make much sense.

The text correctly states "dynamic code page of that AnsiString". 
ShortString (and AnsiChar) have no encoding indicator; they are assumed 
to be encoded in CP_ACP.

> 2) I fail to understand how with this explanation that seems to force 
> auto conversion for assignments between types with different "code page" 
> settings (also for CP_ACP) the "static code page can differ from the 
> dynamic code page" can happen.

Continue reading until you have understood the special handling of 
string literals and RawByteString.

> In fact this disaster seems to be able to happen (see section 
> "RawByteString") if assigning a string with a static code page X1 to a 
> RawByteString (hence no conversion) and then assigning that 
> RawByteString to a string with a static code page X2 (no conversion 
> again). In fact I assume that without abusing RawByteString such 
> "intersexual" strings can't be produced, otherwise this would be rather 
> disastrous for normal users.

*All* intermediate strings, generated during the evaluation of string 
expressions, only have a dynamic encoding, thus can be considered as 
being RawByteStrings.

That's why I wonder *when* exactly the result of such an expression *is* 
converted (implicitly) into the static encoding of the target variable, 
and when *not*.

Obviously the compiler inserts a conversion request for the *direct* 
assignment of one string variable to another one of a different 
*static* encoding. But what happens when a string expression doesn't 
have such a known static encoding?

> In section "RawByteString":
> 
> "the results of conversions from/to the CP_NONE code page are undefined."
> 
> In effect the behavior is exactly defined in this section "As a first 
> approximation".

Right, the result *is* well defined, but has no *predetermined* dynamic 
encoding.

The entire mess results from a bad interpretation of RawByteString 
assignments. IMO the type was well thought out by the Delphi language 
architects, but not understood by the Delphi compiler coders. This 
interpretation also found its way into FPC:

"Less intuitive is probably that when a RawByteString is assigned to an 
AnsiString(X), the same happens: no code page conversion[...]"

It's clear that a conversion *can* be omitted for every assignment *to* 
a RawByteString. That's one of the purposes of that type - to avoid 
excess conversions into CP_ACP or UnicodeString.

But it's unclear why the heck the conversion should also be omitted on 
assignment to any *other* AnsiString type, as soon as the source string 
is a RawByteString.
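
A sketch of the scenario in question, assuming the behaviour the wiki 
describes (no conversion in either direction through a RawByteString). 
The type name Win1250String and the program name are hypothetical, and 
I have not verified the printed value on every FPC version.

```pascal
program RawMismatch;
{$mode objfpc}{$H+}
type
  // Hypothetical type with static code page 1250.
  Win1250String = type AnsiString(1250);
var
  u: UTF8String;
  r: RawByteString;
  w: Win1250String;
begin
  // Build the UTF-8 bytes of U+00E4 by hand and tag them as UTF-8
  // without converting anything.
  SetLength(u, 2);
  u[1] := #$C3;
  u[2] := #$A4;
  SetCodePage(RawByteString(u), CP_UTF8, False);

  r := u;  // assignment *to* RawByteString: no conversion
  w := r;  // assignment *from* RawByteString: per the wiki, no conversion

  // If the wiki is right, the dynamic code page printed here is that
  // of the UTF-8 data, not the static code page 1250 of w.
  WriteLn('static: 1250, dynamic: ', StringCodePage(w));
end.
```

With the switch proposed below, the second assignment would instead 
convert the data to code page 1250.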

Therefore I'd suggest a compiler switch, implementing the lame Delphi 
compatible behaviour only on *demand*, while the FPC default would force 
any necessary conversion with *every* assignment to any other 
(non-CP_NONE) AnsiString type. This simple change will safely prevent 
strings of different static and dynamic encoding, so that the 
corresponding tests can be removed safely from library *and* user code.

The proper use of RawByteStrings deserves further documentation, for 
users who want/need their own (generic) string handling routines. Topics 
should be:
- how to determine the dynamic encoding of strings (StringCodePage)
- how to force required conversions (SetCodePage)
- how to deal with strings of different encodings
- how to minimize the number of string conversions
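
As a starting point for the first two topics, a minimal sketch of a 
generic routine. ToUtf8 is a hypothetical helper, not an RTL function; 
StringCodePage and SetCodePage are the System unit routines named 
above. On Unix-like targets a widestring manager (e.g. unit cwstring) 
is probably needed for the conversion to work on non-ASCII data.

```pascal
program EnsureUtf8Demo;
{$mode objfpc}{$H+}

// Hypothetical helper: accepts a string of any code page without an
// implicit conversion and converts only when actually required.
function ToUtf8(const s: RawByteString): RawByteString;
begin
  Result := s;
  // Skip empty strings, strings that already are UTF-8, and CP_NONE
  // strings, whose conversion result is undefined.
  if (Result <> '') and
     (StringCodePage(Result) <> CP_UTF8) and
     (StringCodePage(Result) <> CP_NONE) then
    SetCodePage(Result, CP_UTF8, True);  // True = convert the data
end;

var
  a: AnsiString;
begin
  a := 'abc';
  WriteLn('before: ', StringCodePage(a));
  WriteLn('after:  ', StringCodePage(ToUtf8(a)));
end.
```

Checking the dynamic code page first keeps the number of conversions 
down, which is the point of the last topic.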

DoDi