[fpc-devel] RFC: proper interpretation and implementation of Unicode Support

Hans-Peter Diettrich DrDiettrich1 at aol.com
Fri Nov 28 20:19:45 CET 2014


In response to another thread (this should start a new thread):

>> "CP_NONE: this value indicates that no code page information has been
>> associated with the string data. The result of any explicit or implicit
>> operation that converts this data to another code page is undefined."

After rereading it I find this definition incorrect; the entire section 
(and more) deserves correction and clarification. The implementation may 
have to be changed accordingly.

This is my interpretation of the Delphi API around encoded AnsiStrings, 
as documented and implemented there, with added clarifications and notes 
on omissions and possible problems on non-Windows platforms.

I do not expect that the FPC developers fully agree with this 
interpretation, but I expect that all items of a revised version of the 
following draft become part of the FPC documentation, somehow.

<Draft>

1) CP_ACP, CP_OEM and CP_NONE are "generic" encodings (placeholders), 
applicable as *static* string encodings inside a program only; they 
can never denote a dynamic string encoding.

Note: "codepage" here means byte-based ANSI/ISO codepages, applicable to 
AnsiStrings, not Unicode codepages (BMP...). While CP_UTF16 (and BE/LE 
variations) can be used to specify a concrete (string,textfile...) 
*encoding*, they do not describe codepages (neither Ansi nor Unicode).

Note: these identifiers (names) should be used with extreme care in 
documentation/discussions. In most cases CP_ACP stands for the *actual* 
default encoding, equivalent to the value of a hypothetical *variable* 
named CP_ACP, i.e. currently (see below) it should be understood as 
DefaultSystemCodePage. It should be made clear that the value of the 
CP_ACP *constant* identifier (=0) is meant and usable only in a few 
cases, like in the declaration of a string type; it may also be 
acceptable in explicit conversion requests, and to denote the encoding 
to use in file/stream I/O, where the functions replace CP_ACP by the 
actual (DefaultSystemCodePage) value internally.

Note: in compiler, library and application code a value of CP_ACP should 
be considered equal to (be mapped into) the actual 
(DefaultSystemCodePage) encoding.
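
The mapping described in the notes above can be sketched as follows. 
This is only an illustration, assuming an FPC version with code page 
aware AnsiStrings (3.0 or later); ResolveCodePage is a hypothetical 
helper name, not an existing RTL function, and CP_OEM/CP_NONE are 
deliberately left untouched here.

```pascal
program ResolveAcp;
{$mode objfpc}{$H+}

// Hypothetical helper (not part of the RTL): replaces the generic
// CP_ACP placeholder by the concrete default code page and returns
// every other value unchanged.
function ResolveCodePage(cp: TSystemCodePage): TSystemCodePage;
begin
  if cp = CP_ACP then
    Result := DefaultSystemCodePage
  else
    Result := cp;
end;

begin
  // The printed numbers depend on the platform and its settings.
  WriteLn('CP_ACP constant:           ', CP_ACP);
  WriteLn('DefaultSystemCodePage:     ', DefaultSystemCodePage);
  WriteLn('ResolveCodePage(CP_ACP):   ', ResolveCodePage(CP_ACP));
  WriteLn('ResolveCodePage(CP_UTF8):  ', ResolveCodePage(CP_UTF8));
end.
```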

2) A platform (or Unicode library) may or may not provide its own 
*generic* values (constants) for application (CP_ACP) and console 
(CP_OEM) encoding, as well as further constants for e.g. filenames.

Note: CP_ACP is zero on Windows, possibly different on other platforms 
or libraries. Thus AnsiString(0) may be different from 
AnsiString(CP_ACP). It may be required to distinguish between a named 
Pascal constant CP_ACP=0, and the value of the generic 
application/default encoding in API calls (CP_SYS?).

3) The *actually* associated codepages are defined by the platform and 
can possibly be changed by the user (admin). A program may or may not 
be allowed to change the associated codepages, either locally (process 
wide) or globally (system wide).

Note: the name "DefaultSystemCodePage" should be reserved for the 
*system* defined codepage. When this setting can be different from an 
application-wide setting, another DefaultApplicationCodePage variable 
should be added. See the comments on Modifications and Notes on 
DefaultSystemCodePage in the Wiki page!

Note: a process should determine (retrieve) the platform settings 
*before* any attempt to interpret system-provided strings (commandline, 
environment variables...). Depending on the platform, more generic 
settings may apply to specific strings, like for filenames. In all 
external API calls, the RTL is responsible for the correct encoding of 
all string arguments, as expected by the called function. This applies 
in particular to CP_ACP, when this encoding can be changed inside a 
program to something different from the external (platform...) setting.

4) A RawByteString variable, of the static encoding CP_NONE, can hold 
strings of *any* dynamic encoding. No conversion is performed when a 
string is assigned to such a variable. In the opposite direction the 
standard handling should apply, i.e. different static encodings require 
a conversion into the static target encoding.

Note: It is known that Delphi does not always convert a RawByteString 
in an assignment to a variable of a different type. This flaw should be 
fixed in FPC. Is the corresponding Delphi behaviour *defined* anywhere?
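
A small test program makes the behaviour in question observable. It is 
a sketch only, assuming FPC 3.0 or later; the type name TCP1252String 
is chosen for illustration. ASCII-only content is used so that the 
bytes are valid in every codepage involved, and no particular output 
is claimed, since the result of the second assignment is exactly what 
is under discussion.

```pascal
program RawByteDemo;
{$mode objfpc}{$H+}

type
  // Illustrative type with a static encoding of Windows-1252.
  TCP1252String = type AnsiString(1252);

var
  raw: RawByteString;
  w: TCP1252String;
begin
  raw := 'abc';
  // Tag the bytes as UTF-8 without converting them (Convert = False).
  SetCodePage(raw, CP_UTF8, False);
  WriteLn('raw, dynamic codepage:          ', StringCodePage(raw));

  // Static encodings differ (CP_NONE vs. 1252). Whether a conversion
  // happens here is the open question; inspect the result.
  w := raw;
  WriteLn('w after w := raw, dyn. codepage: ', StringCodePage(w));
end.
```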

5) Use StringCodePage to get an actual (dynamic) string encoding. 
StringCodePage never returns one of the generic values. The dynamic 
codepage of an unassigned (empty) string is assumed (by Delphi) to be 
the currently selected CP_ACP codepage for AnsiString arguments, and 
CP_UTF16 (or whatever is applicable) for UnicodeString arguments.

Note: while an unassigned (empty) string variable has a static encoding, 
known to the compiler, this encoding is unknown to StringCodePage. The 
overloaded Ansi/Unicode versions of StringCodePage only know about the 
basic string type (Ansi/Unicode) of their arguments, but cannot 
determine a static encoding from the nonexistent string header. That's 
why in this case they return the corresponding default encoding, as 
assumed in default type declarations, where AnsiString becomes 
AnsiString(CP_ACP).

Note: The Unicode overload is questionable, since contrary to its name 
it returns an *encoding*, not a *codepage*. It should return the 
*native* (CPU specific BE/LE) UTF-16 encoding, used for strings declared 
as UnicodeString.
[Currently I cannot check the applicable Delphi constants and behaviour 
on non-Intel platforms]
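
What StringCodePage reports for empty strings can be checked with a 
few lines. This is a sketch, assuming FPC 3.0 or later, where both the 
RawByteString and the UnicodeString overloads exist; the printed values 
depend on platform and settings, so none are stated here.

```pascal
program EmptyCodePage;
{$mode objfpc}{$H+}

var
  a: AnsiString;
  u8: UTF8String;
  us: UnicodeString;
begin
  // Empty strings are nil pointers, so there is no string header
  // from which a dynamic codepage could be read.
  a := '';
  u8 := '';
  us := '';
  WriteLn('DefaultSystemCodePage: ', DefaultSystemCodePage);
  WriteLn('empty AnsiString:      ', StringCodePage(a));
  // The static encoding (UTF-8) is known to the compiler only.
  WriteLn('empty UTF8String:      ', StringCodePage(u8));
  WriteLn('empty UnicodeString:   ', StringCodePage(us));
end.
```

If the empty UTF8String reports the same value as the empty AnsiString, 
that illustrates the point of the note above: the static encoding of 
the variable is not visible to StringCodePage.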

</Draft>



>> IMO the result is well defined: it's the string with the encoding of
>> that "other" codepage.
> 
> Unless you actually tested this on all platforms and noted that is the
> case, you cannot state this. And if you would actually test it, you
> would discover that it is wrong
> (http://bugs.freepascal.org/view.php?id=22501#c61238 ).

In that discussion I found several errors which are neither detected by 
the compiler nor handled in the RTL. That particular entry mentions the 
illegal use of the *generic* CP_NONE identifier. That's why I felt a 
need to address several specific topics in the draft above.

DoDi
