[fpc-devel] RFC: proper interpretation and implementation of Unicode Support
Hans-Peter Diettrich
DrDiettrich1 at aol.com
Fri Nov 28 20:19:45 CET 2014
In response to another thread (this should start a new thread):
>> "CP_NONE: this value indicates that no code page information has been
>> associated with the string data. The result of any explicit or implicit
>> operation that converts this data to another code page is undefined."
After rereading, I find this definition incorrect. The entire section
(and more) deserves correction and clarification, and the implementation
may have to be changed accordingly.
This is my interpretation of the Delphi API around encoded AnsiStrings,
as documented and implemented there, with added clarifications and notes
on omissions and possible problems on non-Windows platforms.
I do not expect that the FPC developers fully agree with this
interpretation, but I expect that all items of a revised version of the
following draft become part of the FPC documentation, somehow.
<Draft>
1) CP_ACP, CP_OEM and CP_NONE are "generic" encodings (placeholders),
applicable only as *static* string encodings inside a program; they
can never denote a dynamic string encoding.
Note: "codepage" here means byte-based ANSI/ISO codepages, applicable to
AnsiStrings, not Unicode codepages (BMP...). While CP_UTF16 (and BE/LE
variations) can be used to specify a concrete (string,textfile...)
*encoding*, they do not describe codepages (neither Ansi nor Unicode).
Note: these identifiers (names) should be used with extreme care in
documentation/discussions. In most cases CP_ACP stands for the *actual*
default encoding, equivalent to the value of a hypothetical *variable*
named CP_ACP, i.e. it should currently (see below) be understood as
DefaultSystemCodePage. It should be made clear that the value of the
CP_ACP *constant* identifier (=0) is meant and usable only in a few
cases, like in the declaration of a string type; it may also be
acceptable in explicit conversion requests, and to denote the encoding
to use in file/stream I/O, where the functions replace CP_ACP by the
actual (DefaultSystemCodePage) value internally.
Note: in compiler, library and application code a value of CP_ACP should
be considered equal to (be mapped into) the actual
(DefaultSystemCodePage) encoding.
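
The mapping described in the notes above can be sketched as follows. This
is only an illustration for FPC: EffectiveCodePage is a hypothetical
helper name, not an RTL function, and the printed values depend on the
platform and its settings.

```pascal
program EffectiveCP;
{$mode objfpc}{$H+}

{ Hypothetical helper (not part of the RTL): resolves the generic CP_ACP
  placeholder to the code page that is actually in effect.
  All other values are passed through unchanged. }
function EffectiveCodePage(cp: TSystemCodePage): TSystemCodePage;
begin
  if cp = CP_ACP then
    Result := DefaultSystemCodePage
  else
    Result := cp;
end;

begin
  WriteLn('CP_ACP constant:           ', CP_ACP);
  WriteLn('DefaultSystemCodePage:     ', DefaultSystemCodePage);
  WriteLn('EffectiveCodePage(CP_ACP): ', EffectiveCodePage(CP_ACP));
  WriteLn('EffectiveCodePage(CP_UTF8):', EffectiveCodePage(CP_UTF8));
end.
```

Code that compares or stores encodings would call such a helper first, so
that the placeholder value 0 is never mistaken for a concrete code page.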
2) A platform (or Unicode library) may or may not provide its own
*generic* values (constants) for application (CP_ACP) and console
(CP_OEM) encoding, as well as further constants for e.g. filenames.
Note: CP_ACP is zero on Windows, possibly different on other platforms
or libraries. Thus AnsiString(0) may be different from
AnsiString(CP_ACP). It may be required to distinguish between a named
Pascal constant CP_ACP=0, and the value of the generic
application/default encoding in API calls (CP_SYS?).
3) The *actually* associated codepages are defined by the platform,
and may possibly be changed by the user (admin). A program may or may not
be allowed to change the associated codepages, either locally (process
wide) or globally (system wide).
Note: the name "DefaultSystemCodePage" should be reserved for the
*system* defined codepage. When this setting can be different from an
application-wide setting, another DefaultApplicationCodePage variable
should be added. See the comments on Modifications and Notes on
DefaultSystemCodePage in the Wiki page!
Note: a process should determine (retrieve) the platform settings
*before* any attempt to interpret system-provided strings (commandline,
environment variables...). Depending on the platform, more generic
settings may apply to specific strings, like for filenames. In all
external API calls, the RTL is responsible for the correct encoding of
all string arguments, as expected by the called function. This applies
in particular to CP_ACP, when this encoding can be changed inside a
program to something different from the external (platform...) setting.
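
A minimal sketch of the distinction made in 3), assuming FPC, where
SetMultiByteConversionCodePage changes DefaultSystemCodePage for the
running program. PlatformCodePage is a hypothetical variable name,
introduced here only to keep the value found at startup.

```pascal
program KeepPlatformCP;
{$mode objfpc}{$H+}

var
  { Hypothetical name: the default code page found at program start. }
  PlatformCodePage: TSystemCodePage;

begin
  // Remember the startup value before anything in the program changes it.
  PlatformCodePage := DefaultSystemCodePage;

  // The program now selects its own default encoding.
  SetMultiByteConversionCodePage(CP_UTF8);

  WriteLn('at startup:        ', PlatformCodePage);
  WriteLn('application-wide:  ', DefaultSystemCodePage);
end.
```

After the switch, only the saved value still describes how
system-provided strings were encoded, which is the reason for asking
for two separate variables.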
4) A RawByteString variable, of the static encoding CP_NONE, can hold
strings of *any* dynamic encoding. No conversion is performed when a
string is assigned to such a variable. In the opposite direction the
standard handling should apply, i.e. different static encodings require
a conversion into the static target encoding.
Note: It is known that Delphi does not always convert a RawByteString
in an assignment to a variable of a different type. This flaw should be
fixed in FPC. Is the corresponding Delphi behaviour *defined* anywhere?
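
The first half of 4) can be observed with a small FPC program like the
following sketch. It only covers the assignment *to* a RawByteString;
the opposite direction is the disputed case and is deliberately left out.

```pascal
program RawByteDemo;
{$mode objfpc}{$H+}

var
  u8: UTF8String;
  raw: RawByteString;

begin
  u8 := 'abc';   // declared (static) encoding of u8: UTF-8
  raw := u8;     // static encoding CP_NONE: no conversion is requested

  WriteLn('dynamic code page of u8:  ', StringCodePage(u8));
  WriteLn('dynamic code page of raw: ', StringCodePage(raw));
  WriteLn('unchanged by assignment:  ',
          StringCodePage(raw) = StringCodePage(u8));
end.
```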
5) Use StringCodePage to get an actual (dynamic) string encoding.
StringCodePage never returns one of the generic values. The dynamic
codepage of an unassigned (empty) string is assumed (by Delphi) to be
the actually selected CP_ACP codepage for AnsiString arguments, and
CP_UTF16 (or whatever is applicable) for UnicodeString arguments.
Note: while an unassigned (empty) string variable has a static encoding,
known to the compiler, this encoding is unknown to StringCodePage. The
overloaded Ansi/Unicode versions of StringCodePage only know about the
basic string type (Ansi/Unicode) of their arguments, but cannot
determine a static encoding from the nonexistent string header. That's
why in this case they return the corresponding default encoding, as
assumed in default type declarations, where AnsiString becomes
AnsiString(CP_ACP).
Note: The Unicode overload is questionable since, contrary to its
name, it returns an *encoding*, not a *codepage*. It should return the
*native* (CPU specific BE/LE) UTF-16 encoding, used for strings declared
as UnicodeString.
[Currently I cannot check the applicable Delphi constants and behaviour
on non-Intel platforms]
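
The empty-string case in 5) can be checked with a sketch like the one
below, assuming FPC and its overloaded StringCodePage. The values printed
depend on the platform, so no particular output is claimed here.

```pascal
program EmptyStringCP;
{$mode objfpc}{$H+}

var
  a: AnsiString;
  u8: UTF8String;
  w: UnicodeString;

begin
  // All three variables are empty, so none of them has a string header.
  a := '';
  u8 := '';
  w := '';

  WriteLn('DefaultSystemCodePage: ', DefaultSystemCodePage);
  WriteLn('empty AnsiString:      ', StringCodePage(a));
  WriteLn('empty UTF8String:      ', StringCodePage(u8));
  WriteLn('empty UnicodeString:   ', StringCodePage(w));
end.
```

The interesting line is the empty UTF8String: its static encoding is
known to the compiler, but StringCodePage has no header to read it from.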
</Draft>
>> IMO the result is well defined: it's the string with the encoding of
>> that "other" codepage.
>
> Unless you actually tested this on all platforms and noted that is the
> case, you cannot state this. And if you would actually test it, you
> would discover that it is wrong
> (http://bugs.freepascal.org/view.php?id=22501#c61238 ).
In that discussion I found several errors which are neither detected by
the compiler nor handled in the RTL. That specific entry mentions the
illegal use of the *generic* CP_NONE identifier. That's why I felt a
need to address several specific topics in the draft above.
DoDi