[fpc-devel] AnsiUpperCase problems
Hans-Peter Diettrich
DrDiettrich1 at aol.com
Thu Dec 4 15:29:06 CET 2014
The following console program demonstrates various problems with the new
(encoded) AnsiStrings (FPC trunk):
program litTest2;
{.$codepage UTF8} //off for now
uses Classes,SysUtils;
var A: AnsiString;
begin
a := 'äöü';
//a := a+' '; //uncomment later
WriteLn(a,'äöü');
WriteLn(AnsiUpperCase(a),AnsiUpperCase('äöü'));
end.
The output varies depending on (at least) the file encoding and target
platform (tested only on Windows, using Lazarus).
With an Ansi source file the last line shows as 'ÄÖÜÄÖÜ', as expected.
In the first line the variable also shows as 'äöü', but the literal
does not (it shows as 3 graphical characters). In all other (tested)
cases something different is shown, with no uppercase letters at all.
With a UTF-8 source file (with BOM) both the variable and literal show
as 'äöü', but unfortunately never in upper case.
Adding {$codepage UTF8} requires a UTF-8 source file. That's compatible
with Lazarus defaults, so further tests (here) will use this
combination. Please note that (currently) Lazarus sets or leaves
DefaultSystemCodePage according to the actual OS, i.e. 1252 for my
installation, regardless of $codepage.
Now all items are shown as 'äöü', but again never in uppercase. How come?
AnsiUpperCase finally calls Win32AnsiUpperCase (on Windows), declared as
function Win32AnsiUpperCase(const s: string): string;
which in turn calls CharUpperBuffA.
This explains why no uppercase conversion is performed when S has a
dynamic encoding different from the (WinAPI) CP_ACP that CharUpperBuffA
expects. Actually I found the *dynamic* encoding of A and S to be
CP_UTF8, even though their static encoding is CP_ACP (or 1252).
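One way to check the dynamic encoding is to print it next to the system
default. This is only a minimal sketch, assuming StringCodePage and
DefaultSystemCodePage from the System unit of FPC trunk; the program name
is made up, and the values printed depend on source file encoding,
$codepage setting and platform:

```pascal
program cpShow; // hypothetical name, for illustration only
{$mode objfpc}{$H+}
var
  A: AnsiString;
begin
  A := 'äöü';
  // StringCodePage reports the dynamic code page stored in the string header
  WriteLn('dynamic code page of A: ', StringCodePage(A));
  // the code page that the CP_ACP placeholder is translated to at run time
  WriteLn('DefaultSystemCodePage:  ', DefaultSystemCodePage);
end.
```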
Consequently AnsiUpperCase should convert S to the WinAPI CP_ACP
(GetACP) before passing it to CharUpperBuffA. The same applies to all
other functions with AnsiString arguments that call external (OS API...)
routines expecting a specific encoding, on all platforms. It also applies
to user code that relies on the encoding of all strings being the
declared one, as in:
str1[1]:=str2[1]; //both strings of same type
IMO such additional checks and conversions should be avoided; they bloat
the library code and consume runtime. Note that SetCodePage requires a
RawByteString (var parameter), and thus cannot be used directly to
adjust the dynamic codepage of an AnsiString.
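For reference, the usual way around the var-parameter restriction is a
RawByteString typecast at the call site. This is a sketch only, assuming
SetCodePage with its Convert parameter as in FPC trunk, and using
DefaultSystemCodePage as a stand-in for the code page the API expects (on
Windows, GetACP would be the precise value). The program name is made up.

```pascal
program cpFix; // hypothetical name, for illustration only
{$mode objfpc}{$H+}
var
  A: AnsiString;
begin
  A := 'äöü';
  if StringCodePage(A) <> DefaultSystemCodePage then
    // the typecast lets A be passed as the var RawByteString parameter;
    // Convert=True re-encodes the payload instead of only relabeling it
    SetCodePage(RawByteString(A), DefaultSystemCodePage, True);
  WriteLn('dynamic code page of A: ', StringCodePage(A));
end.
```

This is exactly the kind of per-call check and conversion that would
have to be repeated in every affected routine.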
Now let's add (uncomment) the line
a := a+' ';
and voilà, AnsiUpperCase works, because now the string has the expected
CP_ACP instead of UTF-8. The same effect occurs when A is assigned from
a UnicodeString variable.
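The change of encoding can be made visible by printing the dynamic code
page before and after the concatenation. Again this is only a sketch with
a made-up program name, assuming StringCodePage from the System unit; the
actual numbers depend on compiler version, source file encoding and
$codepage setting:

```pascal
program cpConcat; // hypothetical name, for illustration only
{$mode objfpc}{$H+}
var
  A: AnsiString;
begin
  A := 'äöü';
  WriteLn('before concatenation: ', StringCodePage(A));
  A := A + ' ';
  WriteLn('after concatenation:  ', StringCodePage(A));
end.
```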
Is it really intended that AnsiString behaviour depends on such details?
The simplest solution would be to disallow different static and dynamic
encodings for AnsiStrings, except for RawByteString. Then no additional
checks and conversions are required, except the one in the assignment of
a RawByteString to an AnsiString of different type, and everything else
can be determined by the compiler from the known static=dynamic encoding
of strings.
More checks and conversions can be avoided when the dynamic encoding of
string literals is the actual encoding used by the compiler for the
stored literal, not a Delphi-incompatible placeholder like CP_ACP. Then
TranslatePlaceholderCP is required only for explicitly given encoding
values, but no longer for the dynamic encoding of strings.
DoDi