[fpc-devel] AnsiUpperCase problems

Thu Dec 4 15:29:06 CET 2014

The following console program demonstrates various problems with the new 
(encoded) AnsiStrings (FPC trunk):

program litTest2;
{.$codepage UTF8} //off for now
uses Classes,SysUtils;
var A: AnsiString;
begin
   a := 'äöü';
   //a := a+' '; //uncomment later
   WriteLn(a,'äöü');
   WriteLn(AnsiUpperCase(a),AnsiUpperCase('äöü'));
end.

The output varies depending on (at least) the file encoding and target 
platform (tested only on Windows, using Lazarus).

With an Ansi source file the last line shows as 'ÄÖÜÄÖÜ', as expected.
In the first line the variable also shows as 'äöü', but the literal
does not (it shows as 3 graphical characters). In all other (tested)
cases something different is shown, with no uppercase letters at all.

With a UTF-8 source file (with BOM) both the variable and the literal
show as 'äöü', but unfortunately never in upper case.

Adding {$codepage UTF8} requires a UTF-8 source file. That is compatible
with the Lazarus defaults, so the further tests here use this
combination. Please note that (currently) Lazarus sets or leaves
DefaultSystemCodePage according to the actual OS, i.e. 1252 for my
installation, regardless of $codepage.

Now all items are shown as 'äöü', but again never in uppercase. How so?

AnsiUpperCase ultimately calls Win32AnsiUpperCase (on Windows), declared as
   function Win32AnsiUpperCase(const s: string): string;
which in turn calls CharUpperBuffA.
This explains why no uppercase conversion is performed when S has a
dynamic encoding different from the (WinAPI) CP_ACP that CharUpperBuffA
expects. In fact I found the *dynamic* encoding of A and S to be
CP_UTF8, even though their static encoding is CP_ACP (or 1252).
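The mismatch can be made visible with StringCodePage, which reports the
dynamic code page stored in the string header. The following is only a
minimal sketch, assuming a compiler with encoded AnsiStrings and a UTF-8
source file; the program name is made up, and the numbers printed depend
on the platform and settings.

```pascal
program ShowDynamicCP;
{$mode objfpc}{$H+}
{$codepage UTF8}   // assumption: this file is saved as UTF-8
var
  a: AnsiString;   // static encoding: CP_ACP
begin
  a := 'äöü';
  // Code page the RTL substitutes for the CP_ACP placeholder.
  WriteLn('DefaultSystemCodePage:  ', DefaultSystemCodePage);
  // Dynamic code page actually attached to the string data.
  WriteLn('dynamic code page of a: ', StringCodePage(a));
end.
```

If the two numbers differ, any routine that hands the raw bytes of A to
an API expecting CP_ACP operates on the wrong encoding.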

Consequently AnsiUpperCase should convert S to the WinAPI CP_ACP
(GetACP) before passing it to CharUpperBuffA. The same applies to all
other functions with AnsiString arguments that call external (OS
API...) routines expecting a specific encoding, on all platforms, and
to user code that relies on the encoding of all strings being the
declared one, as in:
   str1[1]:=str2[1]; //both strings of same type

IMO such additional checks and conversions should be avoided: they bloat
the library code and consume runtime. Note that SetCodePage requires a
RawByteString (var parameter), and thus cannot be applied directly to an
AnsiString to adjust its dynamic codepage.
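To illustrate what such a check-and-convert step would look like in
every affected routine, here is a rough sketch. WithCodePage is a
hypothetical helper, not part of the RTL, and using
DefaultSystemCodePage as the target is an assumption (on Windows one
might pass the result of GetACP instead). The typecast works around the
var RawByteString parameter of SetCodePage.

```pascal
program EnsureCodePageDemo;
{$mode objfpc}{$H+}
{$codepage UTF8}   // assumption: this file is saved as UTF-8

{ Hypothetical helper: returns a copy of S whose dynamic code page is
  TargetCP, converting the string data when it is not already so. }
function WithCodePage(const S: AnsiString;
  TargetCP: TSystemCodePage): AnsiString;
begin
  Result := S;
  if StringCodePage(Result) <> TargetCP then
    // The cast is required: SetCodePage takes a var RawByteString.
    SetCodePage(RawByteString(Result), TargetCP, True);
end;

var
  a, b: AnsiString;
begin
  a := 'äöü';
  b := WithCodePage(a, DefaultSystemCodePage);
  WriteLn('before: ', StringCodePage(a), '  after: ', StringCodePage(b));
end.
```

Every call of this kind costs a comparison and possibly a full
conversion, which is exactly the overhead argued against above.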

Now let's add (uncomment) the line
   a := a+' ';
and voilà, AnsiUpperCase works, because now the string has the expected
CP_ACP instead of UTF-8. The same effect occurs when A is assigned from
a UnicodeString variable.
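The change of encoding can be observed directly. This is a small sketch
under the same assumptions as the test program (UTF-8 source file,
{$codepage UTF8}); the program name is made up and the values printed
depend on the platform and settings.

```pascal
program ConcatCodePage;
{$mode objfpc}{$H+}
{$codepage UTF8}   // assumption: this file is saved as UTF-8
var
  a: AnsiString;
  u: UnicodeString;
begin
  a := 'äöü';
  WriteLn('literal assigned:    ', StringCodePage(a));
  a := a + ' ';    // the concatenation result gets a new dynamic code page
  WriteLn('after concatenation: ', StringCodePage(a));
  u := 'äöü';
  a := u;          // implicit UnicodeString -> AnsiString conversion
  WriteLn('from UnicodeString:  ', StringCodePage(a));
end.
```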

Is it really intended that AnsiString behaviour depends on such details?

The simplest solution would be to disallow differing static and dynamic
encodings of AnsiStrings, except for RawByteString. Then no additional
checks and conversions are required, except the one in the assignment of
a RawByteString to an AnsiString of a different type, and everything
else can be determined by the compiler from the known static=dynamic
encoding of strings.

More checks and conversions can be avoided when the dynamic encoding of
string literals is the actual encoding used by the compiler for the
stored literal, not a Delphi-incompatible placeholder like CP_ACP. Then
TranslatePlaceholderCP is required only for explicitly given encoding
values, but no longer for the dynamic encoding of strings.

DoDi