[fpc-pascal] RTL and Unicode Strings

Wed May 11 11:48:10 CEST 2016

On Wed, 11 May 2016, Jonas Maebe wrote:

>
> Graeme Geldenhuys wrote on Wed, 11 May 2016:
>
>> In my application I enable unicodestring mode. So I'm reading data from
>> a Firebird database. The data is stored as UTF-8 in a VarChar field. The
>> DB connection is set up as UTF-8.  Now let's assume my FreeBSD box is set
>> up with a default encoding of Latin-1.
>> 
>> So I read the UTF-8 data from the database, somewhere inside the SqlDB
>> code it gets assigned to a TField's String property. ie: UTF-8 ->
>> Latin-1 conversion.
>
> This depends on how sqlDB is implemented, and I have absolutely no clue about 
> that (other than what LacaK wrote).
>
> As mentioned at 
> http://wiki.freepascal.org/FPC_Unicode_support#Dynamic_code_page , 
> conversions on assignment only happen when the *declared* code page of the 
> target string is different from that of the source string (other than the 
> special case for RawByteString). So if sqlDB only uses plain String with 
> {$h+} and/or AnsiString, then no conversions will happen anywhere in the 
> scenario you describe since it will just assign ansistrings with declared 
> code page CP_ACP to each other.

This is the case.

>
>> Then I read the field value into my application. ie: Latin-1 -> UTF-16
>
> If sqlDB correctly sets the dynamic codepage of the strings it creates via 
> SetCodePage(x,CP_UTF8,false), then when you assign those strings with 
> declared codepage = CP_ACP and dynamic code page CP_UTF8 to your 
> unicodestrings, they will be converted from UTF-8 to UTF-16 at that point.

It does not do this.

>
> If it does not set the dynamic code page of the strings it creates to the 
> appropriate encoding, then you will indeed get data corruption at this point, 
> because the UTF-8 encoded data will be interpreted as Latin-1 and then be 
> "converted" to UTF-16.

That is what happens.

Currently, the ONLY provision made is this: if SQLDB somehow detects that the
server uses UTF8, it uses an ansistring and allocates 4 bytes per character
in the buffers.

But it currently does not set the code page of the allocated string to UTF8.
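
As a minimal sketch of what setting the code page would involve (this is not
SQLDB code; FetchRawUtf8 is a hypothetical stand-in for a driver that hands
back untagged UTF-8 bytes), the string can be tagged with SetCodePage before
it is assigned to a unicodestring. The sketch assumes FPC 3.x and, on Unix,
that cwstring is installed as the widestring manager:

```pascal
program dyncp;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cwstring,{$endif}  // widestring manager for conversions on Unix
  SysUtils;

// Hypothetical stand-in for a database driver: returns the UTF-8 bytes
// of "é" ($C3 $A9) in a plain AnsiString, without tagging them as UTF-8.
function FetchRawUtf8: AnsiString;
begin
  SetLength(Result, 2);
  Result[1] := AnsiChar($C3);
  Result[2] := AnsiChar($A9);
end;

var
  S: AnsiString;
  U: UnicodeString;
begin
  S := FetchRawUtf8;
  // Tag the existing bytes as UTF-8; Convert=False leaves the bytes untouched.
  SetCodePage(RawByteString(S), CP_UTF8, False);
  WriteLn('dynamic code page: ', StringCodePage(S));
  // The assignment converts from the dynamic code page (UTF-8) to UTF-16.
  U := S;
  WriteLn('UTF-16 code units: ', Length(U));
end.
```

Removing the SetCodePage call and comparing the reported length shows whether
the two bytes were decoded as one character or treated as two separate
characters in the default code page.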

> For dealing with such code, which is not yet codepage-aware, by default the 
> situation is no worse or no better than it was in previous FPC versions: 
> exactly the same would happen there. However, in FPC 3.x you can generally 
> fix it by changing the default code page for ansistrings using 
> SetMultiByteConversionCodePage() to what you know/want to be the encoding of 
> ansistrings, like Lazarus does.

If Lazarus already calls SetMultiByteConversionCodePage, then setting it to
something else will wreak havoc.
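
To illustrate why this setting is global rather than per string, here is a
small sketch, assuming FPC 3.x: SetMultiByteConversionCodePage changes
DefaultSystemCodePage, which is what CP_ACP resolves to for every ansistring
in the program, not just the ones coming from the database.

```pascal
program defcp;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cwstring,{$endif}  // widestring manager for conversions on Unix
  SysUtils;

begin
  // Whatever the RTL picked up from the environment at startup.
  WriteLn('before: ', DefaultSystemCodePage);

  // Process-wide change: every CP_ACP ansistring is now assumed to be UTF-8.
  SetMultiByteConversionCodePage(CP_UTF8);
  WriteLn('after:  ', DefaultSystemCodePage);
end.
```

Any other code in the same process that relies on the previous default will
see the new value as well, which is the conflict described above.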

This matter must be decided at the TDataset level: it should have a property
to determine the character set of string fields (possibly a different one for
each field, since the character set can differ per field in the database).

>
> All of this is moreover completely independent of {$modeswitch 
> unicodestrings}, since that is just a shortcut to make String an alias for 
> UnicodeString in the current compilation module (and Char for WideChar, and 
> PChar for PWideChar).

Honestly, I don't understand this preoccupation with {$modeswitch unicodestrings}.

It just means that

Var
  a : string;

is read by the compiler as

Var
  a : unicodestring;

No more, no less.
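
A small sketch to check this for yourself (the program name and output labels
are just for illustration): with the modeswitch enabled, the element type of
String and the type Char should have the same size as WideChar.

```pascal
program msdemo;
{$mode objfpc}
{$modeswitch unicodestrings}

var
  a: string;  // read by the compiler as unicodestring in this unit

begin
  a := 'abc';
  // Compare the element size of "string" and of Char with WideChar.
  WriteLn('element is WideChar-sized: ', SizeOf(a[1]) = SizeOf(WideChar));
  WriteLn('Char is WideChar-sized:    ', SizeOf(Char) = SizeOf(WideChar));
end.
```

The modeswitch only affects the unit it appears in; strings declared in other
units (sqlDB, for example) keep whatever type those units were compiled with.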

Michael.