[fpc-pascal] RTL and Unicode Strings
Michael Van Canneyt
michael at freepascal.org
Wed May 11 11:48:10 CEST 2016
On Wed, 11 May 2016, Jonas Maebe wrote:
>
> Graeme Geldenhuys wrote on Wed, 11 May 2016:
>
>> In my application I enable unicodestring mode. So I'm reading data from
>> a Firebird database. The data is stored as UTF-8 in a VarChar field. The
>> DB connection is set up as UTF-8. Now let's assume my FreeBSD box is set
>> up with a default encoding of Latin-1.
>>
>> So I read the UTF-8 data from the database, somewhere inside the SqlDB
>> code it gets assigned to a TField's String property. ie: UTF-8 ->
>> Latin-1 conversion.
>
> This depends on how sqlDB is implemented, and I have absolutely no clue about
> that (other than what LacaK wrote).
>
> As mentioned at
> http://wiki.freepascal.org/FPC_Unicode_support#Dynamic_code_page ,
> conversions on assignment only happen when the *declared* code page of the
> target string is different from that of the source string (other than the
> special case for RawByteString). So if sqlDB only uses plain String with
> {$h+} and/or AnsiString, then no conversions will happen anywhere in the
> scenario you describe since it will just assign ansistrings with declared
> code page CP_ACP to each other.

This is the case: sqlDB only uses plain ansistrings.
>
>> Then I read the field value into my application. ie: Latin-1 -> UTF-16
>
> If sqlDB correctly sets the dynamic codepage of the strings it creates via
> SetCodePage(x,CP_UTF8,false), then when you assign those strings with
> declared codepage = CP_ACP and dynamic code page CP_UTF8 to your
> unicodestrings, they will be converted from UTF-8 to UTF-16 at that point.

sqlDB does not do this.
>
> If it does not set the dynamic code page of the strings it creates to the
> appropriate encoding, then you will indeed get data corruption at this point,
> because the UTF-8 encoded data will be interpreted as Latin-1 and then be
> "converted" to UTF-16.

That is what happens.

Currently, the ONLY provision made is that, if SQLDB somehow detects that
the server uses UTF8, it uses an ansistring and allocates 4 bytes in the
buffers for each character.
But it currently does not set the code page of the allocated string to UTF8.

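As a rough illustration of the missing step (a sketch only, not sqlDB's
actual code; the program name, the variables and the two sample bytes are
made up for this example): tagging an ansistring that holds UTF-8 bytes with
SetCodePage(...,CP_UTF8,false) changes how a later assignment to a
unicodestring interprets those bytes.

```pascal
program tagutf8;
{$mode objfpc}{$H+}
{$ifdef unix}
uses
  cwstring; // installs a widestring manager so code page conversions work on Unix
{$endif}

var
  s: AnsiString;      // declared code page CP_ACP, like a plain string field value
  u: UnicodeString;
begin
  // Hypothetical field content: the two UTF-8 bytes of U+00E9 (e with acute).
  SetLength(s, 2);
  s[1] := AnsiChar($C3);
  s[2] := AnsiChar($A9);

  // Untagged: the bytes are interpreted using the default system code page.
  u := s;
  WriteLn('untagged: dynamic code page ', StringCodePage(s),
          ', UTF-16 length ', Length(u));

  // Tag the bytes as UTF-8 without converting them (Convert = False).
  SetCodePage(RawByteString(s), CP_UTF8, False);
  u := s;
  WriteLn('tagged:   dynamic code page ', StringCodePage(s),
          ', UTF-16 length ', Length(u));
end.
```

On a box whose default code page is Latin-1, one would expect the untagged
assignment to yield two UTF-16 code units and the tagged one a single code
unit. On a system that already defaults to UTF-8, both should presumably give
the same result.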
> For dealing with such code, which is not yet codepage-aware, by default the
> situation is no worse or no better than it was in previous FPC versions:
> exactly the same would happen there. However, in FPC 3.x you can generally
> fix it by changing the default code page for ansistrings using
> SetMultiByteConversionCodePage() to what you know/want to be the encoding of
> ansistrings, like Lazarus does.

If Lazarus already calls SetMultiByteConversionCodePage(), then setting it
to something else will wreak havoc.

This matter must be decided at the TDataset level: it should have a property
to determine the character set of string fields (possibly a different one
for each field, since this can differ in the database on a field basis).
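
To make the concern about SetMultiByteConversionCodePage() above concrete,
here is a small sketch (the program and variable names are invented for
illustration). It assumes nothing beyond the system unit: the call changes
DefaultSystemCodePage, which is one global setting for the whole program, so
a framework and application code cannot each pick their own value.

```pascal
program globalcp;
{$mode objfpc}{$H+}
{$ifdef unix}
uses
  cwstring; // widestring manager for Unix targets
{$endif}

var
  saved: TSystemCodePage;
begin
  // Remember whatever the program (or a framework) selected earlier.
  saved := DefaultSystemCodePage;
  WriteLn('default code page before the call: ', saved);

  // This affects every CP_ACP ansistring in the program,
  // not only the strings coming from the database layer.
  SetMultiByteConversionCodePage(CP_UTF8);
  WriteLn('default code page after the call:  ', DefaultSystemCodePage);

  // Put the previous setting back so later code sees what it expects.
  SetMultiByteConversionCodePage(saved);
end.
```

Because the setting is global, a per-dataset or per-field character set, as
proposed above, would avoid fighting over a single program-wide value.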
>
> All of this is moreover completely independent of {$modeswitch
> unicodestrings}, since that is just a shortcut to make String an alias for
> UnicodeString in the current compilation module (and Char for WideChar, and
> PChar for PWideChar).

Honestly, I don't understand this preoccupation with {$modeswitch unicodestrings}.
It just means that

  Var
    a : string;

is read by the compiler as

  Var
    a : unicodestring;

No more, no less.

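A complete program along these lines might look as follows (a minimal
sketch; the program name and the second variable are only there for the
example). It assumes {$H+} is active, so that String is a reference-counted
string type to begin with.

```pascal
program aliasdemo;
{$mode objfpc}{$H+}
{$modeswitch unicodestrings}

var
  a: string;          // read by the compiler as UnicodeString in this module
  b: UnicodeString;
begin
  b := 'abc';
  a := b;             // same type on both sides, so no conversion is involved
  WriteLn('SizeOf(Char) = ', SizeOf(Char)); // Char is an alias for WideChar here
  WriteLn('Length(a)    = ', Length(a));
end.
```

The modeswitch only changes what String, Char and PChar mean in the module
that contains it. Strings declared in other units, such as those in sqlDB,
keep the types they were compiled with.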
Michael.