[fpc-pascal] RTL and Unicode Strings

Wed May 11 11:37:33 CEST 2016

Graeme Geldenhuys wrote on Wed, 11 May 2016:

> In my application I enable unicodestring mode. So I'm reading data from
> a Firebird database. The data is stored as UTF-8 in a VarChar field. The
> DB connection is set up as UTF-8. Now let's assume my FreeBSD box is set
> up with a default encoding of Latin-1.
>
> So I read the UTF-8 data from the database, somewhere inside the SqlDB
> code it gets assigned to a TField's String property. ie: UTF-8 ->
> Latin-1 conversion.

This depends on how sqlDB is implemented, and I have absolutely no  
clue about that (other than what LacaK wrote).

As mentioned at  
http://wiki.freepascal.org/FPC_Unicode_support#Dynamic_code_page ,  
conversions on assignment only happen when the *declared* code page of  
the target string is different from that of the source string (other  
than the special case for RawByteString). So if sqlDB only uses plain  
String with {$h+} and/or AnsiString, then no conversions will happen  
anywhere in the scenario you describe since it will just assign  
ansistrings with declared code page CP_ACP to each other.
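
To make the declared vs. dynamic code page distinction concrete, here
is a minimal, untested sketch. The program and variable names are made
up for illustration and have nothing to do with sqlDB's actual code.

```pascal
program DeclaredCodePage;
{$mode objfpc}{$h+}

var
  Raw: RawByteString;
  A, B: AnsiString;   // declared code page: CP_ACP
begin
  // Build the two UTF-8 bytes of U+00E9 by hand, so that no string
  // constant conversion by the compiler is involved.
  SetLength(Raw, 2);
  Raw[1] := #$C3;
  Raw[2] := #$A9;
  // Tag the bytes as UTF-8 without converting them.
  SetCodePage(Raw, CP_UTF8, False);

  A := Raw;  // RawByteString special case: no conversion
  B := A;    // same declared code page on both sides: no conversion

  WriteLn('length in bytes:   ', Length(B));
  WriteLn('dynamic code page: ', StringCodePage(B));
end.
```

I'd expect B to still hold the same two bytes and to still report
CP_UTF8 as its dynamic code page, since neither assignment involves two
different declared code pages.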

> Then I read the field value into my application. ie: Latin-1 -> UTF-16

If sqlDB correctly sets the dynamic code page of the strings it creates
via SetCodePage(x,CP_UTF8,false), then those strings have declared code
page CP_ACP and dynamic code page CP_UTF8, and they will be converted
from UTF-8 to UTF-16 at the point where you assign them to your
unicodestrings.

If it does not set the dynamic code page of the strings it creates to  
the appropriate encoding, then you will indeed get data corruption at  
this point, because the UTF-8 encoded data will be interpreted as  
Latin-1 and then be "converted" to UTF-16.
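
The two cases can be sketched as follows. FetchUtf8Bytes is a
hypothetical stand-in for whatever a database layer does internally; it
is not sqlDB's API, and I have not checked this against sqlDB. The
cwstring unit is included on Unix on the assumption that a widestring
manager is wanted for the conversions.

```pascal
program TaggedVsUntagged;
{$mode objfpc}{$h+}

{$ifdef unix}
uses
  cwstring;  // installs a widestring manager on Unix-like systems
{$endif}

// Hypothetical helper: returns the UTF-8 bytes of U+00E9 in a plain
// AnsiString, optionally tagging them with the UTF-8 code page.
function FetchUtf8Bytes(Tag: Boolean): AnsiString;
begin
  SetLength(Result, 2);
  Result[1] := #$C3;
  Result[2] := #$A9;
  if Tag then
    // Change only the dynamic code page; the bytes stay as they are.
    SetCodePage(RawByteString(Result), CP_UTF8, False);
end;

var
  U: UnicodeString;
begin
  // Tagged: converted from UTF-8 to UTF-16 on assignment.
  U := FetchUtf8Bytes(True);
  WriteLn('tagged:   ', Length(U), ' UTF-16 code unit(s)');

  // Untagged: the bytes are interpreted in the default system code
  // page, whatever that happens to be on this machine.
  U := FetchUtf8Bytes(False);
  WriteLn('untagged: ', Length(U), ' UTF-16 code unit(s)');
end.
```

On a system whose default code page is a single-byte one such as
Latin-1, the untagged case should come out as two code units instead of
one, which is the corruption described above. On a system whose default
code page is already UTF-8, both cases should give the same result.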

For code like this that is not yet code page aware, the default
situation is neither worse nor better than in previous FPC versions:
exactly the same would happen there. In FPC 3.x, however, you can
generally fix it by using SetMultiByteConversionCodePage() to change the
default code page for ansistrings to what you know/want their encoding
to be, like Lazarus does.
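
A minimal, untested sketch of that approach, assuming the untagged
ansistrings in the program really do contain UTF-8. Presumably the call
is best made at startup, before any strings are created.

```pascal
program DefaultCodePage;
{$mode objfpc}{$h+}

{$ifdef unix}
uses
  cwstring;  // installs a widestring manager on Unix-like systems
{$endif}

var
  S: AnsiString;
  U: UnicodeString;
begin
  // Declare that ansistrings with code page CP_ACP hold UTF-8 data.
  SetMultiByteConversionCodePage(CP_UTF8);

  // UTF-8 bytes of U+00E9, never explicitly tagged with a code page.
  SetLength(S, 2);
  S[1] := #$C3;
  S[2] := #$A9;

  // The conversion now treats the bytes as UTF-8 rather than as the
  // encoding of the system locale.
  U := S;
  WriteLn('UTF-16 code units: ', Length(U));
end.
```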

All of this is moreover completely independent of {$modeswitch  
unicodestrings}, since that is just a shortcut to make String an alias  
for UnicodeString in the current compilation module (and Char for  
WideChar, and PChar for PWideChar).
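
A small sketch of what the modeswitch changes in one compilation module
(the program name is made up):

```pascal
program ModeSwitchDemo;
{$mode objfpc}{$h+}
{$modeswitch unicodestrings}

var
  S: String;   // alias for UnicodeString in this module
  C: Char;     // alias for WideChar in this module
  P: PChar;    // alias for PWideChar in this module
begin
  S := 'abc';
  C := S[1];
  P := PChar(S);
  WriteLn('SizeOf(Char):      ', SizeOf(C));
  WriteLn('SizeOf(S[1]):      ', SizeOf(S[1]));
  WriteLn('SizeOf(P^):        ', SizeOf(P^));
  WriteLn('same size as WideChar: ', SizeOf(Char) = SizeOf(WideChar));
end.
```

Other units compiled without the modeswitch, such as those of the RTL
or sqlDB, keep their own meaning of String, so the conversion rules
above apply at the boundary between them and your code.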

Jonas