[fpc-pascal] RTL and Unicode Strings
Jonas Maebe
jonas.maebe at elis.ugent.be
Wed May 11 11:37:33 CEST 2016
Graeme Geldenhuys wrote on Wed, 11 May 2016:
> In my application I enable unicodestring mode. So I'm reading data from
> a Firebird database. The data is stored as UTF-8 in a VarChar field. The
> DB connection is set up as UTF-8. Now let's assume my FreeBSD box is set
> up with a default encoding of Latin-1.
>
> So I read the UTF-8 data from the database, somewhere inside the SqlDB
> code it gets assigned to a TField's String property, i.e. UTF-8 ->
> Latin-1 conversion.
This depends on how sqlDB is implemented, and I have absolutely no
clue about that (other than what LacaK wrote).
As mentioned at
http://wiki.freepascal.org/FPC_Unicode_support#Dynamic_code_page ,
conversions on assignment only happen when the *declared* code page of
the target string is different from that of the source string (other
than the special case for RawByteString). So if sqlDB only uses plain
String with {$h+} and/or AnsiString, then no conversions will happen
anywhere in the scenario you describe since it will just assign
ansistrings with declared code page CP_ACP to each other.
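A minimal sketch of that rule, not taken from sqlDB: the helper MakeUtf8Tagged is hypothetical and only builds the two UTF-8 bytes of U+00E9 by hand. Both variables are plain AnsiString (declared code page CP_ACP), so the assignment should copy the reference without touching the bytes or the dynamic code page.

```pascal
program AssignNoConversion;
{$mode objfpc}{$h+}

// Hypothetical helper: returns the UTF-8 bytes of U+00E9 (C3 A9),
// with the dynamic code page of the string set to CP_UTF8.
function MakeUtf8Tagged: AnsiString;
begin
  SetLength(Result, 2);
  Result[1] := #$C3;
  Result[2] := #$A9;
  // Convert = False: only relabel the string, do not transcode the bytes.
  SetCodePage(RawByteString(Result), CP_UTF8, False);
end;

var
  Src, Dst: AnsiString;   // both have declared code page CP_ACP
begin
  Src := MakeUtf8Tagged;
  Dst := Src;             // same declared code page: no conversion
  WriteLn('dynamic code page of Dst: ', StringCodePage(Dst));
  WriteLn('length in bytes:          ', Length(Dst));
end.
```

The dynamic code page reported for Dst stays CP_UTF8 (65001) and the length stays 2 bytes, whatever the system's default code page is.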
> Then I read the field value into my application, i.e. Latin-1 -> UTF-16
If sqlDB correctly sets the dynamic code page of the strings it creates
via SetCodePage(x,CP_UTF8,false), then when you assign those strings
(declared code page CP_ACP, dynamic code page CP_UTF8) to your
unicodestrings, they will be converted from UTF-8 to UTF-16 at that
point.
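A sketch of this case, under assumptions: ReadFieldTagged is a hypothetical stand-in for whatever sqlDB does internally, and cwstring is pulled in on Unix so that a widestring manager capable of the conversion is installed.

```pascal
program Utf8FieldToUnicode;
{$mode objfpc}{$h+}
{$ifdef unix}
uses
  cwstring;   // assumption: needed on Unix for non-ASCII conversions
{$endif}

// Hypothetical stand-in for database code that labels its result correctly.
function ReadFieldTagged: AnsiString;
begin
  SetLength(Result, 2);
  Result[1] := #$C3;      // UTF-8 encoding of U+00E9
  Result[2] := #$A9;
  SetCodePage(RawByteString(Result), CP_UTF8, False);
end;

var
  U: UnicodeString;
begin
  // Declared code pages differ (CP_ACP vs. UTF-16), so a conversion happens,
  // and it uses the dynamic code page of the source string (CP_UTF8).
  U := ReadFieldTagged;
  WriteLn('UTF-16 code units: ', Length(U));
  WriteLn('first code unit:   $', HexStr(Ord(U[1]), 4));
end.
```

With a working widestring manager, the two UTF-8 bytes should end up as the single UTF-16 code unit $00E9.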
If it does not set the dynamic code page of the strings it creates to
the appropriate encoding, then you will indeed get data corruption at
this point, because the UTF-8 encoded data will be interpreted as
Latin-1 and then be "converted" to UTF-16.
For code like that, which is not yet code page-aware, the default
situation is neither worse nor better than it was in previous FPC
versions: exactly the same would happen there. However, in FPC 3.x
you can generally fix it by changing the default code page for
ansistrings with SetMultiByteConversionCodePage() to the encoding you
know (or want) ansistrings to have, as Lazarus does.
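A sketch of both the problem and the workaround, under assumptions: ReadFieldUnaware is a hypothetical stand-in for library code that never calls SetCodePage, and the Latin-1 system is simulated by setting the default code page to 28591 (ISO-8859-1) rather than by changing the locale. In a real program the call to SetMultiByteConversionCodePage would normally be made once at startup, before any strings are created.

```pascal
program DefaultCodePageFix;
{$mode objfpc}{$h+}
{$ifdef unix}
uses
  cwstring;   // assumption: needed on Unix for non-ASCII conversions
{$endif}

// Hypothetical stand-in for code that is not code page-aware: it returns
// UTF-8 bytes but leaves the string labelled with the default code page.
function ReadFieldUnaware: AnsiString;
begin
  SetLength(Result, 2);
  Result[1] := #$C3;      // UTF-8 encoding of U+00E9
  Result[2] := #$A9;
end;

// Number of UTF-16 code units obtained after the implicit conversion.
function DecodedLength: Integer;
var
  U: UnicodeString;
begin
  U := ReadFieldUnaware;
  Result := Length(U);
end;

begin
  // Simulated Latin-1 default: each byte is treated as one character.
  SetMultiByteConversionCodePage(28591);
  WriteLn('default Latin-1: ', DecodedLength, ' code units');

  // Declare that plain ansistrings hold UTF-8 from now on.
  SetMultiByteConversionCodePage(CP_UTF8);
  WriteLn('default UTF-8:   ', DecodedLength, ' code units');
end.
```

With Latin-1 as the default, the two bytes should come out as two separate UTF-16 code units (the corrupted form); with UTF-8 as the default, as one.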
All of this is moreover completely independent of {$modeswitch
unicodestrings}, since that is just a shortcut to make String an alias
for UnicodeString in the current compilation module (and Char for
WideChar, and PChar for PWideChar).
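A small sketch of what the modeswitch changes in one compilation unit; the program name is arbitrary. AnsiString and AnsiChar are unaffected, which is why the code page rules above apply regardless of this setting.

```pascal
program ModeSwitchDemo;
{$mode objfpc}{$h+}
{$modeswitch unicodestrings}

var
  S: String;       // alias for UnicodeString in this unit
  A: AnsiString;   // still a single-byte string type
begin
  S := 'abc';
  A := 'abc';
  WriteLn('SizeOf(Char)     = ', SizeOf(Char));      // Char is WideChar here
  WriteLn('SizeOf(AnsiChar) = ', SizeOf(AnsiChar));
  WriteLn('SizeOf(S[1])     = ', SizeOf(S[1]));
  WriteLn('SizeOf(A[1])     = ', SizeOf(A[1]));
end.
```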
Jonas