[fpc-pascal] RTL and Unicode Strings

Wed May 11 11:43:20 CEST 2016

In our previous episode, Graeme Geldenhuys said:
> > In other cases, like LacaK said, you will have to read the data as plain 
> > bytes into e.g. a RawByteString and next use 
> > http://www.freepascal.org/docs-html/rtl/system/setcodepage.html (with 
> > the last parameter set to "false") to afterwards specify the code page 
> > this data has.
> 
> But this is where I'm getting a bit confused too.
> 
> The RTL and FCL uses String data type predominantly.
>   eg: TField.AsString: String.

String is not a type but an alias; that is key. So any declaration means
whatever string was defined as when that unit was compiled (shortstring,
ansistring or unicodestring).
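
A minimal sketch of that point, assuming FPC 3.x. The program name and the
variable are made up for illustration; only the directives matter:

```pascal
program StringAlias;
{$mode objfpc}{$H+}
// With {$H+} in objfpc mode, "String" is an alias for AnsiString.
// With {$H-} it would be ShortString, and with
// {$modeswitch unicodestrings} (or {$mode delphiunicode}) it would be
// UnicodeString. The source text "S: String" stays the same in all cases.
var
  S: String;
begin
  S := 'abc';
  // One code unit is 1 byte for ShortString/AnsiString, 2 for UnicodeString.
  WriteLn('Bytes per code unit: ', SizeOf(S[1]));
end.
```

Changing only the directives at the top and recompiling changes what the
same declaration means, which is why a unit such as db.pas keeps whatever
String meant when it was built.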

> The RTL and FCL uses String (AnsiString) with default encoding set to Auto.

To the default encoding, which is the only one that can vary at runtime, and
which serves as the base type.  So in Orwellian speak, ansistring(0) is more
equal than the other ansistring()'s.

> In my application I enable unicodestring mode. So I'm reading data from
> a Firebird database. The data is stored as UTF-8 in a VarChar field. The
> DB connection is set up as UTF-8.  Now lets assume my FreeBSD box is set
> up with a default encoding of Latin-1.
> 
> So I read the UTF-8 data from the database, somewhere inside the SqlDB
> code it gets assigned to a TField's String property. ie: UTF-8 ->
> Latin-1 conversion.

Then it is basically the same as in 2.6.x and old Delphi. You are on your
own: you must handle conversions yourself and be careful not to mutilate
your utf8 content.
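
As a rough sketch of "not mutilating" the content, this is the SetCodePage
approach quoted at the top, assuming FPC 3.x. The payload is a made-up
two-byte example, not anything SqlDB actually returns:

```pascal
program RetagBytes;
{$mode objfpc}{$H+}
var
  Raw: RawByteString;
begin
  // Hypothetical payload: the UTF-8 bytes of "é", read as plain bytes.
  Raw := #$C3#$A9;
  WriteLn('Tag before: ', StringCodePage(Raw));
  // Convert = False: only the code page tag changes, the bytes are kept.
  SetCodePage(Raw, CP_UTF8, False);
  WriteLn('Tag after:  ', StringCodePage(Raw), ', bytes: ', Length(Raw));
end.
```

With the last parameter left at its default (True) the RTL would instead
convert the bytes from the old code page to the new one, which is exactly
what you do not want when the tag was wrong to begin with.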

> Then I read the field value into my application. ie: Latin-1 -> UTF-16

Yes, you must also handle that conversion manually (either by moving the
character data to a utf8 typed string and then assigning, or by a manual
encoding routine that basically takes an address and disregards the codepage
info).
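
Both options could look roughly like this, assuming FPC 3.x. The variable
Field is only a stand-in for what a TField.AsString call might hand back,
and cwstring is pulled in on Unix as the usual way to get a widestring
manager:

```pascal
program ManualUtf8;
{$mode objfpc}{$H+}
{$ifdef unix}
uses cwstring;  // installs a widestring manager for code page conversions
{$endif}

var
  Field: AnsiString;   // hypothetical stand-in for a TField.AsString result
  U8: UTF8String;
  ViaTyped, ViaDecode: UnicodeString;
begin
  // Made-up payload: UTF-8 bytes of "é" in a string not tagged as UTF-8.
  Field := #$C3#$A9;

  // Option 1: move the bytes into a utf8 typed string, then assign.
  SetLength(U8, Length(Field));
  Move(Field[1], U8[1], Length(Field));
  ViaTyped := U8;      // the RTL converts UTF-8 -> UTF-16 here

  // Option 2: decode straight from the bytes, ignoring the code page tag.
  ViaDecode := UTF8Decode(Field);

  WriteLn(Length(ViaTyped), ' ', Length(ViaDecode));
end.
```

Option 1 uses Move on purpose: a plain assignment from Field to U8 would
let the RTL convert between the two code pages, while Move copies the bytes
untouched. Real code would also have to guard against an empty Field before
indexing Field[1].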

> The problem as I see it, is that I already lost data when SqlDB
> converted it to Latin-1. Am I not understanding the problem?

It depends. Sqldb assigned non-ansistring data to an ansistring. In the old
(2.6.4, old Delphi) logic it would simply move the bytes without conversion,
and you would obtain an ansistring with utf8 in it and be converting forever.

Nothing changed there, except your expectations :-)

> I checked the FPC 3.x db.pas unit. It uses {$mode objfpc}{$H+} - it
> doesn't use UnicodeString and neither does it use RawByteString. So a
> text encoding conversion to AnsiString(latin-1) [based on my example] is
> going to happen, right?

Yes. As said many times before, the parts above the RTL level have been kept
working, but not changed.

So basically the only viable cases are the utf16 D2009+ model (for Windows,
but it works elsewhere too) and utf8 as the default (which needs to be hacked
for systems that don't default to utf8 as their one-byte encoding).

Both have advantages and disadvantages (and the utf8 ones are not as big as
many people think; they confuse utf8's dominance as a document encoding with
its role in APIs).

But in the end the choice is simple IMHO. One is delphi compatible, one not.
Period.