[fpc-devel] TStringField, String and UnicodeString and UTF8String

Thu Jan 13 14:58:30 CET 2011

On Thu, 2011-01-13 at 09:15 +0100, LacaK wrote:
> 
> > Didn't I explain this to you and others a few times?
> >   
> ;-) If so, then please excuse me
> 
> > The database-components itself are encoding-agnostic. This means:
> > encoding in = encoding out.
> > 
> > So it is up to the developer what codepage he want to use. So
> > TField.Text can have the encoding _you_ want.
> > 
> > So, if you want to work with Lazarus, which uses UTF-8, you have to use
> > UTF-8 encoded strings in your database. 
> >   
> So this is answer, which i have looked for:
> "In Lazarus TStringField MUST hold UTF-8 encoded strings."

Not entirely true. You could also choose to bind the fields to some
Lazarus-components manually, not using the db-components (TEdit.Text :=
convertFunc(StringField.Text)). Or you can add a hook so that the .Text
property always does a conversion to UTF-8. The first option can be used
if you use a mediator or view. The second option I wouldn't use.
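
As a rough illustration of the manual-binding option, here is a minimal
sketch in Python (used only to show the byte-level effect, not FPC code).
The name convert_func mirrors the placeholder convertFunc above and is
hypothetical, and Windows-1252 as the source encoding is an assumption
made for the example.

```python
def convert_func(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from source_encoding to UTF-8 (hypothetical helper)."""
    return raw.decode(source_encoding).encode("utf-8")

# "é" as delivered by a connection that uses Windows-1252: a single byte.
field_bytes = b"\xe9"

# Encoding in = encoding out: passed through unchanged, this byte is not
# valid UTF-8, so a UTF-8 based control cannot display it correctly.
try:
    field_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Converting while binding yields the two-byte UTF-8 form of "é".
utf8_bytes = convert_func(field_bytes, "cp1252")

print(valid_utf8)   # False
print(utf8_bytes)   # b'\xc3\xa9'
```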

> But I guess (I have theory), that in time, when Borland introduced
> TStringField, the design goal was:
> TStringField was designed for SBCS (because DataSize=Size+1) string
> data encoded in system ANSI code page and TWideStringField was
> designed for DBCS widestring (UTF-16) character data

You have to be really careful in what you type, when you are writing
about encodings. The above is nonsense, because of a very tiny mistake.

If you compare DBCS widestring with UTF-16, you can also compare a
stringfield with UTF-8. Exactly the same problem. (A character can be
made up of more than one UTF-8 or UTF-16 code unit.)
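
To make the point about variable-length encodings concrete, the following
Python sketch counts code units per character (one code unit is 1 byte in
UTF-8 and 2 bytes in UTF-16). The helper name code_units and the sample
characters are chosen for illustration only.

```python
def code_units(ch: str) -> tuple:
    """Return (UTF-8 code units, UTF-16 code units) for one character."""
    utf8_units = len(ch.encode("utf-8"))             # 1 code unit = 1 byte
    utf16_units = len(ch.encode("utf-16-le")) // 2   # 1 code unit = 2 bytes
    return utf8_units, utf16_units

# U+0061, U+010D, U+20AC and U+1F600 (the last lies outside the BMP).
samples = ["\u0061", "\u010d", "\u20ac", "\U0001F600"]
for ch in samples:
    print("U+%04X" % ord(ch), code_units(ch))
```

The last sample needs four UTF-8 code units and two UTF-16 code units (a
surrogate pair), so neither encoding is fixed-width.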

But TStringField's datasize by default is indeed Size+1. So if you use
it to store UTF-8, you have to define the size as four times the
field-size given by the database. Note that this is done in some cases.
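
The arithmetic behind the "four times" rule can be sketched as follows
(Python, illustration only). The function names are hypothetical; the
sketch assumes the database reports the field size in characters and that
the extra byte in Size+1 is a terminator.

```python
MAX_UTF8_BYTES_PER_CHAR = 4  # upper bound for any Unicode code point in UTF-8

def utf8_field_size(db_chars: int) -> int:
    """Size to declare so that db_chars characters always fit as UTF-8."""
    return db_chars * MAX_UTF8_BYTES_PER_CHAR

def data_size(field_size: int) -> int:
    """Buffer size for the field: Size + 1."""
    return field_size + 1

def fits(text: str, db_chars: int) -> bool:
    """True if text, encoded as UTF-8, fits in the enlarged field."""
    return len(text.encode("utf-8")) <= utf8_field_size(db_chars)

size = utf8_field_size(10)        # a 10-character column
print(size, data_size(size))      # 40 41
print(fits("\U0001F600" * 10, 10))  # True: 10 characters, 40 bytes
```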

> May be, that I was mistaken by this view.
> (or may be, that there is different approach in Delphi ("no agnostic")
> and different in FPC ("agnostic")?)

No, Delphi does the same. Only newer Delphi versions have a string-type
which contains the used encoding (details can be found in this thread),
so they can do some conversions for you. But that has nothing to do with
the database-code. Also, you don't need it. People all over the world
have used older Delphi versions all the time... (But of course, it's
easier now.)

> > If there is some strange reason why you don't want the strings in your
> > database to be UTF-8 encoded,
> SQL Server does not support UTF-8 (AFAIK)

Rofl. You mean that Microsoft SQL Server can't handle Unicode
completely? If they say that in an advertisement they can forget that
any big commercial client will choose their product...

> SQL Server provides non-UNICODE datatypes - char, varchar, text 

ie: TStringField

>  and UNICODE (UCS-2) datatypes - nchar, nvarchar, ntext

ie: TWideStringField.

What does this have to do with your problem? Nothing. The only thing
that matters is what encoding is used while communicating with the
client. (Which you can set.)

> > you have to convert the strings from the
> > encoding your database uses to UTF-8 while reading data from the
> > database.
> > 
> > Luckily, you can specify the encoding of strings you want to use for
> > most databases. Not only the encoding in which the strings are stored,
> > but also the encoding which has to be used when you send and retrieve
> > data from the database. And you can set this for each connection made.
> > 
> > Ie: you can resolve the problem by changing the connection-string, or by
> > adding some connection-parameter.
> > 
> >   
> Yes, it is true for example for MySQL or Firebird ODBC driver,
>  but for SQL Server or PostgreSQL ODBC driver there are no such
> options

Then that option has to be added. I think it's already possible but you
simply don't know how. (SQL Server is ODBC only, so that one is fixed.
For Firebird there's a 'serverencoding' parameter, or something like
that. Postgres also has some setting.)

>  (but PostgreSQL ODBC driver exists in ANSI and UNICODE version)

I saw that in an earlier message, but this also has nothing to do with
your problem. You only need the different calls when you want to use
UTF-8 in your fieldnames. (Or, and this one was tricky, in the
connection-string. But this was more than a year ago.)

>  SQL Server ODBC driver supports "AutoTranslate", see:
> http://msdn.microsoft.com/en-us/library/ms130822.aspx
>  "SQL Server char, varchar, or text data sent to a client SQL_C_CHAR
> variable is converted from character to Unicode using the server ACP,
> then converted from Unicode to character using the client ACP."

This is what you use when you set the encoding when you connect to the
server. The solution to all your problems. As explained three times, in
this message alone.

In fact it's simple: incoming data=outgoing data.

If you need UTF-8 encoding for the outgoing data (direct access to
Lazarus controls) you have to select UTF-8 at the input. That's always
more efficient than converting the data to/from any other encoding.

And, luckily, you can instruct the Database-server which encoding to use
when it's communicating with the outer world. So your problem is solved.

Now, if you also choose UTF-8 as the Database-server field encoding (the
encoding the data is stored in) there's no conversion necessary at all.
But that's a bonus, not a necessity.

Joost.