[fpc-devel] TField.AsString and Databases with UTF-8 charset

Graeme Geldenhuys graemeg at opensoft.homeip.net
Fri Jul 24 14:06:21 CEST 2009


Michael Van Canneyt wrote:
>> No, not a second query, just keeping track of extra (other) meta data 
>> information which was retrieved from the first API call to Firebird.
> 
> My databases are HUGE, and I don't think that such a query is appropriate.


It's got nothing to do with the size of your database. It is simply 
SqlDB that is using the wrong field to report the size of the Char(x) 
field definitions.


>>
>> From the Kylix 3 and Delphi 7 documentation:
> 
> Given that neither supports UTF-8, the documentation is not really
> relevant, I'd say.


Lucky for you, Embarcadero now has all its Delphi help available online. 
Nothing has changed in Delphi 2009 help. Here are links to D2009's 
online help. In both cases, the Size property is referring to characters 
and not byte length.

http://docs.embarcadero.com/products/rad_studio/delphiAndcpp2009/HelpUpdate2/EN/html/delphivclwin32/DB_TParam_Size.html
http://docs.embarcadero.com/products/rad_studio/delphiAndcpp2009/HelpUpdate2/EN/html/delphivclwin32/DB_TField_Size.html

So SqlDB with Firebird is in fact wrong when it returns Size = 8 for a 
Char(2) with UTF8 charset enabled.


> I agree that we need a solution, but I'm not convinced your solution
> is correct or even desirable.


Well, Firebird makes no sense regarding its behaviour. If you have a 
UTF-8 encoded string as follows:

  s := 'en';   // assume s is a UTF8 String type

What is the length of that string? Firebird would argue that it's 8 
bytes. But the Unicode organisation says it's 2 bytes - I tend to agree. 
ASCII characters are encoded in UTF-8 exactly as they were in ASCII, so 
they still take up only 1 byte per character.

Firebird tells me that the content of the variable s is 
equivalent to "en      " when read back from the DB, but that is 
definitely not the case.  "en" in ASCII or UTF-8 is still only "en" 
without the rubbish padding!
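
To make the numbers concrete, here is a minimal sketch of the byte 
arithmetic. It assumes the 8 comes from Firebird reserving up to 4 bytes 
per character for its UTF8 charset and padding the value out to the full 
byte buffer; the constant names are mine and purely illustrative.

```pascal
program Utf8LenSketch;
{$mode objfpc}{$H+}

const
  // Assumption: the UTF8 charset reserves at most 4 bytes per character.
  BytesPerChar  = 4;
  // The declared width of the column, i.e. the 2 in Char(2).
  DeclaredChars = 2;

var
  s, padded: UTF8String;
begin
  s := 'en';

  // Length() on a UTF8String counts bytes, not characters.
  // Both characters are ASCII, so each one takes a single byte.
  WriteLn('byte length of s: ', Length(s));

  // Simulate what is read back from the field: the text padded with
  // spaces up to the full byte buffer (DeclaredChars * BytesPerChar).
  padded := s + StringOfChar(' ', DeclaredChars * BytesPerChar - Length(s));
  WriteLn('padded value: "', padded, '" (', Length(padded), ' bytes)');
end.
```

Under that assumption the program reports 2 bytes for s and an 8 byte 
padded value, which is where the six extra spaces come from.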

PostgreSQL also supports the UTF-8 character set in databases. Surprise, 
surprise: TParam.Size and TField.Size report the x from Char(x). Also, 
the values read back from a Char(x) field don't contain any space 
padding on the right unless the actual text is shorter than the Char(x) 
definition, and the character length NEVER exceeds the Char(x) definition.

I'll report this issue to the Firebird developers as well. Whoever 
implemented the UTF-8 support in Firebird was a total idiot, and knew 
nothing about Unicode.

But in the meantime we can fix the SqlDB issue and work around the 
Firebird Char(x) issue as I explained before.
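
As a rough illustration only (this is not SqlDB code), such a fix could 
look something like the sketch below. It assumes the bytes-per-character 
value of the column's charset is already known from the metadata; 
CharSizeFromByteLen is a hypothetical helper, not an existing SqlDB or 
Firebird API function.

```pascal
program CharSizeSketch;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: turn the byte length reported for a Char(x) column
// into a size in characters, given the charset's maximum bytes per character.
function CharSizeFromByteLen(ByteLen, BytesPerChar: Integer): Integer;
begin
  // Guard against a missing or zero value in the metadata.
  if BytesPerChar < 1 then
    BytesPerChar := 1;
  Result := ByteLen div BytesPerChar;
end;

var
  raw: string;
begin
  // Char(2) with a 4 bytes-per-character charset is reported as 8 bytes.
  WriteLn('Size in characters: ', CharSizeFromByteLen(8, 4));

  // Strip the padding from the value that was read back.
  raw := 'en      ';
  WriteLn('"', TrimRight(raw), '"');
end.
```

Note that TrimRight removes every trailing space, so a value that was 
deliberately stored with trailing spaces would lose them as well.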


> Don't forget also that for unicode, the number of characters differs
> from the number of bytes. The Firebird API predates this, and so does

Firebird was a total rewrite in C++ for v1.5 or v2 (I can't remember 
exactly which). That was pretty recent, so there is no excuse like 
legacy code for such crappy Unicode support.


Regards,
   - Graeme -

-- 
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://opensoft.homeip.net/fpgui/



