[fpc-devel] TField.AsString and Databases with UTF-8 charset
Michael Van Canneyt
michael at freepascal.org
Fri Jul 24 15:31:32 CEST 2009
On Fri, 24 Jul 2009, Graeme Geldenhuys wrote:
> Michael Van Canneyt wrote:
>>> No, not a second query, just keeping track of extra (other) meta data
>>> information which was retrieved from the first API call to Firebird.
>> My databases are HUGE, and I don't think that such a query is appropriate.
> It's got nothing to do with the size of your database. It is simply SqlDB
> that is using the wrong field to report the size of the Char(x) field
Which field should it use according to you then ?
>>> From the Kylix 3 and Delphi 7 documentation:
>> Given that neither supports UTF-8, the documentation is not really
>> relevant, I'd say.
> Lucky for you, Embarcadero now has all its Delphi help available online.
> Nothing has changed in Delphi 2009 help. Here are links to D2009's online
> help. In both cases, the Size property is referring to characters and not
> byte length.
> So SqlDB with Firebird is in fact wrong when it returns Size = 8 for a
> Char(2) with UTF8 charset enabled.
Yes, but assume that a size of 2 is returned. This means a buffer of 2
bytes (in ansistring byte=character) will be reserved for the data.
What happens if some strange unicode string of 4 or even 8 bytes is
returned by Firebird ? A Buffer overflow...
So SQLDB "agrees with firebird" and reserves 8 bytes because that is the
max what can be returned. The problem being that if an ansistring is
requested, 1 char = 1 byte...
Hence my idea that TStringField is not suitable, but TUTF8Field is needed,
where size 2 is reported as it should, but a buffer of 8 is reserved for
the string. Or else a 'charsize' property must be introduced in TFieldDef
>> I agree that we need a solution, but I'm not convinced your solution
>> is correct or even desirable.
> Well, Firebird makes no sense regarding it's behaviour. If you have an UTF-8
> encoded string as follows:
> s := 'en'; // assume s is a UTF8 String type
> What is the length of that string? Firebird would argue that it's 8 bytes.
> But the Unicode organisation says it's 2 bytes - I tend to agree. The ASCII
> character set is represented in UTF-8 and they work as they did in ASCII.
> They also only take up 1 byte per character.
> Firebird now tells me that the content of the variable s is now equivalent to
> "en " when read back from the DB, but it is definitely not the case.
> "en" in ASCII or UTF-8 is still only "en" without the rubbish padding!
But the padding is added by SQLDB, not by firebird.
> PostgreSQL also supports the UTF-8 character set in databases. Surprise,
> surprise TParam.Size and TField.Size report the value of Char(x). Also the
> return values read from the Char(x) field don't contain any space padding on
> the right unless the actual text is less than the Char(x) definition. Also
> the character length NEVER exceeds the Char(x) definition.
> I'll report this issue to the Firebird developers as well. Whoever
> implemented the UTF-8 support in Firebird was a total idiot, and knew nothing
> about Unicode.
> But in the mean time we can fix the SqlDB issue and work around the Firebird
> Char(x) issue as I explained before.
The problem is deeper than you see, and is not related to SQLDb, but to the
implicit assumption in TBufDataset that for TStringField, 1 char = 1 byte:
The situation is that TBufDataset gets a fielddef for a string field, with
size N. So it reserves a buffer of N bytes, because 1 char = 1 byte.
This assumption is not correct for UTF-8 encoded strings.
As a consequence, my prediction is that, because it reports a size in characters,
the postgres implementation will suffer of buffer overflows as soon as strange
(=multibyte) unicode characters are returned.
Whereas the firebird implementation will not buffer overflow since it returns a
size in bytes (but instead it suffers from the problem you encountered).
We need to foresee the possibility where the size in characters is
different from the size in bytes. If I am correct, the TWideStringField
accomodates for this, it uses a character size of 2.
>> Don't forget also that for unicode, the number of characters differs
>> from the number of bytes. The Firebird API predates this, and so does
> Firebird was a total rewrite in C++ for v1.5 or v2 (I can't remember exactly
> which). That was pretty recent, so there is no excuse like legacy code for
> such crappy Unicode support.
Well, I'm not versed enough in Unicode to comment on their implementation.
I do know that they went to extreme lengths to be backwards compatible,
which probably accounts for the weird behaviour.
I would like Joost to comment on the whole issue, he is best placed to
discuss a possible remedy for this problem:
1. Foresee a charsize field in TStringField
2. Map UTF8 fields to widestring and convert them on the go.
3. Introduce a TUTF8field which equals TStringField, but which reserves
a buffer of 4 bytes per character.
At that point, the firebird connection must be adapted to report the
character length instead of the buffer length.
More information about the fpc-devel