[fpc-devel] Unicode support (yet again)

Thu Sep 15 19:09:08 CEST 2011

Graeme Geldenhuys wrote:
> On 14/09/2011 17:02, Hans-Peter Diettrich wrote:
>> Many  users still  want simple  string handling,  with direct  mapping
>> between logical and physical chars (SBCS). This is not possible at all
>> with UTF-8, while UTF-16 works fine with the BMP, at least.
> 
> What rubbish! The only "utf-8 limit" is  that the current FPC and Delphi
> RTL's  don't cater  for it  due  to the  legacy ANSI  support that  came
> before.

What data type would you use to store a UTF-8 character?
And how would you access the n-th character in a UTF-8 string?
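
To make the second question concrete, here is a minimal sketch in Python
(not FPC code, and not taken from any RTL). The helper name
`utf8_char_at` is made up for illustration. It assumes well-formed UTF-8
input and shows that finding the n-th code point means scanning from the
start of the string, because a character occupies 1 to 4 bytes:

```python
def utf8_char_at(data: bytes, n: int) -> str:
    """Return the n-th code point (0-based) of a well-formed UTF-8 byte string.

    Hypothetical helper for illustration only. It walks the bytes from the
    start, so the cost grows with n instead of being a constant-time index.
    """
    index = -1      # number of the code point we are currently inside
    start = None    # byte offset where the requested code point begins
    for pos, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; anything else starts a char.
        if byte & 0xC0 != 0x80:
            index += 1
            if index == n:
                start = pos
            elif index == n + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:].decode("utf-8")


text = "aé€😀z".encode("utf-8")   # 1 + 2 + 3 + 4 + 1 = 11 bytes
print(len(text), utf8_char_at(text, 2), utf8_char_at(text, 4))
```

The result also needs a variable-length type (here a string), since no
single byte can hold an arbitrary UTF-8 character.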
...

>> (platform  dependent) RTL  conventions,  but it  affects the  standard
>> components (string lists...)  in the FCL, and the  other components in
>> the LCL.
> 
> Please give a concrete example  where using platform dependent encodings
> (eg: UnicodeString  =  UTF-8  on  Linux, but  UTF-16  on  Windows)  will
> cause  problems? I really  cannot see  any issues  here, only  positives
> like  better  performance   for  each  platform  due  to   no  need  for
> auto-conversions.

As already pointed out, string encoding conversions between application 
and widgets are rare; consequently, performance depends more on string 
handling in application code. The new Delphi string types, which convert 
automatically when required, can cause a slowdown there. In FPC, 
character-based access to strings (iterators...) can cause a slowdown 
as well.

When a multi-platform application has to expect UTF-8 strings on some 
platforms, its code must be MBCS aware. That means complicated string 
handling where immediate indexed access would otherwise be possible :-(
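
A small sketch of why the same index does not mean the same thing on
every platform, again in Python rather than Pascal. The function name
`code_units` and the three encodings chosen are only assumptions for the
example. It counts the storage units (bytes for Latin-1 and UTF-8,
16-bit words for UTF-16) that one text needs:

```python
# Size in bytes of one code unit for each encoding used in this sketch.
UNIT_SIZE = {"latin-1": 1, "utf-8": 1, "utf-16-le": 2}


def code_units(text: str, encoding: str) -> int:
    """Count the code units `text` occupies in `encoding` (illustrative helper)."""
    return len(text.encode(encoding)) // UNIT_SIZE[encoding]


word = "Grüße"   # five characters, all inside the BMP
for enc in ("latin-1", "utf-8", "utf-16-le"):
    print(enc, code_units(word, enc))

# Outside the BMP, UTF-16 needs a surrogate pair, so it is MBCS as well.
print("utf-16-le", code_units("😀", "utf-16-le"))
```

For "Grüße" the Latin-1 and UTF-16 counts match the character count, so
index i is character i; the UTF-8 count is larger, so an index into the
bytes can land in the middle of a character.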

>> Here again  the average user  will prefer UTF-16  component libraries,
>> compatible  with his  own code,  while more  experienced users  may be
>> happier with the current UTF-8 libraries.
> 
> What the  hell has "experience"  got to  do with the  preference between
> UTF-8  and UTF-16? To  the developer  (and more  so to  the end-user)  a
> Unicode string should  act like any other  Unicode string. What encoding
> is used to represent "hello world" shouldn't even come into question.

This applies only to constant string literals, where the user never has 
to care about string encoding and conversion.

>> English (ASCII)  users also may prefer  UTF-8, as long as  they do not
>> have to (or want to) deal with strings in foreign languages.
> 
> Rubbish  once again! Our  applications  use UTF-8,  I  have no  problems
> writing application that support multiple  foreign language - as long as
> those  languages are  left-to-right (I  don't understand  RTL languages,
> so  can't  comment).

You really should understand them ;-)

RTL is merely a *display* feature; the chars are still stored from first 
to last. More important is the SBCS/MBCS difference, which must be 
reflected in user code. Even if *you* have no problems with MBCS (like 
UTF-8), other users do.
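
As a small illustration of the storage order, here is a Python sketch
using the standard `unicodedata` module. The helper name `stored_order`
is made up for the example. It lists the characters of a Hebrew word in
the order they sit in memory, together with their bidirectional class:

```python
import unicodedata


def stored_order(text: str) -> list:
    """Return 'bidi-class:name' for each code point, in storage order."""
    return [
        f"{unicodedata.bidirectional(ch)}:{unicodedata.name(ch)}"
        for ch in text
    ]


# "shalom": shin, lamed, vav, final mem, typed and stored in this order.
word = "\u05e9\u05dc\u05d5\u05dd"
for entry in stored_order(word):
    print(entry)
```

The first stored character is the first one read (shin); only the
renderer places it at the right-hand edge, based on the "R" class.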

DoDi