[fpc-devel] Unicode support (yet again)
Hans-Peter Diettrich
DrDiettrich1 at aol.com
Wed Sep 14 17:02:14 CEST 2011
Felipe Monteiro de Carvalho wrote:
> On Tue, Sep 13, 2011 at 9:23 PM, Michael Van Canneyt
> <michael at freepascal.org> wrote:
>> One with unicode string, one with ansistring. They will have the same code,
>> but will be compiled twice, each time with a different compiler define to
>> decide which version it must be.
>
> Is this possible on UNIX? I can see that on Windows you can use the
> trick of using the W versions, which are identical except for the
> string type, and drop Windows 9x support, but is this really possible
> for the UNIX syscalls? They expect UTF-8, not the UTF-16 that
> UnicodeString uses.
A few topics:
The NT WinAPI (not 9x) *implements* everything in the Wide (UTF-16)
routines; the Ansi versions only perform the string *conversion* before
calling the Wide version. The Unix API (most probably - I don't know for
sure) has no such dual interface with internal conversion.
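To illustrate the dual-interface pattern, a minimal sketch (the names
OpenFileA/OpenFileW are hypothetical, not actual WinAPI entry points):
the Ansi routine does nothing but convert and delegate, so all real work
lives in the Wide version.

  {$mode objfpc}{$H+}

  // Hypothetical Wide worker: the actual implementation, taking UTF-16.
  function OpenFileW(const FileName: UnicodeString): THandle;
  begin
    // ... the real NT call would go here ...
    Result := 0;
  end;

  // The Ansi variant merely converts (codepage -> UTF-16) and delegates,
  // just like the NT *A entry points do.
  function OpenFileA(const FileName: AnsiString): THandle;
  begin
    Result := OpenFileW(UnicodeString(FileName));
  end;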
The NT filesystems store names in UTF-16, while Unix filesystems
conventionally store UTF-8 (strictly speaking, arbitrary byte strings).
This means that access to an NTFS or FAT32 drive under Unix requires a
string conversion in the filesystem handler.
On Windows, Ansi means any (byte-char) encoding, with different
(national) codepages on every machine. This can cause trouble for Ansi
applications (using Ansi strings) when filenames do not convert
losslessly into that codepage. Unix IMO uses UTF-8 as the Ansi encoding,
eliminating such losses, and that's why FPC also prefers UTF-8 encoding.
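A small demonstration of such a loss - a sketch, assuming the source
file is saved as UTF-8; the exact result depends on the machine's Ansi
codepage:

  {$mode objfpc}{$H+}
  {$codepage UTF8}
  var
    U: UnicodeString;
    A: AnsiString;
  begin
    U := 'Grüße 日本語';
    A := AnsiString(U);  // converted to the local Ansi codepage; on a
                         // Western codepage the CJK chars degrade to '?'
    WriteLn(A);
  end.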
But let's not forget the user!
Many users still want simple string handling, with a direct mapping
between logical and physical chars (SBCS). This is not possible at all
with UTF-8, while UTF-16 works fine at least within the BMP. This desire
for simple string handling suggests the use of UTF-16 for Unicode
strings in *user* code.
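For example (again assuming a UTF-8 encoded source file), direct char
indexing works for BMP text in UTF-16, while in UTF-8 one logical char
may span several bytes:

  {$mode objfpc}{$H+}
  {$codepage UTF8}
  var
    U16: UnicodeString;
    U8: UTF8String;
  begin
    U16 := '€uro';
    U8  := UTF8Encode(U16);
    WriteLn(Length(U16));  // 4: '€' is a single UTF-16 code unit (BMP)
    WriteLn(Length(U8));   // 6: '€' occupies three bytes in UTF-8
    WriteLn(U16[1]);       // direct indexing yields the logical char '€'
  end.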
WRT the latter argument, FPC IMO should follow the Delphi implementation
of Unicode strings as UTF-16. This choice is independent of the
(platform-dependent) RTL conventions, but it affects the standard
components (string lists...) in the FCL and the other components in the
LCL. Here again the average user will prefer UTF-16 component libraries,
compatible with his own code, while more experienced users may be
happier with the current UTF-8 libraries.
English (ASCII) users may also prefer UTF-8, as long as they do not have
to (or want to) deal with strings in foreign languages. Once they have
to face non-ASCII strings in their applications, they will most probably
prefer switching to UTF-16, with few changes to their existing codebase
and coding habits(!). Really *processing* Unicode text, with all its
bells and whistles, is so complicated that it should be left to
dedicated software and libraries, while typical application code will
ignore everything beyond the char level.
IMO the number of required conversions is of little importance to the
runtime behaviour of an application. File access is always expensive, so
a single conversion into the platform-specific filename representation
is not perceptible at all. The same goes for GUI components, which
typically store all strings twice: once for their own (and the
application's) use, and another copy in the widgets. Here again,
transfers of strings between widgets and components are rare, with
negligible slowdown from occasional conversions during message handling.
More important IMO is the external storage of Unicode, where I see no
reasonable way around UTF-8, considering codepage dependencies and
UTF-16 byte-order problems.
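For instance, storing text as UTF-16 forces a byte-order decision and
usually a BOM, while UTF-8 bytes read the same on every platform. A
minimal sketch (the helper name is mine, not an RTL routine):

  uses Classes;

  // Write a string to disk as UTF-8: no BOM, no byte-order ambiguity.
  procedure SaveAsUtf8(const FileName: string; const Text: UnicodeString);
  var
    Bytes: UTF8String;
    F: TFileStream;
  begin
    Bytes := UTF8Encode(Text);
    F := TFileStream.Create(FileName, fmCreate);
    try
      if Bytes <> '' then
        F.WriteBuffer(Bytes[1], Length(Bytes));
    finally
      F.Free;
    end;
  end;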
Another note: a "set of char" is quite incompatible with Unicode/UTF-16.
This should be taken into account with *every* introduction of a Unicode
string type.
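To make this concrete: a Pascal set is limited to 256 ordinal values,
so "set of char" cannot simply be widened to WideChar, and membership
tests have to be rewritten, e.g. as explicit comparisons (a sketch):

  type
    TCharSet = set of char;            // fine: at most 256 elements
    // TWideCharSet = set of WideChar; // rejected by the compiler:
                                       // too many elements for a set

  // One workaround for BMP code units: explicit range checks.
  function IsAsciiLetter(C: WideChar): Boolean;
  begin
    Result := ((C >= 'A') and (C <= 'Z')) or
              ((C >= 'a') and (C <= 'z'));
  end;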
DoDi