[fpc-devel] Unicode support (yet again)
Hans-Peter Diettrich
DrDiettrich1 at aol.com
Wed Sep 14 17:02:14 CEST 2011
Felipe Monteiro de Carvalho wrote:
> On Tue, Sep 13, 2011 at 9:23 PM, Michael Van Canneyt
> <michael at freepascal.org> wrote:
>> One with unicode string, one with ansistring. They will have the same code,
>> but will be compiled twice, each time with a different compiler define to
>> decide which version it must be.
>
> Is this possible on UNIX? I can see that on Windows you can use the
> trick of using the W versions, which are identical except for the
> string type, and drop Windows 9x support, but is this really possible
> for the UNIX syscalls? They expect UTF-8, not the UTF-16 that
> UnicodeString uses.
A few topics:
The NT WinAPI (not 9x) *implements* everything in the Wide (UTF-16)
routines; the Ansi versions only perform the string *conversion* before
calling the Wide version. The Unix API (most probably - I don't know for
sure) has no such dual interface with internal conversion.
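To illustrate the dual-interface pattern, a minimal sketch (the names
OpenFileA/OpenFileW are hypothetical, not actual WinAPI entry points):
the Ansi routine does nothing but convert and delegate, so all real work
lives in the Wide version.

  {$mode objfpc}{$H+}

  // Hypothetical Wide worker: the actual implementation, taking UTF-16.
  function OpenFileW(const FileName: UnicodeString): THandle;
  begin
    // ... the real NT call would go here ...
    Result := 0;
  end;

  // The Ansi variant merely converts (codepage -> UTF-16) and delegates,
  // just like the NT *A entry points do.
  function OpenFileA(const FileName: AnsiString): THandle;
  begin
    Result := OpenFileW(UnicodeString(FileName));
  end;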
The NT filesystems store names in UTF-16, while Unix filesystems
conventionally store UTF-8 (strictly speaking, arbitrary byte strings).
This means that access to an NTFS or FAT32 drive under Unix requires a
string conversion in the filesystem handler.
On Windows, Ansi means any (byte-char) encoding, with different
(national) codepages on every machine. This can cause trouble for Ansi
applications (using Ansi strings) when filenames do not convert
losslessly into that codepage. Unix IMO uses UTF-8 as the Ansi encoding,
eliminating such losses, and that's why FPC also prefers UTF-8 encoding.
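A small demonstration of such a loss - a sketch, assuming the source
file is saved as UTF-8; the exact result depends on the machine's Ansi
codepage:

  {$mode objfpc}{$H+}
  {$codepage UTF8}
  var
    U: UnicodeString;
    A: AnsiString;
  begin
    U := 'Grüße 日本語';
    A := AnsiString(U);  // converted to the local Ansi codepage; on a
                         // Western codepage the CJK chars degrade to '?'
    WriteLn(A);
  end.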
But let's not forget the user!
Many users still want simple string handling, with a direct mapping
between logical and physical chars (SBCS). This is not possible at all
with UTF-8, while UTF-16 works fine at least within the BMP. This desire
for simple string handling suggests the use of UTF-16 for Unicode
strings in *user* code.
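For example (again assuming a UTF-8 encoded source file), direct char
indexing works for BMP text in UTF-16, while in UTF-8 one logical char
may span several bytes:

  {$mode objfpc}{$H+}
  {$codepage UTF8}
  var
    U16: UnicodeString;
    U8: UTF8String;
  begin
    U16 := '€uro';
    U8  := UTF8Encode(U16);
    WriteLn(Length(U16));  // 4: '€' is a single UTF-16 code unit (BMP)
    WriteLn(Length(U8));   // 6: '€' occupies three bytes in UTF-8
    WriteLn(U16[1]);       // direct indexing yields the logical char '€'
  end.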
WRT the latter argument, FPC IMO should follow the Delphi implementation
of Unicode strings as UTF-16. This choice is independent of the
(platform-dependent) RTL conventions, but it affects the standard
components (string lists...) in the FCL and the other components in the
LCL. Here again the average user will prefer UTF-16 component libraries,
compatible with his own code, while more experienced users may be
happier with the current UTF-8 libraries.
English (ASCII) users may also prefer UTF-8, as long as they do not have
to (or want to) deal with strings in foreign languages. Once they have
to face non-ASCII strings in their applications, they will most probably
prefer switching to UTF-16, with few changes to their existing codebase
and coding habits(!). Really *processing* Unicode text, with all its
bells and whistles, is so complicated that it should be left to
dedicated software and libraries, while typical application code will
ignore everything beyond the char level.
IMO the number of required conversions is of little importance to the
runtime behaviour of an application. File access is always expensive, so
a single conversion into the platform-specific filename representation
is not perceptible at all. The same goes for GUI components, which
typically store all strings twice: once for their own (and the
application's) use, and another copy in the widgets. Here again,
transfers of strings between widgets and components are rare, with
negligible slowdown from occasional conversions during message handling.
More important IMO is the external storage of Unicode, where I see no
reasonable way around UTF-8, considering codepage dependencies and
UTF-16 byte-order problems.
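For instance, storing text as UTF-16 forces a byte-order decision and
usually a BOM, while UTF-8 bytes read the same on every platform. A
minimal sketch (the helper name is mine, not an RTL routine):

  uses Classes;

  // Write a string to disk as UTF-8: no BOM, no byte-order ambiguity.
  procedure SaveAsUtf8(const FileName: string; const Text: UnicodeString);
  var
    Bytes: UTF8String;
    F: TFileStream;
  begin
    Bytes := UTF8Encode(Text);
    F := TFileStream.Create(FileName, fmCreate);
    try
      if Bytes <> '' then
        F.WriteBuffer(Bytes[1], Length(Bytes));
    finally
      F.Free;
    end;
  end;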
Another note: a "set of char" is quite incompatible with Unicode/UTF-16.
This should be taken into account with *every* introduction of a Unicode
string type.
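To make this concrete: a Pascal set is limited to 256 ordinal values,
so "set of char" cannot simply be widened to WideChar, and membership
tests have to be rewritten, e.g. as explicit comparisons (a sketch):

  type
    TCharSet = set of char;            // fine: at most 256 elements
    // TWideCharSet = set of WideChar; // rejected by the compiler:
                                       // too many elements for a set

  // One workaround for BMP code units: explicit range checks.
  function IsAsciiLetter(C: WideChar): Boolean;
  begin
    Result := ((C >= 'A') and (C <= 'Z')) or
              ((C >= 'a') and (C <= 'z'));
  end;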
DoDi