[fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

Sun Sep 28 12:22:17 CEST 2008

On Sun, 28 Sep 2008 09:23:14 +0200
Martin Schreiber <fpmse at bluewin.ch> wrote:

> On Sunday 28 September 2008 00.10:43 Graeme Geldenhuys wrote:
> > On Fri, Sep 26, 2008 at 5:02 PM, Mattias Gaertner
> >
> > <nc-gaertnma at netcologne.de> wrote:
> > > s[i]:='x' doesn't work in UTF-8, nor UTF-16, nor UTF-32.
> > >
> > > In short:
> > > A single character for all purposes can not be defined. Unicode
> > > can not be handled as array of character.
> >
> > This is what I thought, but everybody seems to sidestep the answer.
> > Thanks, Mattias, for confirming this. As I told Martin in one of my
> > replies, in the last four years I have not needed indexing into a
> > character array, and if I have to parse a string, it's normally
> > sequential anyway, so it is easy to track each character in
> > UTF-8, even if multi-byte characters are used.
> >
> >
> Note that UTF8CharAtByte() won't work in Mattias' example either.
> It seems that Apple decided to use two characters from the BMP to
> denote umlauts. Example for ä (U+00E4 LATIN SMALL LETTER A WITH
> DIAERESIS): a (U+0061 LATIN SMALL LETTER A) followed by ¨ (U+0308,
> COMBINING DIAERESIS). Mattias, please correct me if I am wrong.

You are right. (I didn't check the exact values.)
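
For reference, the two forms can be inspected with a short sketch. It is
written in Python only because its standard library ships the Unicode
tables; the names composed, decomposed and describe are made up for this
illustration.

```python
import unicodedata

composed = "\u00e4"      # ä as a single code point
decomposed = "a\u0308"   # a followed by a combining mark

def describe(s):
    """Return the Unicode name of every code point in s."""
    return [unicodedata.name(ch) for ch in s]

print(describe(composed))    # one code point
print(describe(decomposed))  # two code points

# The two strings render the same but do not compare equal.
print(composed == decomposed)  # False

# Normalization maps one form onto the other.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True

# The UTF-8 byte lengths differ as well: 2 bytes versus 3 bytes.
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))
```

Both code points of the decomposed form are inside the BMP, so the
mismatch has nothing to do with the UCS2 range.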

> So the problem is not that the characters don't fit in the UCS2
> range, the problem is that Apple uses the decomposed forms of umlauts.

Well, in the case of a-umlaut you are right. But not in general. It
only means that you cannot use UCS2 or whatever directly. You must
convert. And the conversion cannot be done trivially with some
s[i]:='x'.
Do you think Apple is so stupid as to use the decomposed form, if the
composed form is equivalent?
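
A rough sketch of why the index-style assignment is not enough, again in
Python for convenience. replace_code_point is a hypothetical helper that
imitates s[i]:='x' on code points; it is not taken from any library.

```python
import unicodedata

def replace_code_point(s, i, ch):
    """Imitate s[i] := ch, where i counts code points."""
    return s[:i] + ch + s[i + 1:]

word = "a\u0308b"  # decomposed a-umlaut followed by b: three code points

# Overwriting index 0 replaces only the base letter. The combining
# diaeresis stays behind and now attaches to the q.
naive = replace_code_point(word, 0, "q")
print([unicodedata.name(c) for c in naive])

# Composing first helps for a-umlaut, which has a single code point.
print(len(unicodedata.normalize("NFC", word)))   # 2

# It does not help in general: q plus diaeresis has no precomposed
# code point, so NFC leaves the sequence as it is.
print(len(unicodedata.normalize("NFC", naive)))  # 3
```

So even after conversion to a composed form, one visible character may
still occupy more than one array element.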

> If you work with OS X HFS you must convert to the composed normal
> form if fpGUI uses the composed form internally before processing the
> filenames in fpGUI. This is independent of using utf-8, utf-16,
> utf-32 or UCS2. You need conversion tables to do so and again, it is
> easier to handle with widestrings instead of utf-8 strings if you
> don't need characters which don't fit into BMP. And even if you want
> to support the full Unicode code point range it is simpler with
> utf-16 because there are surrogate *pairs* only.

HFS+ uses something similar to NFD, with some differences for
historical reasons. It is recommended *not* to convert on your own but
to use the Apple functions instead. They support UTF-8, the various
UTF-16 encodings and some more.

> In MSEgui I would implement the normalization into the MSEgui
> filename routines, MSEgui uses a normalized cross platform filename
> scheme anyway. 

You cannot normalize the composed and decomposed state platform
independently. For example, Linux ext3 does not normalize in any
way and therefore distinguishes between composed a-umlaut and decomposed
a-umlaut. You can even use invalid UTF-8 sequences.
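
To make that concrete, here is a small model of a filesystem that stores
names as plain bytes. It is only a sketch and does not touch a real
filesystem; same_name_bytes, same_name_nfc and is_valid_utf8 are names
invented for this example.

```python
import unicodedata

def same_name_bytes(a, b):
    """Compare two names byte by byte, without any normalization."""
    return a.encode("utf-8") == b.encode("utf-8")

def same_name_nfc(a, b):
    """Compare two names after composing both with NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def is_valid_utf8(raw):
    """Report whether a raw byte string decodes as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

composed_name = "b\u00e4r.txt"     # composed a-umlaut
decomposed_name = "ba\u0308r.txt"  # decomposed a-umlaut

print(same_name_bytes(composed_name, decomposed_name))  # False: two files
print(same_name_nfc(composed_name, decomposed_name))    # True: one file

# A Latin-1 encoded name is a legal byte string but not valid UTF-8.
print(is_valid_utf8(b"b\xe4r.txt"))  # False
```

A toolkit that always composes names before use could therefore no
longer tell the two files apart, and it would have to decide what to do
with names that do not decode at all.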

> Win32 'c:\aaaa\bbb.ext' will be normalized to MSEgui
> form '/c:/aaaa/bbb.ext', Unicode composed normalization can be done
> in the same step.

Is this normalized form used only internally in MSEgui, or must the user
use it too?

> An article about Unicode normalization:
> 
> http://en.wikipedia.org/wiki/Unicode_normalization

Thanks.
Unicode is really a zoo. The page shows that the encoding is the least
of Unicode's problems.

Mattias