[fpc-devel] Unicode in the RTL (my ideas)

Hans-Peter Diettrich DrDiettrich1 at aol.com
Thu Aug 23 18:51:02 CEST 2012


Daniël Mantione wrote:

>>> * There are no whitespace characters beyond widechar range. This means
>>>   you can write a routine to split a string into words without bothering
>>>   about surrogate pairs and remain fully UTF-16 compliant.
>>
>> How is this different for UTF-8?
> 
> Your answer exactly demonstrates how UTF-16 can result in better Unicode 
> support: You probably consider the space the only white-space character 
> and would have written code that only handles the space. In Unicode you 
> have the space, the non-breaking space, the half-space and probably a 
> few more that I am missing.

IOW: All of this is independent of UTF-8 vs. UTF-16.
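To make that concrete: all Unicode whitespace codepoints lie in the BMP, so a
per-code-unit test suffices in UTF-16, and the very same classification serves
a UTF-8 decoder after decoding. A minimal sketch (my own illustration, not a
proposal for the RTL; the whitespace list is not exhaustive):

program SplitWords;
{$mode objfpc}{$H+}

// True for UTF-16 code units that are Unicode whitespace. All of them
// are in the BMP, so no surrogate handling is needed.
function IsUnicodeSpace(C: WideChar): Boolean;
begin
  case Word(C) of
    $0009..$000D, $0020,          // tab, LF, VT, FF, CR, space
    $00A0,                        // no-break space
    $2000..$200A,                 // en space ... hair space
    $202F, $205F, $3000:          // narrow NBSP, math space, ideographic space
      Result := True;
  else
    Result := False;
  end;
end;

var
  S, CurWord: UnicodeString;
  I: Integer;
begin
  S := 'Hello' + WideChar($00A0) + 'wide  world';
  CurWord := '';
  for I := 1 to Length(S) do
    if IsUnicodeSpace(S[I]) then
    begin
      if CurWord <> '' then WriteLn(CurWord);
      CurWord := '';
    end
    else
      CurWord := CurWord + S[I];
  if CurWord <> '' then WriteLn(CurWord);
end.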

>>> * There are no characters with upper/lowercase beyond widechar range.
>>>   That means if you write code that deals with character case you don't
>>>   need to bother with surrogate pairs and still remain fully UTF-16
>>>   compliant.
>>
>> How expensive is a Unicode Upper/LowerCase conversion per se?
> 
> I'd expect a conversion would be quite a bit faster in UTF-16, as it can be 
> a table lookup per character rather than a decode/re-encode per character.

UTF-8 decoding allows the use of nested lookup tables, so that the lookup
can possibly become even faster than for UTF-16.
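What I mean by nested tables, as a rough sketch (my own, not FPC's actual
implementation; only the ASCII page is filled, purely for illustration):

program NestedLookup;
{$mode objfpc}{$H+}

type
  TPage = array[0..255] of Word;  // second stage: low byte -> mapped codepoint
  PPage = ^TPage;

var
  Pages: array[0..255] of PPage;  // first stage: high byte -> page, nil = identity
  AsciiPage: TPage;               // second-stage page for codepoints $0000..$00FF
  I: Integer;

// Look up the simple uppercase mapping of a BMP codepoint.
function ToUpperBMP(CP: Word): Word;
var
  P: PPage;
begin
  P := Pages[CP shr 8];
  if P = nil then
    Result := CP                  // no page stored: identity mapping
  else
    Result := P^[CP and $FF];
end;

begin
  // Fill only the $00xx page: identity except a..z -> A..Z.
  for I := 0 to 255 do
    AsciiPage[I] := I;
  for I := Ord('a') to Ord('z') do
    AsciiPage[I] := I - 32;
  Pages[$00] := @AsciiPage;

  WriteLn(ToUpperBMP(Ord('q')));  // 81, i.e. Ord('Q')
  WriteLn(ToUpperBMP($00E9));     // 233 (e acute): outside the filled range, unchanged
end.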

> But it's not about conversion per se, everyday code deals with character 
> case in a lot more situations.

I wanted to separate the two possibly time-consuming parts, i.e. the
lookup of the upper/lower case characters from placing the characters into
strings. As in the whitespace case above, the Unicode lookup functions
become the bottleneck when the strings can be updated in place.
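For instance, a sketch of such an in-place update (my own illustration; it
covers only simple 1:1 mappings, real Unicode case mapping can change the
length, e.g. German sharp s -> SS):

program InPlaceCase;
{$mode objfpc}{$H+}

// Uppercase the ASCII letters of S without allocating a new string.
// The per-character lookup (trivial here) is then the only cost that
// depends on the case-mapping data.
procedure UpCaseInPlace(var S: UnicodeString);
var
  I: Integer;
begin
  UniqueString(S);                       // make sure we own the buffer
  for I := 1 to Length(S) do
    if (S[I] >= 'a') and (S[I] <= 'z') then
      S[I] := WideChar(Ord(S[I]) - 32);  // replace the code unit in place
end;

var
  S: UnicodeString;
begin
  S := 'mixed Case text';
  UpCaseInPlace(S);
  WriteLn(S);                            // MIXED CASE TEXT
end.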


>>> * You can group Korean letters into Korean syllables, again without
>>>   bothering about surrogate pairs, as Korean is one of the many 
>>> languages
>>>   that is entirely in widechar range.
>>
>> The same applies to English and UTF-8 ;-)
>> Selected languages can be handled in special ways, but not all.
> 
> I'd disagree, because there are quite a few codepoints that can be used 
> for English texts beyond #128, e.g. currency symbols or ligatures,

How is that related to the encoding?

> but suppose I'd follow your reasoning, the list of languages your 
> Unicode aware software will handle properly is:
> 
> * English

No, it's German.

Almost every program handles only one language, the language of the
implementor or his boss.

Breaking a text into words works the same for all character-based
languages, in all encodings. The rest (Chinese...) does not normally
need such a function, since every codepoint already represents a word.

> If you are interested in proper multi-lingual support... you won't get 
> very far. In UTF-16 only a few of the 6000 languages in the world need 
> codepoints beyond the basic multi-lingual plane. In other words you get 
> very far.

Language specifics enter the scene only when working at a lower level
(syllables, characters), and nobody should do that without knowledge of
the concrete language. It's not a matter of Unicode at all, and not
necessarily a matter of string encodings. E.g. all languages based on
Latin characters (with additions like accents) require less memory in UTF-8
than in UTF-16 encoding, and thus the n in O(n) is smaller than for
operations on UTF-16 encoded strings.
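For illustration (my own example text, the exact numbers of course depend on
the text):

program EncodingSize;
{$mode objfpc}{$H+}
{$codepage UTF8}                       // assumes this source file is saved as UTF-8

var
  U8:  UTF8String;
  U16: UnicodeString;
begin
  U16 := 'Grüße aus München';          // German text with accented letters
  U8  := UTF8Encode(U16);
  WriteLn('UTF-16 bytes: ', Length(U16) * SizeOf(WideChar));  // 17 code units = 34 bytes
  WriteLn('UTF-8  bytes: ', Length(U8));                      // 20 bytes
end.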


>> You mentioned Korean syllable splitting - is this a task occurring 
>> often in Korean programs?
> 
> Yes, in Korean this is very important, because Korean letters are 
> written in syllables, so it's a very common conversion. There are 
> Unicode code points both for the letters and for the syllables.

Then I don't understand why you want to break a text into syllables,
when it is known that each codepoint already represents a syllable?

> For example, when people type letters on the keyboard, you receive the 
> letter Unicode points. If you send those directly to the screen you 
> see the individual letters; that's not correct Korean writing, you want 
> to convert to syllables and send the Unicode points for the syllables to 
> the screen.

That's the task of an IME (on Windows), not of the program, and it works
faster in any encoding than the user can press keys. The mapping into
syllables should also be done in a library function, not in user-written
code.
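Such a library function can use the composition arithmetic defined in the
Unicode standard; a sketch (the constants are the standard ones, the program
around them is only an illustration):

program HangulCompose;
{$mode objfpc}{$H+}
uses SysUtils;

// Constants from the Unicode standard's Hangul syllable composition.
const
  SBase  = $AC00;   // first precomposed syllable
  LBase  = $1100;   // first leading consonant (choseong)
  VBase  = $1161;   // first vowel (jungseong)
  TBase  = $11A7;   // one before the first trailing consonant (jongseong)
  VCount = 21;
  TCount = 28;

// Compose leading consonant L, vowel V and optional trailing consonant T
// (T = 0 for none) into one precomposed syllable codepoint.
function ComposeSyllable(L, V, T: Cardinal): Cardinal;
var
  TIndex: Cardinal;
begin
  if T = 0 then
    TIndex := 0
  else
    TIndex := T - TBase;
  Result := SBase + ((L - LBase) * VCount + (V - VBase)) * TCount + TIndex;
end;

begin
  // U+1112 + U+1161 + U+11AB -> U+D55C (the syllable HAN)
  WriteLn('U+', IntToHex(ComposeSyllable($1112, $1161, $11AB), 4));
end.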

>> At the beginning of computer-based publishing, most German texts were 
>> hard to read due to many word-break errors.
> 
> In Western languages, syllables are only important for word breaks, and 
> our publishing software contains advanced syllable-splitting algorithms. 
> You'd better not use that code for Korean texts, because there exists no 
> need to break words in that script.
> 
> In general... different language, different text processing algorithms...

Exactly my point :-)

Dealing with text is different from dealing with strings, and requires
language-specific libraries. Then the string encoding is quite
unimportant to the user.


>> But another point becomes *really* important, when libraries with 
>> the aforementioned Unicode functions are used: The application and 
>> libraries should use the *same* string encoding, to prevent frequent 
>> conversions with every function call. This suggests using the 
>> library (=platform) specific string encoding, which can be different on 
>> e.g. Windows and Linux.
>>
>> Consequently a cross-platform program should be as insensitive as 
>> possible to encodings, and the whole UTF-8/16 discussion turns out to 
>> be purely academic. This leads again to a different issue: should we 
>> declare a string type dedicated to Unicode text processing, which can 
>> vary depending on the platform/library encoding? Then everybody can 
>> decide whether to use one string type (RTL/FCL/LCL compatible) for 
>> general tasks, and the library-compatible type for text processing?
> 
> No disagreement here, if all your libraries are UTF-8, you don't want to 
> convert everything. So if possible, write code to be string-type 
> agnostic.
> 
> Sometimes, however, you do need to look inside a string, and it does 
> help to have an easy encoding then.
> 
>> Or should we bite the bullet and support different flavors of the FPC 
>> libraries, for best performance on any platform? This would also leave 
>> it to the user to select his preferred encoding, stopping any UTF 
>> discussion immediately :-]
> 
> I am in favour of the RTL following the encoding that is common on a 
> platform, but not dictating a string type to the programmer. If a 
> programmer wants to use UTF-16 on Linux, or UTF-8 on Windows, the 
> infrastructure should be there to allow this.

The user always has the choice, by explicitly using
AnsiString/UTF8String or WideString/UnicodeString instead of only
"string". The question is the library support, in particular for the
commonly used components. E.g. should separate TStringList classes be
provided for Ansi (byte) and Unicode (word) strings?
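As a sketch of what such per-encoding support could look like from the
user's side, with overloaded routines instead of separate classes
(hypothetical example code, not existing RTL routines):

program OverloadSketch;
{$mode objfpc}{$H+}

// One routine per string family, so that no conversion happens
// regardless of which kind of "string" the caller uses.
function Quoted(const S: AnsiString): AnsiString; overload;
begin
  Result := '"' + S + '"';          // byte-string flavour
end;

function Quoted(const S: UnicodeString): UnicodeString; overload;
begin
  Result := '"' + S + '"';          // word-string flavour
end;

begin
  WriteLn(Quoted(AnsiString('ansi')));
  WriteLn(Quoted(UnicodeString('wide')));
end.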

The Delphi RTL (XE) supports two string types, with overloaded functions
for CP_ACP and Unicode; all other encodings require intermediate
conversion into UTF-16 and back again. FPC should add UTF-8 as a third
choice, and also TFileName if required.
Perhaps a *real* binary encoding should also be added, to keep all those 
people happy who like to use AnsiString for buffering binary data; 
rewriting legacy code to use TBytes instead is not user-friendly at 
all :-(
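With current means this can only be approximated, e.g. by marking the payload
as code-page-less so that no conversion is ever applied (a sketch, not the
dedicated binary encoding suggested above):

program BinaryBuffer;
{$mode objfpc}{$H+}

var
  Buf: RawByteString;
  I: Integer;
begin
  SetLength(Buf, 8);
  for I := 1 to Length(Buf) do
    Buf[I] := AnsiChar(I * 16);          // arbitrary binary payload
  SetCodePage(Buf, CP_NONE, False);      // mark as raw bytes: never convert
  WriteLn('buffered bytes: ', Length(Buf));
end.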

DoDi



