[fpc-devel] Unicode in the RTL (my ideas)

Daniël Mantione daniel.mantione at freepascal.org
Thu Aug 23 09:42:02 CEST 2012



On Thu, 23 Aug 2012, Hans-Peter Diettrich wrote:

> Daniël Mantione wrote:
>> On Wed, 22 Aug 2012, Felipe Monteiro de Carvalho wrote:
>> 
>>> On Wed, Aug 22, 2012 at 9:36 PM, Martin Schreiber <mse00000 at gmail.com> 
>>> wrote:
>>>> I am not talking about Unicode. I am talking about the day-to-day 
>>>> programming of an average programmer, where life is easier with utf-16 
>>>> than with utf-8. Unicode is not done by using pos() instead of 
>>>> character indexes. I think everybody knows my opinion, I stop now.
>>> 
>>> Please be clear in the terminology. Don't say "life is easier with
>>> utf-16 than with utf-8" if you don't mean utf-16 as it is. Just say
>>> "life is easier with ucs-2 than with utf-8"; then it is clear that
>>> you are talking about ucs2 and not true utf-16.
>> 
>> That is nonsense.
>> 
>> * There are no whitespace characters beyond widechar range. This means you
>>   can write a routine to split a string into words without bothering about
>>   surrogate pairs and remain fully UTF-16 compliant.
>
> How is this different for UTF-8?

Your answer exactly demonstrates how UTF-16 can result in better Unicode 
support: you probably consider the space the only whitespace character 
and would have written code that handles only the space. In Unicode you 
have the space, the no-break space, the half-space and probably a few 
more that I am missing. A sketch of such a split routine follows below.
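A minimal sketch of the point (assuming FPC with {$mode objfpc}; the names 
are mine, not RTL code): the whitespace test looks at single WideChars, 
yet the split stays fully UTF-16 compliant, because every Unicode 
whitespace character is in the BMP. Surrogate code units simply pass 
through as parts of words. The whitespace list is illustrative, not 
exhaustive.

  type
    TUnicodeStringArray = array of UnicodeString;

  function IsUnicodeSpace(c: WideChar): Boolean;
  begin
    case Ord(c) of
      $0009..$000D, $0020,         // ASCII whitespace
      $00A0,                       // no-break space
      $2000..$200A,                // en quad .. hair space
      $2028, $2029, $202F, $3000:  // separators, narrow NBSP, ideographic space
        Result := True;
    else
      Result := False;
    end;
  end;

  function SplitWords(const s: UnicodeString): TUnicodeStringArray;
  var
    i, start, n: Integer;
  begin
    SetLength(Result, 0);
    n := 0;
    start := 0;  // index of first character of current word, 0 = none
    for i := 1 to Length(s) + 1 do
      if (i > Length(s)) or IsUnicodeSpace(s[i]) then
      begin
        if start > 0 then
        begin
          SetLength(Result, n + 1);
          Result[n] := Copy(s, start, i - start);
          Inc(n);
          start := 0;
        end;
      end
      else if start = 0 then
        start := i;
  end;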

>> * There are no characters with upper/lowercase beyond widechar range.
>>   That means if you write code that deals with character case you don't
>>   need to bother with surrogate pairs and still remain fully UTF-16
>>   compliant.
>
> How expensive is a Unicode Upper/LowerCase conversion per se?

I'd expect a conversion to be quite a bit faster in UTF-16, as it can be a 
table lookup per character rather than a decode/re-encode per character; 
see the sketch below. But it's not about conversion per se: everyday code 
deals with character case in a lot more situations.
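This is only a sketch of the argument, not the actual RTL implementation: 
because every character with a case mapping is in the BMP, lowercasing 
UTF-16 text can be one table lookup per code unit. I assume the table is 
filled from the Unicode data at program startup; caseless characters and 
surrogate code units map to themselves. Special casings that change the 
string length, such as uppercasing the German ß to SS, still need 
separate handling.

  var
    LowerTable: array[WideChar] of WideChar;  // 64K entries, built once

  function FastLowerCase(const s: UnicodeString): UnicodeString;
  var
    i: Integer;
  begin
    SetLength(Result, Length(s));
    for i := 1 to Length(s) do
      Result[i] := LowerTable[s[i]];  // one lookup per UTF-16 code unit
  end;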

>> * You can group Korean letters into Korean syllables, again without
>>   bothering about surrogate pairs, as Korean is one of the many languages
>>   that lie entirely in widechar range.
>
> The same applies to English and UTF-8 ;-)
> Selected languages can be handled in special ways, but not all.

I'd disagree, because there are quite a few codepoints beyond #128 that can 
be used in English texts, e.g. currency symbols or ligatures. But suppose 
I followed your reasoning; then the list of languages your Unicode-aware 
software will handle properly is:

* English

If you are interested in proper multi-lingual support, that won't get you 
very far. With UTF-16, only a few of the 6000 languages in the world need 
codepoints beyond the Basic Multilingual Plane. In other words, you get 
very far.

> You mentioned Korean syllable splitting - is this a task occurring often 
> in Korean programs?

Yes, in Korean this is very important: Korean letters are written grouped 
into syllables, so it's a very common conversion. There are Unicode code 
points both for the individual letters (jamo) and for the syllables.

For example, when people type letters on the keyboard, you receive the 
letter code points. If you send those directly to the screen you see the 
individual letters; that's not correct Korean writing. You want to convert 
them to syllables and send the syllable code points to the screen, as in 
the sketch below.
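The composition step itself is simple arithmetic; this is the standard 
Hangul algorithm from the Unicode Standard, written out in Pascal. A 
leading consonant (L), vowel (V) and optional trailing consonant (T) jamo 
combine into one precomposed syllable, and everything involved is in the 
BMP, so plain WideChar arithmetic suffices.

  const
    SBase  = $AC00;  // first precomposed Hangul syllable
    LBase  = $1100;  // first leading-consonant jamo
    VBase  = $1161;  // first vowel jamo
    TBase  = $11A7;  // one before the first trailing-consonant jamo
    VCount = 21;     // number of vowel jamo
    TCount = 28;     // trailing-consonant positions, including "none"

  { Pass T = #0 when the syllable has no trailing consonant. }
  function ComposeSyllable(L, V, T: WideChar): WideChar;
  var
    TIndex: Integer;
  begin
    if T = #0 then
      TIndex := 0
    else
      TIndex := Ord(T) - TBase;
    Result := WideChar(SBase +
      ((Ord(L) - LBase) * VCount + (Ord(V) - VBase)) * TCount + TIndex);
  end;

For example, ComposeSyllable(#$1112, #$1161, #$11AB) yields #$D55C, the 
syllable "han" in "Hangul".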

> At the beginning of computer-based publishing most German texts were hard 
> to read, due to many word-break errors.

In Western languages, syllables are only important for word breaks, and our 
publishing software contains advanced syllable-splitting algorithms. You'd 
better not use that code for Korean texts, because there is no need to 
break words in that script.

In general... different language, different text processing algorithms...

> But another point becomes *really* important when libraries with the 
> aforementioned Unicode functions are used: the application and libraries 
> should use the *same* string encoding, to prevent frequent conversions with 
> every function call. This suggests using the library(=platform)-specific 
> string encoding, which can be different on e.g. Windows and Linux.
>
> Consequently a cross-platform program should be as insensitive as possible 
> to encodings, and the whole UTF-8/16 discussion turns out to be purely 
> academic. This leads to a different issue: should we declare a string type 
> dedicated to Unicode text processing, which can vary depending on the 
> platform/library encoding? Then everybody can decide whether to use one 
> string type (RTL/FCL/LCL compatible) for general tasks, or the 
> library-compatible type for text processing.

No disagreement here: if all your libraries are UTF-8, you don't want to 
convert everything. So, where possible, write code that is string-type 
agnostic.

Sometimes, however, you do need to look inside a string, and then it helps 
to have an encoding that is easy to process. One way to serve both worlds 
is sketched below.
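A sketch of one way to stay string-type agnostic (the function is made up 
for illustration): provide the routine for both encodings as overloads, so 
neither UTF-8 nor UTF-16 callers pay a conversion at the call boundary. 
This works here because the delimiter is ASCII, a single code unit in both 
encodings.

  function CountLines(const s: UTF8String): Integer; overload;
  var
    i: Integer;
  begin
    Result := 1;
    for i := 1 to Length(s) do
      if s[i] = #10 then  // LF is a single code unit in UTF-8
        Inc(Result);
  end;

  function CountLines(const s: UnicodeString): Integer; overload;
  var
    i: Integer;
  begin
    Result := 1;
    for i := 1 to Length(s) do
      if s[i] = #10 then  // LF is a single code unit in UTF-16
        Inc(Result);
  end;

This only works because LF never occurs inside a multi-unit sequence in 
either encoding; anything that must interpret non-ASCII characters still 
needs encoding-specific code.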

> Or should we bite the bullet and support different flavors of the FPC 
> libraries, for best performance on any platform? This would also leave it to 
> the user to select his preferred encoding, stopping any UTF discussion 
> immediately :-]

I am in favour of the RTL following the encoding that is common on a 
platform, but not dictating a string type to the programmer. If a 
programmer wants to use UTF-16 on Linux, or UTF-8 on Windows, the 
infrastructure should be there to allow this.

Daniël

