[fpc-devel] Unicode in the RTL (my ideas)

Daniël Mantione daniel.mantione at freepascal.org
Wed Aug 22 22:20:35 CEST 2012



On Wed, 22 Aug 2012, Felipe Monteiro de Carvalho wrote:

> On Wed, Aug 22, 2012 at 9:36 PM, Martin Schreiber <mse00000 at gmail.com> wrote:
>> I am not talking about Unicode. I am talking about day by day programming of
>> an average programmer where life is easier with utf-16 than with utf-8.
>> Unicode is not done by using pos() instead of character indexes.
>> I think everybody knows my opinion, I stop now.
>
> Please be clear in the terminology. Don't say "life is easier with
> utf-16 than with utf-8" if you don't mean utf-16 as it is. Just say
> "life is easier with ucs-2 than with utf-8", then everything is clear
> that you are talking about ucs2 and not true utf-16.

That is nonsense.

* There are no whitespace characters beyond widechar range. This means you
   can write a routine to split a string into words without bothering about
   surrogate pairs and remain fully UTF-16 compliant.
* There are almost no characters with upper/lowercase beyond widechar range
   (the Deseret alphabet is a rare exception). That means if you write code
   that deals with character case you can, for nearly all text, ignore
   surrogate pairs and still remain UTF-16 compliant.
* You can group Korean letters into Korean syllables, again without
   bothering about surrogate pairs, as Korean is one of the many languages
   that is entirely in widechar range.
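
To illustrate the first point, here is a minimal sketch in Python (not FPC
code), assuming the string is held as a list of 16-bit UTF-16 code units. The
helper names to_utf16_units, from_utf16_units and split_words are made up for
this example. Because every code point with the White_Space property lies
below U+10000, the splitter compares single code units only and still never
cuts a surrogate pair in half.

```python
import struct

def to_utf16_units(text):
    """Encode text as a list of 16-bit UTF-16 code units (hypothetical helper)."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

def from_utf16_units(units):
    """Decode a list of UTF-16 code units back into text (hypothetical helper)."""
    raw = struct.pack("<%dH" % len(units), *units)
    return raw.decode("utf-16-le")

# Code points with the Unicode White_Space property. All are below U+10000,
# and none fall in the surrogate range 0xD800..0xDFFF.
WHITESPACE_UNITS = {
    0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020, 0x0085, 0x00A0,
    0x1680, *range(0x2000, 0x200B), 0x2028, 0x2029, 0x202F, 0x205F, 0x3000,
}

def split_words(units):
    """Split on whitespace by looking at one code unit at a time."""
    words, current = [], []
    for unit in units:
        if unit in WHITESPACE_UNITS:
            if current:
                words.append(current)
                current = []
        else:
            # Surrogate halves land here and stay next to each other.
            current.append(unit)
    if current:
        words.append(current)
    return words
```

A string such as "foo", a space, U+1D11E followed by "bar", an ideographic
space (U+3000) and "baz" splits into three words, with the surrogate pair of
U+1D11E left intact inside the second word.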

Many more examples exist. It's true there exist also many examples where 
surrogates do need to be handled.
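
One such case is counting code points rather than code units. The following
is a rough sketch in Python (not FPC code), again assuming the string is held
as a list of 16-bit UTF-16 code units; to_utf16_units and count_code_points
are names invented for this example.

```python
import struct

def to_utf16_units(text):
    """Encode text as a list of 16-bit UTF-16 code units (hypothetical helper)."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

def count_code_points(units):
    """Count code points, treating a valid surrogate pair as one."""
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        # A high surrogate followed by a low surrogate is one code point;
        # anything else (including a lone surrogate) counts as one unit.
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count
```

Here the number of code units and the number of code points differ as soon as
the text contains a character beyond widechar range, so indexing by code unit
is no longer enough.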

But... even if a certain piece of code doesn't handle e.g. Egyptian 
hieroglyphs correctly, there is no guarantee that UTF-8 code would either, 
since these scripts have many properties that are not compatible with text 
processing code designed for western languages; they need a lot of custom 
code.

Daniël

