[fpc-devel] Unicode support (again)

Tue Nov 11 12:51:17 CET 2008

Michael Schnell wrote:
>
>>
>> It will at best be "friendly old school behaviour which works most of 
>> the time, but which fails as soon as the strings are not completely 
>> normalised because then you can have decomposed characters and 
>> whatnot" (which in turn easily leads to security holes due to 
>> incomplete checks, hard to reproduce bugs and "write once, debug 
>> everywhere"-style behaviour).
> Sorry, I don't understand. What not normalized behavior needs to be 
> taken into account ?
Remember that an individual code point does not necessarily represent 
what a user would consider a character. Indeed, one character may be 
representable in more than one way (either as a precomposed character or 
as a sequence of base character and combining diacritic). And even if we 
ignore combining diacritics, the number of console positions a string 
takes is not necessarily equal to the code point count either, since many 
CJK characters take two console positions.
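
To make the two points above concrete, here is a small sketch in Python (chosen only for illustration, it is not FPC code). It uses the standard unicodedata module; the helper console_width is a hypothetical name and only a rough approximation of what a terminal really does.

```python
import unicodedata

# The same user-visible character, encoded two ways.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT, two code points


def console_width(s):
    """Approximate console columns (hypothetical helper, not a full wcwidth).

    Combining marks count as 0 columns, East Asian wide/fullwidth
    characters as 2, everything else as 1.
    """
    width = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


# A naive comparison treats the two spellings as different strings;
# they only compare equal after normalisation.
same_raw = precomposed == decomposed
same_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

cjk = "\u65e5\u672c\u8a9e"   # three CJK code points
cjk_columns = console_width(cjk)
```

Here same_raw is False while same_nfc is True, which is exactly the kind of incomplete check the quoted text warns about, and the three code points in cjk occupy six columns under this approximation.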

Given these facts, code point counts and indexes are not much more 
useful than code unit counts and indexes.
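
As a rough illustration (again in Python, purely as a sketch), the counts for one short string can be compared side by side. The function name unit_counts is made up for this example; the UTF-16 figure is derived by encoding and dividing the byte length by two.

```python
def unit_counts(s):
    """Return code point and code unit counts for s (illustrative helper)."""
    return {
        "code_points": len(s),
        "utf8_units": len(s.encode("utf-8")),
        "utf16_units": len(s.encode("utf-16-le")) // 2,
    }


# "e" + combining acute, one CJK ideograph, one character outside the BMP.
sample = "e\u0301\u65e5\U0001D11E"
counts = unit_counts(sample)
```

For this sample a user would see three characters, but counts holds 4 code points, 10 UTF-8 code units and 5 UTF-16 code units. The code point count is closer, but it is still not the number the user cares about.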

And if you need something better than either code point count or code 
unit count then you have little choice but to pull in an external 
library. Pulling in an external library with a relatively unstable 
interface is not something the compiler or RTL should be doing IMO.