[fpc-devel] Re: enumerators

Jonas Maebe jonas.maebe at elis.ugent.be
Wed Nov 17 13:20:59 CET 2010


On 17 Nov 2010, at 12:23, Michael Schnell wrote:

> Regarding that handling surrogate pairs needs tables while UTF/UCS  
> handling can be done by simple algorithms and that (AFAIK) surrogate  
> pairs are used only in certain environments (Mac and what else ?)

Surrogate pairs have nothing to do with Mac OS X. They are required
when encoding in UTF-16 any codepoint whose UTF-32 value is >=
$10000.
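
To make the rule concrete, here is a minimal sketch in Python of the purely arithmetic UTF-16 encoding step (no tables involved). The function name to_utf16_units is only an illustration and is not part of the RTL or of any library.

```python
from typing import List


def to_utf16_units(codepoint: int) -> List[int]:
    """Return the UTF-16 code units for one Unicode codepoint.

    Hypothetical helper for illustration; not an RTL or library routine.
    """
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("codepoint outside the Unicode range")
    if 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("surrogate values are not valid codepoints")

    # Codepoints below $10000 fit in a single 16-bit code unit.
    if codepoint < 0x10000:
        return [codepoint]

    # Everything from $10000 upwards is split into a surrogate pair:
    # subtract $10000, then put the top 10 bits into the high surrogate
    # and the bottom 10 bits into the low surrogate.
    offset = codepoint - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


if __name__ == "__main__":
    for cp in (0x41, 0xFFFF, 0x10000, 0x1D11E):
        units = to_utf16_units(cp)
        print(f"U+{cp:04X} ->", " ".join(f"${u:04X}" for u in units))
```

The split is pure bit arithmetic, so the same few lines apply on every platform.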

You are probably thinking of decomposed characters (where e.g. "e"
and a combining "¨" are encoded separately, instead of as "ë"). The
RTL will never do anything special about them, since they are two
regular separate codepoints. And then there's of course the fact that
more than one composed character can map to the same decomposed
character, see e.g.
http://unicode.org/reports/tr15/#Primary_Exclusion_List_Table, and the
many other issues listed on that page.
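
As an illustration of both points, the following sketch uses Python's standard unicodedata module. The choice of "ë" and of the two "Å" codepoints (U+00C5 and U+212B ANGSTROM SIGN) is mine, picked as well-known examples; the helper name codepoints is hypothetical.

```python
import unicodedata


def codepoints(text: str) -> list:
    """Hypothetical helper: list the codepoints of a string as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text]


composed = "\u00EB"      # "ë" as one codepoint
decomposed = "e\u0308"   # "e" followed by U+0308 COMBINING DIAERESIS

# Without normalization these are simply different strings: one codepoint
# versus two regular, separate codepoints.
print(codepoints(composed), codepoints(decomposed), composed == decomposed)

# Two different composed characters with the same canonical decomposition:
# U+00C5 (A WITH RING ABOVE) and U+212B (ANGSTROM SIGN).
a_ring = unicodedata.normalize("NFD", "\u00C5")
angstrom = unicodedata.normalize("NFD", "\u212B")
print(codepoints(a_ring), codepoints(angstrom), a_ring == angstrom)
```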

In general: if you want to assume that a Unicode string is in a
particular form, first convert it to a particular canonical form and
operate on that (and keep in mind that you may destroy data in the
process, like with most code page conversions).
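
A rough sketch of that advice, again in Python with the standard unicodedata module: normalize both sides to one form before comparing. The function name same_text, the default of NFC, and the OHM SIGN example are my own assumptions for illustration.

```python
import unicodedata


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Hypothetical helper: compare two strings after normalizing both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Composed and decomposed "ë" compare equal once both are normalized.
print(same_text("\u00EB", "e\u0308"))

# Normalization can destroy information: U+2126 OHM SIGN becomes
# U+03A9 GREEK CAPITAL LETTER OMEGA, and nothing in the result records
# which of the two the input originally contained.
original = "\u2126"
normalized = unicodedata.normalize("NFC", original)
print(f"U+{ord(original):04X} -> U+{ord(normalized):04X}")
```

If the original codepoints matter (e.g. for round-tripping data), keep the unnormalized string around and use the normalized copy only for processing.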


Jonas

