[fpc-devel] Unicode support in RTL - Roadmap

Fri Nov 21 16:56:05 CET 2008

On 21 Nov 2008, at 16:16, Michael Schnell wrote:

>>> So UTF8ElementLength('Ü') would be 2 and UTF8PointLength('Ü')  
>>> would be 1.
>> Or 2, depending on whether it's precomposed or decomposed.
> I seem to remember that we discussed this some time ago and the  
> result was that the composed (Mac style?)

Decomposed and precomposed have nothing to do with Windows vs Mac OS X  
vs Linux vs whatever. They are both equally valid ways to represent  
Unicode strings, and both have their uses (on all platforms). All programs  
should also be prepared to deal with them, since you never know what  
kind of input you will get.
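
To make the two forms concrete, here is a minimal sketch in Python (not FPC code). The helper names utf8_element_length and utf8_point_length are hypothetical stand-ins for the UTF8ElementLength and UTF8PointLength routines proposed in this thread, assuming "element" means a UTF-8 byte and "point" means a code point.

```python
import unicodedata


def utf8_element_length(s: str) -> int:
    """Number of UTF-8 code units (bytes). Hypothetical stand-in."""
    return len(s.encode("utf-8"))


def utf8_point_length(s: str) -> int:
    """Number of Unicode code points. Hypothetical stand-in."""
    return len(s)


# The same user-visible character in both normalization forms.
precomposed = unicodedata.normalize("NFC", "\u00dc")  # U+00DC
decomposed = unicodedata.normalize("NFD", "\u00dc")   # U+0055 U+0308

for label, s in (("precomposed", precomposed), ("decomposed", decomposed)):
    print(label, utf8_point_length(s), utf8_element_length(s))
```

Both strings display as the same character, yet they differ in code point count and in byte count, on every platform.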

> characters in fact are a single code point (Unicode character) that  
> consists of two (maybe more?) complete code points that are tied  
> together by some special coding, so IMHO it can be considered as a  
> single Unicode character in both cases. If this would result in a  
> huge table of possibly composed characters I think we would stick to  
> the concept of providing decent functionality and restrict ourselves to  
> those that are currently used by the "customers" we normally address  
> (Mac in Europe and America).

I think you are talking about a different "we". Further, inventing our  
own definitions of "code point" or "Unicode character" is an  
extremely bad idea (you'd also have to rename UTF*Point* routines to  
UTF*FPCLikeChar* so they properly indicate the fact that they do not  
deal with code points). Unicode by itself already has enough variations  
to deal with; we will not add our own.

>>> which does not make sense if UTF8PointLength(utfstring_1) is  
>>> smaller than UTF8PointLength(utfstring_2).
>> It does not make any sense under any circumstances, because there  
>> is no way for "UTF8PointSetLength" to know how many bytes it has to  
>> allocate when you pass a value (any value, regardless of where it  
>> comes from) to it.
> If UTF8PointLength(utfstring_1) is greater than  
> UTF8PointLength(utfstring_2) no new bytes need to be allocated
>
> but the function is just equivalent to
>
> utfstring1 := UTF8PointCopy(utfstring1, 1,  
> UTF8PointLength(utfstring_2));
>
> To me this does not seem to impose any problem.

Except if the point is to reserve exactly enough space for utfstring1  
and to overwrite its contents with something else afterwards (using  
move() or whatever). That's a very common use of setlength (at least  
in the FPC run time library, and I guess elsewhere as well). The fact  
that it also doesn't work if the string has to be made longer is  
basically the same problem.
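
As an illustration of why a code point count cannot tell setlength how much to allocate, here is a minimal sketch in Python (not FPC code). The helper utf8_point_copy is a hypothetical stand-in for the proposed UTF8PointCopy, assuming valid UTF-8 input and a prefix copy starting at the first code point.

```python
def utf8_point_copy(data: bytes, count: int) -> bytes:
    """Return the first `count` code points of valid UTF-8 `data`.

    Hypothetical stand-in for the proposed UTF8PointCopy(s, 1, count).
    """
    seen = 0
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:  # not a continuation byte: a code point starts here
            if seen == count:
                return data[:i]
            seen += 1
    return data  # fewer than `count` code points: nothing to cut off


# Three strings of exactly three code points each.
samples = ["abc", "\u00e4\u00f6\u00fc", "\u65e5\u672c\u8a9e"]
byte_lengths = [len(s.encode("utf-8")) for s in samples]
print(byte_lengths)
```

Shortening works because the existing bytes say where each code point ends. Reserving room for three code points that are yet to be written does not, since the required size depends on which code points they turn out to be.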

Your system just does not work, and the more examples you give the  
more it falls down, as far as I can see. Please first write a wiki  
page explaining how to deal with all cases, or at least noting which  
cases will not work. Only then is it possible to decide whether or  
not it is both feasible and worthwhile to go through the trouble of  
implementing all this. Without it, I feel I am mainly wasting my time  
writing these mails because it seems you haven't thought it through  
yet at all.

Jonas