[fpc-devel] Unicode support in RTL - Roadmap
Jonas Maebe
jonas.maebe at elis.ugent.be
Fri Nov 21 16:56:05 CET 2008
On 21 Nov 2008, at 16:16, Michael Schnell wrote:
>>> So UTF8ElementLength('Ü') would be 2 and UTF8PointLength('Ü')
>>> would be 1.
>> Or 2, depending on whether it's precomposed or decomposed.
> I seem to remember that we discussed this some time ago and the
> result was that the composed (Mac style?)
Decomposed and precomposed have nothing to do with Windows vs Mac OS X
vs Linux vs whatever. They are both equally valid ways to represent
UTF strings and both have their uses (on all platforms). All programs
should also be prepared to deal with them, since you never know what
kind of input you will get.
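
To make the two forms concrete, here is a minimal sketch in Python (used
purely for illustration, it is not FPC code). The helpers
utf8_element_length and utf8_point_length are hypothetical stand-ins for
the UTF8ElementLength and UTF8PointLength routines proposed in this
thread; they are assumptions, not existing RTL functions.

```python
import unicodedata

# The same visible character 'Ü' in its two canonical representations.
precomposed = "\u00DC"   # U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS
decomposed = "U\u0308"   # U+0055 'U' followed by U+0308 COMBINING DIAERESIS


def utf8_element_length(s):
    """Hypothetical UTF8ElementLength: number of UTF-8 code units (bytes)."""
    return len(s.encode("utf-8"))


def utf8_point_length(s):
    """Hypothetical UTF8PointLength: number of Unicode code points."""
    return len(s)  # a Python str is indexed by code point


for label, s in (("precomposed", precomposed), ("decomposed", decomposed)):
    print(label, utf8_element_length(s), utf8_point_length(s))

# Normalization converts between the two forms; both are valid input.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

The precomposed form is 2 bytes and 1 code point, the decomposed form is
3 bytes and 2 code points, and the two strings compare unequal unless
they are normalized first.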
> characters in fact are a single code point (Unicode character) that
> consists of two (maybe more ? ) complete code points that are tied
> together by some special coding, so IMHO it can be considered as a
> single Unicode character in both cases. If this would result in a
> huge table of possibly composed characters I think we would stick to
> the concept of providing decent functionality and restrict ourselves to
> those that are currently used by the "customers" we normally address
> (Mac in Europe and America).
I think you are talking about a different "we". Further, inventing our
own definitions of "code point" or "Unicode character" is an
extremely bad idea (you'd also have to rename UTF*Point* routines to
UTF*FPCLikeChar* so they properly indicate the fact that they do not
deal with code points). UTF by itself already has enough variations to
deal with, we will not add our own.
>>> which does not make sense if UTF8PointLength(utfstring_1) is
>>> smaller than UTF8PointLength(utfstring_2).
>> It does not make any sense under any circumstances, because there
>> is no way for "UTF8PointSetLength" to know how many bytes it has to
>> allocate when you pass a value (any value, regardless of where it
>> comes from) to it.
> If UTF8PointLength(utfstring_1) is greater than
> UTF8PointLength(utfstring_2) no new bytes need to be allocated
>
> but the function is just equivalent to
>
> utfstring1 := UTF8PointCopy(utfstring1, 1,
> UTF8PointLength(utfstring_2));
>
> To me this does not seem to impose any problem.
Except if the point is to reserve exactly enough space for utfstring1
and to overwrite its contents with something else afterwards (using
move() or whatever). That's a very common use of setlength (at least
in the FPC run time library, and I guess elsewhere as well). The fact
that it also doesn't work if the string has to be made longer is
basically the same problem.
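
The allocation problem can be sketched as follows, again in Python for
illustration only. utf8_point_copy is a hypothetical stand-in for the
UTF8PointCopy routine from the quoted mail, and byte_bounds_for_points is
a helper made up for this example. The only fact relied upon is that a
code point takes between 1 and 4 bytes in UTF-8.

```python
def utf8_point_copy(s, start, count):
    """Hypothetical UTF8PointCopy: 1-based start, count in code points."""
    return s[start - 1:start - 1 + count]


def byte_bounds_for_points(n):
    """Smallest and largest UTF-8 byte size that n code points can need."""
    return n, 4 * n


# Four strings, each exactly 3 code points long.
samples = [
    "abc",                    # ASCII: 1 byte per code point
    "\u00e4\u00f6\u00fc",     # 2 bytes per code point
    "\u20ac\u20ac\u20ac",     # EURO SIGN: 3 bytes per code point
    "\U0001F600" * 3,         # outside the BMP: 4 bytes per code point
]
sizes = [len(s.encode("utf-8")) for s in samples]
print(sizes)

# Shrinking works because the existing contents determine the byte size.
shortened = utf8_point_copy("\u00e4b\u20ac", 1, 2)
print(len(shortened.encode("utf-8")))

# Growing, or reserving space to overwrite later, has no such information:
# the count alone only yields a range of possible byte sizes.
print(byte_bounds_for_points(3))
```

All four samples have the same code point length, yet their byte sizes
differ, so a length expressed in code points cannot tell a setlength-style
routine how much memory to reserve for data that has not been written yet.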
Your system just does not work, and the more examples you give the
more it falls down, as far as I can see. Please first write a wiki
page explaining how to deal with all cases, or at least noting which
cases will not work. Only then is it possible to decide on whether or
not it is both feasible and worthwhile to go through the trouble of
implementing all this. Without it, I feel I am mainly wasting my time
writing these mails because it seems you haven't thought it through
yet at all.
Jonas