[fpc-devel] Unicodestring branch, please test and help fixing

Martin Friebe fpc at mfriebe.de
Fri Sep 12 17:37:04 CEST 2008


listmember wrote:
>> IMHO The discussion splits here between:
>> 1) How can this be done in a specific app
>> 2) what should fpc provide
>>
>> as for 2: This would be on top of yet (afaik) missing basic functions
>> such as
>> Compare using collation x (where collation is given as argument to
>> compare, not as part of any string)
> I think we're beginning to be on the same page --but, please, can you 
> refrain from using the word 'collation'; every time I see that in this 
> context, I feel a strong need to open the window and shout "collation 
> isn't the most important/used part of a language wrt programming" :)
Sorry, but I meant comparing with a collation. I did not mean comparing 
within a language context.
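
To illustrate what "collation given as an argument to compare" could look like, here is a minimal sketch in Python (not FPC code). The names `compare`, `codepoint` and `german_phonebook` are made up for this example, and the phonebook rule is a deliberately simplified assumption, not a complete German collation.

```python
from typing import Callable

# A "collation" here is simply a function that turns a string into a sort key.
Collation = Callable[[str], str]

def codepoint(s: str) -> str:
    """Trivial collation: compare by raw code points."""
    return s

def german_phonebook(s: str) -> str:
    """Simplified, assumed rule: expand umlauts and sharp s, ignore case."""
    table = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
    return "".join(table.get(ch, ch) for ch in s.lower())

def compare(a: str, b: str, collation: Collation) -> int:
    """Return -1, 0 or 1. The collation is an argument, not part of the strings."""
    ka, kb = collation(a), collation(b)
    return (ka > kb) - (ka < kb)

print(compare("Müller", "Mueller", codepoint))         # differs by code point
print(compare("Müller", "Mueller", german_phonebook))  # equal under this rule
```

The strings themselves stay plain; only the caller decides which rule applies for a given comparison.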

Language context is too complex to be basic (see Busstop below).
>> 2) actual compare, you need to "normalize" all strings before comparing,
>> then compare the normalized string as bytes.
>>
>> normalizing means for each char to decide how to represent it. German
>> "ae" could be represented as a umlaut for the compare.
>> Or (in German text) you expand all umlaute first.
>
> IOW, SameText() and similar stuff must take normalization into account.
>
> But, you do know that 'normalization' is a very rough assumption and 
> can land you in some very embarrassing situations.
>
> Here are 2 words from Turkish.
>
> 1) 'sıkıcı' which means 'boring' in English (notice the dotless small 
> 'i's)
>
> 2) 'sikici' which means 'fucker' in English
Depends on how you normalize. Normalizing should substitute all *equal* 
letters (or combinations thereof) with one single form. That allows 
comparing and matching them.
But yes, even this is very limited (Busstop), because even if you know 
the language of the word (German in my example) you do not know its meaning.
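
One concrete case of "equal letters in one single form" is Unicode canonical normalization. The sketch below uses Python's standard `unicodedata` module; picking NFC is my assumption for the example, and the helper name `same_letters` is made up.

```python
import unicodedata

def same_letters(a: str, b: str) -> bool:
    """Compare two strings after bringing both into NFC form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "\u00e4"   # 'ä' as a single code point
decomposed = "a\u0308"   # 'a' followed by a combining diaeresis

print(precomposed == decomposed)              # raw comparison sees two different strings
print(same_letters(precomposed, decomposed))  # normalized comparison treats them as equal
print(same_letters("ae", precomposed))        # "ae" is NOT folded into the umlaut by NFC
```

Note that this only covers letters that are equal by definition. Treating "ae" as an umlaut is a language-specific rule on top of it.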

Without a full dictionary, you do not know if "ss" and the German sharp s 
are the same or not.
So basically, what you want to do can only be done with a full 
dictionary. Or you have to accept false positives.

I also fail to see why a UTF-8 string is a half-baked solution. It will 
serve most people fine. It can be extended for those who want more.

IMHO this is a case for an add-on library.
And apparently no one has yet volunteered to write it.

>
> Now, when you normalize these you get 'SIKICI' for both which --then-- 
> you would assume to be the same.
>
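
The collision described in the quote is easy to reproduce with a locale-independent uppercase. A small sketch in Python: `upper_turkish` is a hypothetical helper written for this example, not a library function, and it only handles the two Turkish i letters.

```python
def upper_default(s: str) -> str:
    """Language-neutral uppercase, as most runtimes do by default."""
    return s.upper()

def upper_turkish(s: str) -> str:
    """Assumed Turkish rule: dotted i -> dotted İ, dotless ı -> plain I."""
    return s.replace("i", "\u0130").replace("\u0131", "I").upper()

boring = "s\u0131k\u0131c\u0131"   # 'sıkıcı', written with dotless i
other = "sikici"

print(upper_default(boring) == upper_default(other))  # both collapse to 'SIKICI'
print(upper_turkish(boring) == upper_turkish(other))  # kept apart under the Turkish rule
```

So the result depends on which rule the caller picks, which again is an argument to the operation and not something stored in the string.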
>> BUT of course there is no way to deal with the ambiguous "Busstop"
>
> Indeed. For this case, you need to know what language "Busstop" was 
> written in.
You need a dictionary. Knowing it is German is not enough, because all 
that "it is German" tells you is that "ss" may be a sharp s, but doesn't 
have to be.
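
A rough sketch of why the language alone is not enough, in Python. The function names and the tiny word list are hypothetical and only there for illustration; a real solution would need a full dictionary.

```python
from itertools import product

def sharp_s_candidates(word: str) -> list:
    """All readings of a word where each "ss" may or may not be a sharp s."""
    parts = word.split("ss")
    results = []
    for choice in product(("ss", "ß"), repeat=len(parts) - 1):
        out = parts[0]
        for sep, part in zip(choice, parts[1:]):
            out += sep + part
        results.append(out)
    return results

# Hypothetical stand-in for a full dictionary.
KNOWN_WORDS = {"Busstop", "Straße"}

def resolve(word: str) -> list:
    """Keep only the candidate readings that the dictionary knows."""
    return [c for c in sharp_s_candidates(word) if c in KNOWN_WORDS]

print(sharp_s_candidates("Busstop"))  # the language gives two readings
print(resolve("Busstop"))             # only the dictionary picks one
print(resolve("Strasse"))
```

Knowing the word is German only produces the list of candidates. Choosing between them takes the word list, and without one you have to accept false positives.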
>>>> What I can not do (or what I do not want to do) is to decide which of
>>>> them other people do want to use.
>>> But, isn't this just that: IOW, you're deciding what other people will
>>> NOT want to use if you throw the 'language' attribute (for each char)
>>> out of the window..
>> True, I am happy to do that. NOT
> I am glad we have met :)
Have we? I remember a mail conversation, but not an actual meeting :) SCNR
>> Why, you can always extend this. Store your string in any of the following
>> ways:
>> 1) every 2nd char is a language attribute, not a char
>> 2) store the language attributes in a 2nd string, always pass both
>> strings around
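
Option 2 above could look roughly like this. It is a Python sketch only; the `TaggedText` type and the `tagged` helper are invented for the example and nothing like them is proposed for the RTL.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class TaggedText:
    """A plain string plus a parallel sequence of language tags."""
    text: str
    langs: Tuple[str, ...]   # one language tag per character of text

    def __post_init__(self):
        if len(self.text) != len(self.langs):
            raise ValueError("need exactly one language tag per character")

    def __add__(self, other: "TaggedText") -> "TaggedText":
        # Both parts are always passed around and concatenated together.
        return TaggedText(self.text + other.text, self.langs + other.langs)

def tagged(text: str, lang: str) -> TaggedText:
    """Tag every character of text with the same language."""
    return TaggedText(text, (lang,) * len(text))

s = tagged("Bus", "de") + tagged("stop", "en")
print(s.text, s.langs[2], s.langs[3])
```

The base string type stays untouched, and only code that cares about language pays for the extra storage.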
>
> Of course, these and even more creative hacks could be devised.
> The question is, is the language an attribute of a unicode character?
(I assume "mandatory attribute")

Well, as much as it is or is not an attribute of a Latin-1 or ISO-whatever 
char.

I do not think it is. I have no proof. But a lot of people seem to think 
so: if I google Unicode (or any other char/Latin/ISO...) I get nice 
character tables, and no language info.



