[fpc-devel] Unicodestring branch, please test and help fixing

Fri Sep 12 16:55:26 CEST 2008

> Actually, for your example case doesn't matter, as you need to decide if
> "ss" = "ß"

And, this is only valid in German. For all other languages, the result 
must either be false, or undefined.

>> Are there, in Unicode, start-stop markers that denote 'language'?

> I do not know, that was why I said "unused unicode" and "implemented on
> top" (as part of the specific app)

As far as I know, there isn't a language delimiter in Unicode.

> IMHO The discussion splits here between:
> 1) How can this be done in a specific app
> 2) what should fpc provide
>
> as for 2: This would be on top of yet (afaik) missing basic functions
> such as
> Compare using collation x (where collation is given as argument to
> compare, not as part of any string)

I think we're beginning to be on the same page --but, please, can you 
refrain from using the word 'collation'; every time I see that in this 
context, I feel a strong need to open the window and shout "collation 
isn't the most important/used part of a language wrt programming" :)

>> Take this, for example:
>>
>> "if SameText(SomeString, SomeOtherString) then do ..."
>> For this to work properly, in both 'SomeString' and 'SomeOtherString',
>> you need to know which language *each* character belongs to.

> I would rather say:
> "There are special cases where you need/want to know which language"

Yes. And, if we're on our way to make FPC unicode-enabled, we need to 
take these special cases into account. Otherwise, we will likely end up 
with a half baked 'solution'.

> So I do not imply how special or none special those cases are => you do
> not always need to know. (continued below on your example)

Why would I ALWAYS need it? Isn't 'needed when necessary' good 
enough?

> 2) actual compare, you need to "normalize" all strings before comparing,
> then compare the normalized string as bytes.
>
> normalizing means for each char to decide how to represent it. German
> "ae" could be represented as a umlaut for the compare.
> Or (in German text) you expand all umlaute first.

IOW, SameText() and similar stuff must take normalization into account.

But, you do know that 'normalization' is a very rough assumption and 
can land you in some very embarrassing situations.

Here are two words from Turkish.

1) 'sıkıcı' which means 'boring' in English (notice the dotless small 'i's)

2) 'sikici' which means 'fucker' in English

Now, when you normalize these you get 'SIKICI' for both which --then-- 
you would assume to be the same.
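
To make the collapse concrete, here is a minimal sketch. It is written in 
Python only because its built-in uppercasing is language-blind; it is not 
FPC code, and the helper upper_tr is a hypothetical name that handles just 
the two Turkish special cases (i -> İ, ı -> I).

```python
# Language-blind uppercasing, as done by Python's built-in str.upper().
boring = "sıkıcı"   # written with dotless ı (U+0131)
rude = "sikici"     # written with dotted i (U+0069)

blind_boring = boring.upper()
blind_rude = rude.upper()

def upper_tr(text):
    """Hypothetical Turkish-aware uppercasing: i -> İ, ı -> I, rest as usual."""
    return text.replace("i", "İ").replace("ı", "I").upper()

turkish_boring = upper_tr(boring)
turkish_rude = upper_tr(rude)

# Without language knowledge the two words become identical ...
print(blind_boring == blind_rude)
# ... while with Turkish rules they remain different strings.
print(turkish_boring == turkish_rude)
```

The point of the sketch is only that the comparison result depends on 
which casing rules are applied, not on the bytes alone.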

Well.. I'd like to see you (or your boss) when you've come up with all 
those 'fucker's instead of all those 'boring' old farts you were looking 
for :P

[You can probably think of a German --or some other language-- example]

IOW, what I am trying to tell you is that normalization isn't really 
useful --it is, IMO, a stopgap solution along the path of Unicode evolution.

> BUT of course there is no way to deal with the ambiguous "Busstop"

Indeed. For this case, you need to know what language "Busstop" was 
written in.

>>> What I can not do (or what I do not want to do) is to decide which of
>>> them other people do want to use.
>> But, isn't this just that: IOW, you're deciding what other people will
>> NOT want to use if you throw the 'language' attribute (for each char)
>> out of the window..

> True, I am happy to do that. NOT

I am glad we have met :)

> Why? You can always extend this. Store your string in any of the following
> ways
> 1) every 2nd char is a language attribute, not a char
> 2) store the language attributes in a 2nd string, always pass both
> strings around

Of course, these and even more creative hacks could be devised.
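
For illustration, the second of the quoted schemes (a parallel store of 
language attributes that travels with the string) could look roughly like 
this. It is a sketch in Python, not an FPC proposal; TaggedText, its fields 
and the "de"/"tr" tags are all hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class TaggedText:
    """Hypothetical pairing of a string with one language tag per character."""
    chars: str
    langs: list  # langs[i] is the language of chars[i], e.g. "de", "tr" or None

    def __post_init__(self):
        # The two sequences must always be passed around together and stay in step.
        if len(self.chars) != len(self.langs):
            raise ValueError("need exactly one language tag per character")

    def runs(self):
        """Group consecutive characters that share a language tag."""
        out = []
        for ch, lang in zip(self.chars, self.langs):
            if out and out[-1][0] == lang:
                out[-1] = (lang, out[-1][1] + ch)
            else:
                out.append((lang, ch))
        return out

# "Istanbul " tagged as German text, "istanbul" tagged as Turkish text.
mixed = TaggedText("Istanbul istanbul", ["de"] * 9 + ["tr"] * 8)
print(mixed.runs())
```

The cost is visible right away: every string operation has to keep both 
sequences in step.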

The question is, is the language an attribute of a unicode character?

>> SameText('Istanbul', 'istanbul') can only return true when both
>> 'Istanbul' and 'istanbul' are *not* in Turkish/Azerbaijani.

> ok that's what I did not know. But still in most cases it will be fine to do
> SameText('Istanbul', 'istanbul', lGerman)
> SameText('Istanbul', 'istanbul', lTurkish)
> decide at the time of comparing

Well, the prototype I had in mind was:

SameText('Istanbul', 'istanbul', lGerman, lTurkish)

where the defaults for the latter 2 parameters would be lUnknown --this 
way, people who needn't be bothered about these would not even notice.

> If however the info was stored on the string (or char) what if one was
> Turkish, the other German ?

SameText('Istanbul', 'istanbul', lGerman, lTurkish)

This one must return FALSE since, in Turkish, uppercased dotted small 
'i' is DOTTED capital 'i' (i.e. 'İ').

and,

SameText('Istanbul', 'istanbul', lTurkish, lGerman)

will return TRUE since uppercasing both sides results in the same string.
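
A rough sketch of that prototype, assuming the comparison is done by 
uppercasing each side under its own language's rules. It is written in 
Python purely as illustration, not as an FPC API; same_text, upper_in and 
the L_* constants are hypothetical stand-ins for SameText and lUnknown, 
lGerman, lTurkish, and only the Turkish dotted/dotless 'i' rule is modelled.

```python
L_UNKNOWN, L_GERMAN, L_TURKISH = "unknown", "german", "turkish"

def upper_in(text, lang):
    """Uppercase text using (a tiny subset of) the rules of the given language."""
    if lang == L_TURKISH:
        # Turkish: dotted i -> dotted İ, dotless ı -> dotless I.
        text = text.replace("i", "İ").replace("ı", "I")
    return text.upper()

def same_text(a, b, lang_a=L_UNKNOWN, lang_b=L_UNKNOWN):
    """Case-insensitive equality where each argument carries its own language."""
    return upper_in(a, lang_a) == upper_in(b, lang_b)

# Callers who don't care about language never see the extra parameters.
default_case = same_text("Istanbul", "istanbul")

# 'istanbul' is Turkish: its 'i' uppercases to 'İ', so the strings differ.
german_turkish = same_text("Istanbul", "istanbul", L_GERMAN, L_TURKISH)

# 'Istanbul' is Turkish, 'istanbul' is German: both uppercase to 'ISTANBUL'.
turkish_german = same_text("Istanbul", "istanbul", L_TURKISH, L_GERMAN)

print(default_case, german_turkish, turkish_german)
```

Under these assumptions the result depends on which side is tagged as 
Turkish, which is exactly why a single language parameter is not enough.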