[fpc-devel] Unicodestring branch, please test and help fixing

Fri Sep 12 15:21:40 CEST 2008

Martin Friebe wrote:
> Just to make sure, all of this discussion is based on various collation

No part of this discussion is based on collation.

> I am going to leave out the object question for now. I said all I can
> say in earlier mails.

That's good. Thank you.

> And also from your comments it appears more a question of collation
> being stored with the string, substring, or even each char.

Martin, are you doing this on purpose? I mean, are you intentionally 
driving me up the wall?

Seriously. Can't you forget/drop this 'collation' word?!

And, then, think a little deeper.

Here is a scenario for you:

You have multilanguage text as data. Someone has asked you to search it 
and see if a certain piece of string (in a given language) exists in it.

This search needs to be NOT case-sensitive.

How can you do this?

Is it doable if TCharacter (or whatever you call it) has no 'language' 
attribute?

[Note that, here 'TCharacter' isn't necessarily an object; it might as 
well be a simple record structure.]
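
To make the scenario concrete, here is a minimal sketch in Python (used 
only for brevity; nothing here is FPC code). The names TaggedChar, tag, 
fold and contains are hypothetical, and so is the choice that a match 
requires the language tags to agree. It only special-cases the 
Turkish/Azerbaijani dotted and dotless I, and it does not handle folds 
that expand to more than one character.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedChar:
    ch: str    # a single character
    lang: str  # hypothetical language tag, e.g. "en", "de", "tr"


def tag(text, lang):
    """Tag every character of text with the same language."""
    return [TaggedChar(c, lang) for c in text]


def fold(tc):
    """Case-fold one character according to its language tag."""
    if tc.lang in ("tr", "az"):
        # Turkish/Azerbaijani: 'I' pairs with dotless 'ı', 'İ' with 'i'.
        if tc.ch == "I":
            return "\u0131"
        if tc.ch == "\u0130":
            return "i"
    return tc.ch.casefold()


def contains(haystack, needle):
    """Case-insensitive search over language-tagged characters.

    Assumption: a character only matches a character with the same tag.
    """
    h = [(fold(c), c.lang) for c in haystack]
    n = [(fold(c), c.lang) for c in needle]
    return any(h[i:i + len(n)] == n for i in range(len(h) - len(n) + 1))


text = tag("Welcome to ", "en") + tag("ISPARTA", "tr")
print(contains(text, tag("\u0131sparta", "tr")))  # dotless ı
print(contains(text, tag("isparta", "tr")))       # dotted i
```

Without the lang field, fold() would have no way to choose between the 
two mappings for 'I'.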

> As found in the last mail, there is currently no standard for handling
> cross-collation in any string function (that is string function, which
> could be collation based).
> 1) IMHO only few people would need this. For the majority it would be
> unwanted overhead.
> 2) Within those few, there would be too many different expectations as to
> what the "standard" should be. If FPC chose one such standard at will,
> it would benefit almost no one.

You're still stuck with that wretched word 'collation'.

> The best FPC could do is provide storage, for something that is not
> handled or obeyed in any function handling the data. This doesn't sound
> desirable to me. If anyone who needs it will have to implement the
> functions, then those may add their own storage for it too.
>
> Besides instead of storing it per char, you can use unused unicode as
> start/stop markers. So it can be implemented on top of a string that
> stores unicode-chars (and chars only, no attributes)

Are there, in Unicode, start/stop markers that denote 'language'?

>> All the others are not an intrinsic part of a char at all --they
>> vary by context.

> Why is language intrinsic to the text? An "A" is an "A" in any language.
> At best language is intrinsic to sorting/comparing (case or non
> case-sense) text

Comparing is a lot more important an operation than collating --or, 
rather, collation is achievable only if you can do proper comparisons.

Take this, for example:

"if SameText(SomeString, SomeOtherString) then do ..."

For this to work properly, in both 'SomeString' and 'SomeOtherString', 
you need to know which language *each* character belongs to.

If you don't have that information, you might as well not have a 
SameText() function in FPC.

>> Please note the 'case-INsensitive' keyword there.
> Well I needed an actual example where case sense differs by language
> (assuming we talk about languages using the same charset, not comparing
> Chinese with English).

Here is a simple example for you:

"if SameText('I am on FoolStrasse', 'I am on FoolStraße') then do ..."

Now.. how are you going to decide that SameText() function here returns 
true unless you have information that the substring 'FoolStraße' is in 
German?

I know that this is a very simple example --that 'ß' exists only in 
German, and that you could infer that when you met that char.

But, this highlights the problem --and there are times when you cannot 
infer.
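
The two strings above can be tried out in a few lines of Python (again 
only as an illustration, not FPC). str.lower() and str.casefold() are 
Python built-ins; the two same_text_* helpers are hypothetical names. 
Note that casefold() applies Unicode full case folding, which maps 'ß' 
to 'ss' for every language alike, so it takes no account of whether the 
text really is German.

```python
a = "I am on FoolStrasse"
b = "I am on FoolStraße"


def same_text_simple(x, y):
    # One-to-one lowercasing: 'ß' is already lowercase and stays 'ß'.
    return x.lower() == y.lower()


def same_text_full(x, y):
    # Full case folding: 'ß' becomes 'ss', whatever the language.
    return x.casefold() == y.casefold()


print(same_text_simple(a, b))  # False
print(same_text_full(a, b))    # True
```

Which of the two answers is the right one for a given piece of text is 
the decision that needs the language information.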

> In any case, I can write up several different algorithms how to do that.

Please do. SameText(), for one, will need all the help it can get.

> What I can not do (or what I do not want to do) is to decide which of
> them other people do want to use.

But, isn't this just that: IOW, you're deciding what other people will 
NOT want to use if you throw the 'language' attribute (for each char) 
out of the window..

> Or, if this is not what you think of, please clarify by example..

Here is another typical example:

SameText('Istanbul', 'istanbul') can only return true when both 
'Istanbul' and 'istanbul' are *not* in Turkish/Azerbaijani.

Otherwise, the same SameText() has to return false.
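
A minimal sketch of that behaviour, in Python for illustration only. 
The functions fold_for and same_text, and the "tr"/"az"/"en" language 
codes, are assumptions made for this example; this is not how FPC's 
SameText() is declared. The language is passed per string here to keep 
the sketch short; a per-character attribute would refine the same idea.

```python
def fold_for(text, lang):
    """Case-fold text using the rules of the given language."""
    if lang in ("tr", "az"):
        # Turkish/Azerbaijani: 'I' lowercases to dotless 'ı' (U+0131),
        # and dotted 'İ' (U+0130) lowercases to plain 'i'.
        text = text.replace("I", "\u0131").replace("\u0130", "i")
    return text.casefold()


def same_text(a, b, lang):
    """Case-insensitive equality under the given language's rules."""
    return fold_for(a, lang) == fold_for(b, lang)


print(same_text("Istanbul", "istanbul", "en"))       # True
print(same_text("Istanbul", "istanbul", "tr"))       # False
print(same_text("\u0130stanbul", "istanbul", "tr"))  # True
```

The same pair of strings gives a different result depending only on 
the language argument.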