[fpc-devel] Unicodestring branch, please test and help fixing
listmember
listmember at letterboxes.org
Fri Sep 12 15:21:40 CEST 2008
Martin Friebe wrote:
> Just to make sure, all of this discussion is based on various collation
No part of this discussion is based on collation.
> I am going to leave out the object question for now. I said all I can
> say in earlier mails.
That's good. Thank you.
> And also from your comments it appears more a question of collation
> being stored with the string, substring, or even each char.
Martin, are you doing this on purpose? I mean, are you intentionally
driving me up the wall?
Seriously. Can't you forget/drop this 'collation' word?!
And, then, think a little deeper.
Here is a scenario for you:
You have multilanguage text as data. Someone has asked you to search it
and see if a certain piece of string (in a given language) exists in it.
This search needs to be NOT case-sensitive.
How can you do this?
Is it doable if TCharacter (or whatever you call it) has no 'language'
attribute?
[Note that, here 'TCharacter' isn't necessarily an object; it might as
well be a simple record structure.]
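To make the scenario concrete, here is a minimal sketch, written in Python purely for illustration (it is not FPC code and not a proposed API). The names TaggedChar, fold_char, tag and contains_text are hypothetical, as is the use of short language codes like "en" and "tr". The only language-specific rule it knows is the Turkish/Azerbaijani dotted and dotless i; everything else falls back to Python's str.lower().

```python
from dataclasses import dataclass

# Hypothetical per-character record: a character plus a language tag.
@dataclass(frozen=True)
class TaggedChar:
    ch: str    # a single character
    lang: str  # language code such as "en", "de", "tr" (assumed convention)

def fold_char(c: TaggedChar) -> str:
    """Lowercase one character according to its own language tag."""
    if c.lang in ("tr", "az"):
        if c.ch == "I":
            return "\u0131"  # LATIN SMALL LETTER DOTLESS I
        if c.ch == "\u0130":  # LATIN CAPITAL LETTER I WITH DOT ABOVE
            return "i"
    return c.ch.lower()

def tag(text: str, lang: str) -> list:
    """Attach the same language tag to every character of text."""
    return [TaggedChar(ch, lang) for ch in text]

def contains_text(haystack: list, needle: list) -> bool:
    """Case-insensitive search where each character is folded per its language."""
    h = [fold_char(c) for c in haystack]
    n = [fold_char(c) for c in needle]
    return any(h[i:i + len(n)] == n for i in range(len(h) - len(n) + 1))

# Multilanguage data: an English sentence containing a Turkish word.
data = tag("Visit ", "en") + tag("ISTANBUL", "tr") + tag(" today", "en")
```

With this data, searching for tag("today", "en") succeeds, while searching for tag("istanbul", "en") does not, because the Turkish capital I folds to a dotless i. Without the per-character tag, the search has no way to tell these cases apart.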
> As found in the last mail, there is currently no standard for handling
> cross-collation in any string function (that is string function, which
> could be collation based).
> 1) IMHO only few people would need this. For the majority it would be
> unwanted overhead.
> 2) Within those few, there would be too many different expectations as to
> what the "standard" should be. If FPC choose one such standard at will,
> it would benefit almost no one.
You're still stuck with that wretched word 'collation'.
> The best FPC could do is provide storage, for something that is not
> handled or obeyed in any function handling the data. This doesn't sound
> desirable to me. If anyone who needs it will have to implement the
> functions, then those may add their own storage for it too.
>
> Besides instead of storing it per char, you can use unused unicode as
> start/stop markers. So it can be implemented on top of a string that
> stores unicode-chars (and chars only, no attributes)
Are there, in Unicode, start-stop markers that denote 'language'?
>> All the others are not an intrinsic part of a char at all --they
>> vary by context.
> Why is language intrinsic to the text? An "A" is an "A" in any language.
> At best language is intrinsic to sorting/comparing (case or non
> case-sensitive) text
Comparing is a lot more important an operation than collating --or,
rather, collation is achievable only if you can do proper comparisons.
Take this, for example:
"if SameText(SomeString, SomeOtherString) then do ..."
For this to work properly, in both 'SomeString' and 'SomeOtherString',
you need to know which language *each* character belongs to.
If you don't have that information, you might as well not have a
SameText() function in FPC.
>> Please note the 'case-INsensitive' keyword there.
> Well I needed an actual example where case sense differs by language
> (assuming we talk about language using the same charset (not comparing
> Chinese with English).
Here is a simple example for you:
"if SameText('I am on FoolStrasse', 'I am on FoolStraße') then do ..."
Now.. how are you going to decide that SameText() function here returns
true unless you have information that the substring 'FoolStraße' is in
German?
I know that this is a very simple example --that 'ß' exists only in
German, and that you could infer that when you met that char.
But, this highlights the problem --and there are times when you cannot
infer.
> In any case, I can write up several different algorithms how to do that.
Please do. SameText(), for one, will need all the help it can get.
> What I can not do (or what I do not want to do) is to decide which of
> them other people do want to use.
But, isn't this just that: IOW, you're deciding what other people will
NOT want to use if you throw the 'language' attribute (for each char)
out of the window..
> Or, if this is not what you think of, please clarify by example..
Here is another typical example:
SameText('Istanbul', 'istanbul') can only return true when both
'Istanbul' and 'istanbul' are *not* in Turkish/Azerbaijani.
Otherwise, the same SameText() has to return false.
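As a rough illustration of that behaviour, here is a small sketch in Python (not FPC code). The function names lower_for and same_text and the lang parameter are hypothetical, and the sketch applies one language to the whole string, which is a simplification of the per-character attribute argued for above. It only handles the Turkish/Azerbaijani dotted and dotless i and otherwise relies on Python's str.lower().

```python
def lower_for(text: str, lang: str) -> str:
    """Lowercase text using the casing rules of the given language (sketch)."""
    if lang in ("tr", "az"):
        # Turkish/Azerbaijani: capital I pairs with dotless i (U+0131),
        # capital dotted I (U+0130) pairs with ordinary i.
        text = text.replace("I", "\u0131").replace("\u0130", "i")
    return text.lower()

def same_text(a: str, b: str, lang: str) -> bool:
    """Case-insensitive equality under the casing rules of lang."""
    return lower_for(a, lang) == lower_for(b, lang)

english = same_text("Istanbul", "istanbul", "en")        # not Turkish
turkish = same_text("Istanbul", "istanbul", "tr")        # Turkish rules
turkish_dotted = same_text("\u0130stanbul", "istanbul", "tr")
```

Under these assumptions, english is True and turkish is False, while the dotted capital in turkish_dotted does match under Turkish rules.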