[fpc-devel] Unicodestring branch, please test and help fixing

Fri Sep 12 16:12:42 CEST 2008

listmember wrote:
> Martin Friebe wrote:
>> Just to make sure, all of this discussion is based on various collation
> No part of this discussion is based on collation.
Ok, so we were talking about different things....
>
> Here is a scenario for you:
>
> You have multilanguage text as data. Someone has asked you to search 
> it and see if a certain piece of string (in a given language) exists 
> in it.
> This search needs to be NOT case-sensitive.
Actually, for your example case doesn't matter, as you need to decide whether 
"ss" = "ß".
> How can you do this?
> Is it doable if TCharacter (or whatever you call it) has no 'language' 
> attribute?
>
As for case-sensitivity: I still do not know of a character (or rather a 
pair of upper- and lower-case chars) that maps differently in some 
languages.
Is there a pair of characters "x" and "X" which should match as 
upper/lower in some languages, but not in others?
^^ ignore, found your example at the end of the mail

Otherwise, how do I understand the case-insensitive part of your 
question? Because if "x" is the lowercase of "X" in *all* languages, 
then I do not need the language-specific info to do the 
case-insensitive compare.

Sorry if I am still missing some point...

> [Note that, here 'TCharacter' isn't necessarily an object; it might as 
> well be a simple record structure.]
Yes we agreed on this part
>>
>> Besides instead of storing it per char, you can use unused unicode as
>> start/stop markers. So it can be implemented on top of a string that
>> stores unicode-chars (and chars only, no attributes)
> Are there, in Unicode, start-stop markers that denote 'language'?
I do not know, that was why I said "unused unicode" and "implemented on 
top" (as part of the specific app)

IMHO the discussion splits here between:
1) How can this be done in a specific app
2) What should FPC provide

As for 2: this would sit on top of basic functions that are (afaik) 
still missing, such as
 Compare using collation x (where the collation is given as an argument 
to Compare, not as part of any string)
>> Why is language intrinsic to the text? An "A" is an "A" in any language.
>> At best language is intrinsic to sorting/comparing (case-sensitive or
>> not) text
>
> Comparing is a lot more important an operation than collating --or, 
> rather, collation is achievable only if you can do proper comparisons.
>
> Take this, for example:
>
> "if SameText(SomeString, SomeOtherString) then do ..."
> For this to work properly, in both 'SomeString' and 'SomeOtherString', 
> you need to know which language *each* character belongs to.
I would rather say:
"There are special cases where you need/want to know which language"

I make no claim about how special or non-special those cases are, only 
that you do not always need to know. (continued below on your example)

>
> If you don't have that information, you might as well not have a 
> SameText() function in FPC.
>
>>> Please note the 'case-INsensitive' keyword there.
>> Well I needed an actual example where case sense differs by language
>> (assuming we talk about languages using the same charset, not comparing
>> Chinese with English).
>
> Here is a simple example for you:
>
> "if SameText('I am on FoolStrasse', 'I am on FoolStraße') then do ..."
Well, that is a good question: do you always want those to compare as equal?
Consider "Busstop" and "Bußtop" (yes, the second is not a word, but it 
could occur in a text).

Also, in names this comparison does not always apply.

The name "Heiße" (originally with "ß") can be spelled as "Heisse".
But the name "Heisse" (originally with "ss") is never the same as "Heiße".

But since you ask me: this is a specialized comparison, similar to Soundex 
(which compares the sound of two words, usually based on English).
Something like this is usually found in extension libraries, not in 
the standard functionality of many/most languages.

In any case, I think this also has the minority problem. Most people do 
not want to compare Pascal strings this way (if only because of false 
positives).

That does not mean that I say such functionality is not desirable. It 
would be great having a unit that can be used if needed.

Based on the idea that these are optional (or 3rd-party) functions, the 
normal String would not provide for this. (Besides, attaching info to 
each char would probably be too costly, even if implemented in the FPC 
core string.)
Functions like this could take an additional structure declaring the 
start/stop/change point of every language.
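
A minimal sketch of such a side structure, written in Python purely for illustration. The names LanguageRun and language_at and the language tags "de"/"tr" are hypothetical, not anything FPC provides. The string itself stays plain; the list of runs is passed alongside it.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageRun:
    start: int      # index of the first character covered by this run
    language: str   # hypothetical language tag, e.g. "de" or "tr"


def language_at(runs, index):
    """Return the language of the character at `index`.

    `runs` must be sorted by `start`, and the first run must start at 0.
    """
    starts = [run.start for run in runs]
    return runs[bisect_right(starts, index) - 1].language


text = "Hallo Istanbul"
runs = [LanguageRun(0, "de"), LanguageRun(6, "tr")]  # change point at index 6

print(language_at(runs, 0))  # language of "H"
print(language_at(runs, 6))  # language of "I"
```

The cost is one entry per language change rather than one attribute per char, but every edit to the string has to shift the run start points accordingly.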

>> In any case, I can write up several different algorithms how to do that.
> Please do. SameText(), for one, will need all the help it can get.
The initial comment was based on collation, and basically would have 
been about prioritizing in conflicts.

There are 2 parts:
1) identifying the language.

I would recommend a separate structure, with all language start points. 
It takes some work to maintain, but should work.

Alternatively, use a dynarray instead of a string. Define a record holding 
all the info per char that you need. Overload all operators for your 
dynarray type, so that it behaves as if it were a string.
Dynarrays are refcounted, so you are fine.

2) The actual compare: you need to "normalize" all strings before comparing, 
then compare the normalized strings as bytes.

Normalizing means deciding, for each char, how to represent it. German 
"ae" could be represented as an umlaut for the compare.
Or (in German text) you expand all umlauts first.

BUT of course there is no way to deal with the ambiguous "Busstop".
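
A rough sketch of the "expand first, then compare as bytes" variant, in Python for illustration only. The function names and the expansion table are assumptions for this example, not an existing API, and the table covers only the German letters mentioned here.

```python
def normalize_german(text):
    """Lower-case the text, then expand umlauts and sharp s."""
    expansions = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
    lowered = text.lower()
    return "".join(expansions.get(ch, ch) for ch in lowered)


def same_text_german(a, b):
    """Compare the normalized forms byte by byte (UTF-8)."""
    return normalize_german(a).encode("utf-8") == normalize_german(b).encode("utf-8")


print(same_text_german("I am on FoolStrasse", "I am on FoolStraße"))
print(same_text_german("Busstop", "Bußtop"))
```

Both calls report a match. The second one is the false positive described above: once "ß" has been expanded, nothing in the bytes says whether the original text had "ss" or "ß".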

>
>> What I can not do (or what I do not want to do) is to decide which of
>> them other people do want to use.
> But, isn't this just that: IOW, you're deciding what other people will 
> NOT want to use if you throw the 'language' attribute (for each char) 
> out of the window..
True, I am happy to do that. NOT

Why? You can always extend this. Store your string in either of the 
following ways:
1) every 2nd char is a language attribute, not a char
2) store the language attributes in a 2nd string, always pass both 
strings around
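
The two layouts carry the same information and can be converted into each other. A small Python sketch, for illustration only: the one-char language codes ("d" for German, "t" for Turkish) and the function names are made up for this example.

```python
def interleave(text, langs):
    """Layout 1: every 2nd char of the result is a language attribute."""
    if len(text) != len(langs):
        raise ValueError("need exactly one language code per char")
    return "".join(ch + tag for ch, tag in zip(text, langs))


def split_interleaved(packed):
    """Layout 2: recover the plain text and the parallel attribute string."""
    return packed[0::2], packed[1::2]


packed = interleave("Hi", "dt")
print(packed)
print(split_interleaved(packed))
```

Layout 2 keeps the text usable by every ordinary string function, at the price of having to keep the two strings in step.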

>
>> Or, if this is not what you think of, please clarify by example..
>
> Here is another typical example:
>
> SameText('Istanbul', 'istanbul') can only return true when both 
> 'Istanbul' and 'istanbul' are *not* in Turkish/Azerbaijani.
OK, that's what I did not know. But still, in most cases it will be fine to do
SameText('Istanbul', 'istanbul', lGerman)
SameText('Istanbul', 'istanbul', lTurkish)

decide at the time of comparing
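
A minimal sketch of what a language-as-argument compare could look like, in Python for illustration. The same_text and fold_case names and the "de"/"tr"/"az" tags are hypothetical and stand in for the lGerman/lTurkish arguments above; only the dotted/dotless i rule is handled.

```python
def fold_case(text, language):
    """Lower-case `text` using the rules of `language`."""
    if language in ("tr", "az"):
        # Turkish/Azerbaijani: "I" pairs with dotless "ı", "İ" pairs with "i".
        text = text.replace("I", "ı").replace("İ", "i")
    return text.lower()


def same_text(a, b, language):
    """Case-insensitive compare; the language is chosen by the caller."""
    return fold_case(a, language) == fold_case(b, language)


print(same_text("Istanbul", "istanbul", "de"))  # German rules
print(same_text("Istanbul", "istanbul", "tr"))  # Turkish rules
print(same_text("İstanbul", "istanbul", "tr"))  # dotted capital İ
```

With German rules the first pair matches. With Turkish rules it does not, because "I" lower-cases to "ı", while the dotted "İstanbul" does match "istanbul".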

If, however, the info were stored on the string (or char), what if one 
was Turkish and the other German?

> Otherwise, the same SameText() has to return false.