[fpc-devel] Unicodestring branch, please test and help fixing

Mattias Gärtner nc-gaertnma at netcologne.de
Fri Sep 12 15:54:58 CEST 2008


Quoting listmember <listmember at letterboxes.org>:

>[...]
> You have multilanguage text as data. Someone has asked you to search it
> and see if a certain piece of string (in a given language) exists in it.
>
> This search needs to be NOT case-sensitive.
>
> How can you do this?
>
> Is it doable if TCharacter (or whatever you call it) has no 'language'
> attribute?
>
> [Note that, here 'TCharacter' isn't necessarily an object; it might as
> well be a simple record structure.]

AFAIK this is not a common task for most programmers. Most programs need less
(one language or codepage) or more (phonetic, semantic, statistical search).
Can you explain why you think that this particular problem requires compiler
magic?

>[...]
> Is there, in Unicode, start-stop markers that denote 'language'?

Is it needed?
Are there any Unicode characters whose upper/lower case depends on the
language?


>[...]
> Comparing is a lot more important an operation than collating --or,
> rather, collation is achievable only if you can do proper comparisons.
>
> Take this, for example:
>
> "if SameText(SomeString, SomeOtherString) then do ..."
>
> For this to work properly, in both 'SomeString' and 'SomeOtherString',
> you need to know which language *each* character belongs to.

Comparing texts can mean various things. For example: byte comparison,
simple case-insensitive comparison, non-literal comparison, comparing the way
some particular library does, ....
Which one do you mean?
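
To make the distinction concrete, here is a minimal sketch of three of these
meanings. It is written in Python purely for illustration and is not FPC RTL
code; the function names are hypothetical, and only standard library calls
(str.lower, str.casefold, unicodedata.normalize) are used.

```python
import unicodedata


def same_bytes(a: str, b: str) -> bool:
    """Byte comparison: equal only if the UTF-8 encodings are identical."""
    return a.encode("utf-8") == b.encode("utf-8")


def same_lowercased(a: str, b: str) -> bool:
    """Simple case-insensitive comparison: lowercase both sides first."""
    return a.lower() == b.lower()


def same_canonical_caseless(a: str, b: str) -> bool:
    """Non-literal comparison: ignore case and canonical (NFD) differences,
    so a precomposed letter matches the same letter plus combining mark."""
    def fold(s: str) -> str:
        nfd = unicodedata.normalize("NFD", s)
        return unicodedata.normalize("NFD", nfd.casefold())
    return fold(a) == fold(b)


precomposed = "R\u00c9SUM\u00c9"       # uppercase, precomposed E-acute
decomposed = "re\u0301sume\u0301"      # lowercase, e + combining acute

print(same_bytes(precomposed, decomposed))               # False
print(same_lowercased(precomposed, decomposed))          # False
print(same_canonical_caseless(precomposed, decomposed))  # True
```

The same pair of strings is "different" or "the same" depending on which
meaning is chosen, and none of the three needs a language tag.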


>[...]
> Here is a simple example for you:
>
> "if SameText('I am on FoolStrasse', 'I am on FoolStraße') then do ..."
>
> Now.. how are you going to decide that SameText() function here returns
> true unless you have information that the substring 'FoolStraße' is in
> German?

The two strings are in the same language, but are written with different
orthographies. You need dictionaries and spelling systems to implement such
comparisons. This is beyond a compiler or an RTL.
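
As a small illustration of where the line falls (a Python sketch, not FPC
code): Unicode's language-independent full case folding, exposed in Python as
str.casefold, happens to map 'ß' to 'ss', so this particular pair compares
equal without any language tag. A genuine spelling variant does not, which is
the part that would need a dictionary. The "Foto"/"Photo" pair is an example
chosen here, not one from the thread.

```python
a = "I am on FoolStrasse"
b = "I am on FoolStraße"

# Plain lowercasing leaves 'ß' unchanged, so the strings still differ.
print(a.lower() == b.lower())        # False

# Full case folding maps 'ß' to 'ss', independent of language.
print(a.casefold() == b.casefold())  # True

# A spelling variant is not a case difference; folding cannot help here.
print("Foto".casefold() == "Photo".casefold())  # False
```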


> I know that this is a very simple example --that 'ß' exists only in
> German, and that you could infer that when you met that char.
>
> But, this highlights the problem --and there are times when you cannot
> infer.
>
> > In any case, I can write up several different algorithms how to do that.
>
> Please do. SameText(), for one, will need all the help it can get.
>
> > What I can not do (or what I do not want to do) is to decide which of
> > them other people do want to use.
>
> But, isn't this just that: IOW, you're deciding what other people will
> NOT want to use if you throw the 'language' attribute (for each char)
> out of the window..

What about loan words?


> > Or, if this is not what you think of, please clarify by example..
>
> Here is another typical example:
>
> SameText('Istanbul', 'istanbul') can only return true when both
> 'Istanbul' and 'istanbul' are *not* in Turkish/Azerbaijani.
>
> Otherwise, the same SameText() has to return false.

I doubt that it is that easy.
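
For reference, the Turkish case can be sketched without a per-character
language attribute by letting the caller say which folding rules apply to the
whole comparison. This is a Python sketch under that assumption, not FPC code;
same_text, fold and the turkic flag are hypothetical names. The two extra
mappings are the Turkic-tailored entries (status T) of Unicode's
CaseFolding.txt. It does not address mixed-language text or loan words.

```python
# Turkic tailoring from Unicode CaseFolding.txt (status T):
#   U+0049 'I' folds to U+0131 (dotless i)
#   U+0130 (capital I with dot above) folds to U+0069 'i'
_TURKIC = str.maketrans({"I": "\u0131", "\u0130": "i"})


def fold(text: str, turkic: bool = False) -> str:
    """Case-fold text; apply the Turkic mappings first if requested."""
    if turkic:
        text = text.translate(_TURKIC)
    return text.casefold()


def same_text(a: str, b: str, turkic: bool = False) -> bool:
    """Caseless comparison; the caller chooses the folding rules."""
    return fold(a, turkic) == fold(b, turkic)


print(same_text("Istanbul", "istanbul"))                    # True
print(same_text("Istanbul", "istanbul", turkic=True))       # False
print(same_text("\u0130stanbul", "istanbul", turkic=True))  # True
```

Here the language is a parameter of the comparison rather than a property of
each character, which is one possible design among several.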

Mattias



