[fpc-devel] Unicodestring branch, please test and help fixing

Fri Sep 12 17:07:33 CEST 2008

>> [Note that, here 'TCharacter' isn't necessarily an object; it might as
>> well be a simple record structure.]
>
> AFAIK for most programmers this is not a common task. Most programs need less
> (one language or codepage)

But, when you're talking unicode, codepage is rather meaningless --isn't it?

> or more (phonetic, semantic, statistical search).
> Can you explain, why you think that this particular problem requires compiler
> magic?

See my other reply to Martin Friebe, in another sub thread.

>> Are there, in Unicode, start-stop markers that denote 'language'?
>
> Is it needed?
> Are there any unicode characters whose upper/lower mapping depends on language?

Yes. See my other reply to Martin Friebe, in another sub thread.

>> Take this, for example:
>>
>> "if SameText(SomeString, SomeOtherString) then do ..."
>>
>> For this to work properly, in both 'SomeString' and 'SomeOtherString',
>> you need to know which language *each* character belongs to.
>
> Comparing texts can be done with various meanings. For example: byte comparison,
> simple case insensitive comparison, not literal comparison, compare like this
> library, ....
> Which one do you mean?

Byte comparison isn't what I am worried about.

In every language, there are well-known and (by now) fixed rules that 
apply to string comparison. I am referring to those rules.

>> [...]
>> Here is a simple example for you:
>>
>> "if SameText('I am on FoolStrasse', 'I am on FoolStraße') then do ..."
>>
>> Now.. how are you going to decide that SameText() function here returns
>> true unless you have information that the substring 'FoolStraße' is in
>> German?
>
> The two strings have the same language, but are written with different
> Rechtschreibung. You need dictionaries and spelling systems to implement such
> comparisons. This is beyond a compiler or a RTL.

Are you sure? I was under the impression that Unicode covers these 
--without needing further data.
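
To illustrate what I mean, here is a minimal sketch in Python (not Pascal, 
and not a proposal for the RTL). It relies only on Unicode's default case 
folding, which Python exposes as str.casefold(); the name same_text is made 
up for this example and is not the RTL's SameText().

```python
def same_text(a: str, b: str) -> bool:
    """Caseless comparison using Unicode default case folding only.

    No language information is consulted; 'ß' folds to 'ss' because the
    Unicode case folding data says so.
    """
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    print(same_text("I am on FoolStrasse", "I am on FoolStraße"))
    print(same_text("Istanbul", "istanbul"))
```

Both calls compare equal under default folding. That covers the 'ß' case 
without a dictionary, but the default folding is language-neutral, so it 
cannot give the Turkish answer for 'Istanbul'.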

> What about loan words?

For all practical purposes, 'loan words' belong to the language they are 
used in.

Except the case where we'd be discussing etymology.

>> SameText('Istanbul', 'istanbul') can only return true when both
>> 'Istanbul' and 'istanbul' are *not* in Turkish/Azerbaijani.
>>
>> Otherwise, the same SameText() has to return false.
>
> I doubt that it is that easy.

Well.. I never said that it would be that easy.

But, if you strip the language attribute off the character, it will be 
impossible --or several orders of magnitude harder-- for those people who 
need it.
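
A rough sketch of why the language has to come from somewhere, again in 
Python and under stated assumptions: the functions fold and same_text_lang 
and the 'tr'/'az' language tags are hypothetical, and the language is passed 
per string rather than per character, which is simpler than what I am 
asking for. The only tailoring shown is the Turkic dotted/dotless I rule.

```python
def fold(text: str, lang: str = "") -> str:
    """Case-fold text, applying the Turkic I rule when lang is 'tr' or 'az'."""
    if lang in ("tr", "az"):
        # Turkic rule: 'I' pairs with dotless 'ı' (U+0131),
        # and dotted 'İ' (U+0130) pairs with 'i'.
        text = text.replace("I", "\u0131").replace("\u0130", "i")
    # Everything else uses Unicode default case folding.
    return text.casefold()


def same_text_lang(a: str, b: str, lang: str = "") -> bool:
    """Caseless comparison with an optional (hypothetical) language tag."""
    return fold(a, lang) == fold(b, lang)


if __name__ == "__main__":
    print(same_text_lang("Istanbul", "istanbul"))        # no language given
    print(same_text_lang("Istanbul", "istanbul", "tr"))  # Turkish rules
    print(same_text_lang("\u0130stanbul", "istanbul", "tr"))
```

With no language the first pair compares equal; with 'tr' it does not, 
while 'İstanbul' and 'istanbul' do. The same bytes give different answers 
depending on information that is not in the string itself.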

You can, of course, ignore all that.

But, then, what is the point of going unicode?

We were just fine doing things ANSI-centric..

Weren't we?