[fpc-devel] Unicodestring branch, please test and help fixing

listmember listmember at letterboxes.org
Fri Sep 12 19:23:00 CEST 2008


> Sorry, but I meant comparing with collation. I did not mean comparing
> within language context.

How can you do /proper/ collation while ignoring the language context?

>> 1) 'sıkıcı' which means 'boring' in English (notice the dotless small
>> 'i's)
>>
>> 2) 'sikici' which means 'fucker' in English

> Depends how you normalize. Normalization should substitute all *equal*
> letters (or combinations thereof) into one single form. That allows
> comparing and matching them.

Again, we're not quite on the same page here...

What you're referring to is more like 'Text Normalization' [ 
http://en.wikipedia.org/wiki/Text_normalization ] where you do 
definitely need a very comprehensive dictionary so that '1' equals 
'one' and '1st' equals 'first', etc. (if your language is English).

Whereas, what I am referring to is 'Unicode Normalization' [ 
http://en.wikipedia.org/wiki/Unicode_normalization ].

This one is much narrower in scope. It deals basically with what I can 
refer to as 'character glyphs'.

Now, from what I understand from the definitions of 'Unicode 
Normalization' there are 2 ways of doing it:

1) You decompose both texts (so that you have all 'weird' characters 
expanded into their combining characters)

2) You compose both texts (so that you have as few combining 
characters as possible, or none at all)

This is done, obviously, to get them both in the same format --to make 
them easier to compare.
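These two operations correspond to Unicode's NFD (decomposition) and NFC (composition) normalization forms. A minimal sketch in Python's standard unicodedata module (Python chosen purely for illustration, this is not FPC code):

```python
import unicodedata

s = "caf\u00e9"  # 'café' with the precomposed code point U+00E9

nfd = unicodedata.normalize("NFD", s)    # 1) decompose: 'e' + U+0301 COMBINING ACUTE
nfc = unicodedata.normalize("NFC", nfd)  # 2) compose: back to the single U+00E9

print(len(s), len(nfd), len(nfc))  # 4 5 4 -- NFD adds a combining character
print(nfc == s)                    # True
```

Either form works for comparison, as long as both texts are brought to the *same* one.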

If you do no other operation on these two texts before you compare them, 
this is called a Canonical Equivalence Test --each 'character glyph' in 
each text must be the same.

For a Canonical Equivalence Test, you do not need any 'language' 
attribute --after all, you're doing a simple byte-wise test.
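A canonical equivalence test is then just: normalize both strings to the same form, then compare code points. A sketch in Python (the helper name `canonically_equal` is mine, not a library function):

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    # Bring both strings to the same normalization form (NFC here),
    # then do a plain code-point-wise comparison.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(canonically_equal("\u00f1", "n\u0303"))  # True: same 'character glyph'
print(canonically_equal("sıkıcı", "sikici"))   # False: dotless ı is not i
```

Note that no language attribute appears anywhere: the Turkish pair above stays distinct purely at the code-point level.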

On the other hand, if you wish to do a broader comparison --a 
Compatibility Equivalence Test or something else-- you will need to do a 
little more work on those texts:

Normalization is one of them. I suggest you take a look at the 
'Normalization' heading under 
http://en.wikipedia.org/wiki/Unicode_normalization

The trouble with the 'Normalization' described there is that it is far 
too crude for quite a lot of purposes.

A better form of comparison is converting both texts to either uppercase 
or to lowercase.

And, once we do this, we hit two walls (or obstacles) to overcome. The 
steps I can think of are:

1) Equivalent code points. We need first to 'compose' the text and then 
substitute the relevant (and preferred) equivalent code points for any 
'character glyph's in the texts.

2) We also need to take care of things like language-dependent case 
transforms. See http://en.wikipedia.org/wiki/Turkish_dotted_and_dotless_I

As far as I know, this is the only 'proper' way to handle search and 
comparison operations under Unicode.
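To illustrate why a language-blind lowercase transform is not enough, here is a small Python sketch; `turkish_lower` is a hypothetical helper of mine, not a library function:

```python
def turkish_lower(s: str) -> str:
    # Language-dependent case transform for Turkish:
    # 'I' lowercases to dotless 'ı' (U+0131),
    # and dotted 'İ' (U+0130) lowercases to 'i'.
    return s.replace("I", "\u0131").replace("\u0130", "i").lower()

print("SIKICI".lower())         # 'sikici' -- the language-blind result
print(turkish_lower("SIKICI"))  # 'sıkıcı' -- the correct Turkish word
```

The language-blind transform turns 'boring' into quite a different word, which is exactly the hazard discussed above.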

I know it will be slower, but, that is the price to pay.

Note: The reason I used the term 'character glyphs' is because several 
code points can be combined to make a 'character glyph'.

See the definition of Code Point [ http://unicode.org/glossary/ ] which 
says:

"Code Point: Any value in the Unicode codespace; that is, the range of 
integers from 0 to 10FFFF16."

As an example, from the above Wiki article, we can use 2 code points to 
produce a 'character glyph', such as

'n' + '~' --> ñ
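In Python this composition can be checked directly with the standard unicodedata module (again, just for illustration):

```python
import unicodedata

combined = "n\u0303"  # 'n' followed by U+0303 COMBINING TILDE: two code points
single = "\u00f1"     # U+00F1 LATIN SMALL LETTER N WITH TILDE: one code point

# NFC composition turns the two-code-point sequence into the single one:
print(unicodedata.normalize("NFC", combined) == single)  # True
print(len(combined), len(single))                        # 2 1
```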

> But yes, even this is very limited (Busstop), because even if you know
> the language of the word (German in my example) you do not know its
> meaning.

You do not worry about the meaning at all. In all languages (I guess) 
there are several words that may be written the same but mean different 
things.

> Without a full dictionary, you do not know if ss and german-sharp-s are
> the same or not.

True. But, if you do know it is in German, then you definitely know they 
are. And, this makes a lot of difference.
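As a quick illustration: Unicode's full case folding does map the German sharp s to 'ss', and Python's str.casefold() implements exactly that mapping (though, to be fair, it applies it regardless of language):

```python
# Unicode full case folding maps the German sharp s (U+00DF) to 'ss',
# so case-folded comparison treats 'Straße' and 'strasse' as equal:
print("Stra\u00dfe".casefold() == "strasse".casefold())  # True

# Plain lowercasing does not -- it leaves the sharp s alone:
print("Stra\u00dfe".lower())  # 'straße'
```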

> So basically what you want to do, can only be done with a full
> dictionary. Or you have to accept false positives.

Nope. No false positives at the text level.

You can always, of course, get false positives at the semantic level 
--such as when you're looking for 'apple' (the fruit) and find 'Apple' 
(the brand name)--, but that's a completely different problem.

> I also fail to see why a utf8 string is a half baked solution. It will
> serve most people fine. It can be extended for those who want more.

I have nothing against UTF-8 or any other encoding scheme. It is just 
that --an encoding scheme. Most handy as a means of transporting data 
from one medium/app to another.

But UTF-8 does in no way cover the whole of Unicode, nor is it a 
complete solution for dealing with Unicode. It is, after all, an 
encoding scheme.

>>> BUT of course there is no way do deal with the ambitious "Busstop"

Not even if you knew that "Busstop" was a German string?

>> Indeed. For this case, you need to know what language "Busstop" was
>> written in.
> you need a dictionary. knowing it is German is not enough. because all
> that "it is German" tells you is that "ss" may be a sharp-s, but doesn't
> have to be

A dictionary, then, wouldn't help you either because all it could tell 
you is that it could be either a loan word or a native word.

>>> True, I am happy to do that. NOT
>> I am glad we have met :)
> have we? I remember a mail conversation, but not an actual meeting :) SCNR

Well we haven't met face to face; but (in this discussion) we seem to 
have met at a common point. :)

>> Of course, these and even more creative hacks could be devised.
>> The question is, is the language an attribute of a unicode character?
> (I assume "mandatory attribute")
>
> Well as much as it is or is not an attribute of a latin1 or iso-whatever
> char.

Well.. Does it have to be Latin1?

I keep giving you Turkish examples.

And, Turkish is --hold on to your seats-- Latin5, a.k.a. ISO-8859-9 [ 
http://en.wikipedia.org/wiki/ISO-8859-9 ]

> I do not think it is. I have no proof. But a lot of people seem to think
> so; if I google Unicode (or any other char/latin./iso...) I get nice
> character tables, and no language info.

See the link above [ http://en.wikipedia.org/wiki/ISO-8859-9 ].

For some reason, they felt the need to say "ISO 8859-9, also known as 
Latin-5 or 'Turkish'" :)

Similarly, for ISO-8859-7, [ http://en.wikipedia.org/wiki/ISO_8859-7 ],
they had to say this: "ISO 8859-7, also known as Greek".

Do you still think character sets are independent of languages?

Question:

Does the fact that there is something called 'Unicode' mean we have 
invented a whole new language that rules them all, or does it just mean 
that it is a pool of all (or most) known alphabets?

If it is the latter, you still need to know what language a given 
piece of string in that alphabet soup is in.





More information about the fpc-devel mailing list