[fpc-devel] Unicodestring branch, please test and help fixing

Fri Sep 12 13:04:26 CEST 2008

Just to make sure: is all of this discussion based on the various
collations for European languages? Or shall we include Arabic, Chinese
and other languages? Those have their own chars and can be identified
without collation, so they do not need the language info to be
distinguished from European text. (They may still have several
collations, just as a German text can be handled under different
collations.)

listmember wrote:
>> So maybe the design is quite well thought?
>
> Adding a flag field is easy enough --if all you're doing is to do some 
> sort of collation. In that sense, everything is well thought out.
>
> But..
>
> Life becomes very complicated when you begin to do things like FTS 
> (full text search) on a multilanguage text in a DB engine.
>
> Your options, in this case, are very limited:
>   -- Ignore the language issue.
> or
>   -- store each language in a different field (that is if you know how 
> many there will be).
>
> Do you think this is a good solution --or a hack?
>
True, that would be hard to do (in a DB, in Pascal, or in most other
places). But again, this is a very special case, and that is why none of
the frameworks (DB, Pascal, ...) include it. You have to build your own
solution.

At no time did I say (nor, afaik, did anyone else) that you cannot write
your own object-based text-holding classes.
The questions were:
1) should FPC replace the string by an object (like Java)?
2) which additional attributes should be stored with a string (per
string / per char)?

And actually both of those questions can be moved out of the context of
the Unicode implementation, because both of them could also be applied
to the current (char=byte) strings.

I am going to leave out the object question for now; I said all I can
say in earlier mails. Also, from your comments it appears to be more a
question of collation being stored with the string, substring, or even
each char.

As found in the last mail, there is currently no standard for handling
cross-collation in any string function (that is, any string function
that could be collation based).
1) IMHO only a few people would need this. For the majority it would be
unwanted overhead.
2) Within those few, there would be too many different expectations as
to what the "standard" should be. If FPC chose one such standard
arbitrarily, it would benefit almost no one.

The best FPC could do is provide storage for something that no function
handling the data would honor. This doesn't sound desirable to me. If
everyone who needs it has to implement the functions anyway, they may
as well add their own storage for it too.

Besides, instead of storing it per char, you can use unused Unicode
code points as start/stop markers. So it can be implemented on top of a
string that stores Unicode chars (and chars only, no attributes).
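
To illustrate the marker idea, here is a minimal sketch in Python
(chosen only for brevity; nothing here is FPC API). The three marker
code points are an arbitrary pick from the Unicode Private Use Area, and
the names tag, split_runs and strip_markers are invented for this
example. Nested runs are not handled.

```python
# Hypothetical marker code points taken from the Unicode Private Use Area
# (U+E000..U+F8FF). The specific values are an arbitrary assumption.
LANG_START = "\uE000"  # followed by a language tag, then LANG_SEP
LANG_SEP = "\uE001"    # ends the tag; the tagged text begins after it
LANG_STOP = "\uE002"   # ends the tagged run


def tag(text, lang):
    """Wrap text in start/stop markers that carry a language tag."""
    return f"{LANG_START}{lang}{LANG_SEP}{text}{LANG_STOP}"


def split_runs(s, default_lang=None):
    """Return a list of (lang, text) runs from a marked-up string."""
    runs = []
    lang = default_lang
    buf = []

    def flush():
        # Emit the buffered chars as one run in the current language.
        if buf:
            runs.append((lang, "".join(buf)))
            buf.clear()

    i = 0
    while i < len(s):
        ch = s[i]
        if ch == LANG_START:
            flush()
            end = s.index(LANG_SEP, i)  # the tag runs up to the separator
            lang = s[i + 1:end]
            i = end + 1
        elif ch == LANG_STOP:
            flush()
            lang = default_lang
            i += 1
        else:
            buf.append(ch)
            i += 1
    flush()
    return runs


def strip_markers(s):
    """Recover the plain text, dropping all markers and tags."""
    return "".join(text for _, text in split_runs(s))


mixed = tag("Straße", "de") + " and " + tag("street", "en")
print(split_runs(mixed))
# [('de', 'Straße'), (None, ' and '), ('en', 'street')]
print(strip_markers(mixed))
# Straße and street
```

The underlying string type stays a plain sequence of chars; only code
that cares about language needs to know the markers exist.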

>> As for Storing info per string or per char. (Info could be anything:
>> collation, color, style, font, source-of-quote, author, creation-date,
>> file, ....) everyone would like their own. So again FPC shouldn't do it.
>> Or everyone gets all the overhead of what all the others wanted.
> Collation is a function of language.
Right, but language is something you *can* apply to strings; you are
not forced to do so. Strings work very well without language too.
It is the same as you saying "no gui": strings work without display,
and font/style is a function of rendering. Yet I may want to search a
string but only look at chars marked as bold.
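
As a rough sketch of what such user-side storage could look like (in
Python for brevity; Span, AttributedText and find_where are names made
up for this example, not an existing library): the text stays a plain
string, and whatever attributes an application cares about live in a
separate list of spans.

```python
from dataclasses import dataclass


@dataclass
class Span:
    start: int   # index of the first char covered
    end: int     # index one past the last char covered
    attrs: dict  # any application-defined attributes (style, lang, ...)


class AttributedText:
    """A plain string plus a side list of attribute spans."""

    def __init__(self, text, spans=()):
        self.text = text
        self.spans = list(spans)

    def find_where(self, needle, key, value):
        """Start indexes of needle lying wholly inside a span
        whose attrs[key] equals value."""
        hits = []
        for span in self.spans:
            if span.attrs.get(key) != value:
                continue
            pos = self.text.find(needle, span.start, span.end)
            while pos != -1:
                hits.append(pos)
                pos = self.text.find(needle, pos + 1, span.end)
        return hits


doc = AttributedText(
    "bold move, be bold",
    [Span(0, 4, {"style": "bold"}), Span(14, 18, {"lang": "en"})],
)
print(doc.text.count("bold"))                    # 2
print(doc.find_where("bold", "style", "bold"))   # [0]
```

Someone who wants language instead of style uses the same structure
with a different key, and nobody else pays for it.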

Language is an extension to a string, in the same way that rendering
info or source info is. To you language may matter a great deal. To
others, other attributes will matter.
> All the others are not an intrinsic part of a char at all --they
> vary by context.
Why is language intrinsic to the text? An "A" is an "A" in any language.
At best, language is intrinsic to sorting/comparing (case-sensitive or
case-insensitive) text.
>> If pascal doesn't suit the need of a specific task, choose a different
>> tool. Instead of inventing a new pascal.
>
> Thank you for the advice.
> But, instead of jailing this discussion to at best a laterally 
> relevant issue of collation, can I ask you to think for a moment:
> How on earth can you do a case-INsensitive search in *any* given
> string that contains multiple language substrings?
>
> Please note the 'case-INsensitive' keyword there.
Well, I would need an actual example where case sensitivity differs by
language (assuming we are talking about languages that use the same
charset, not comparing Chinese with English).

In any case, I can write up several different algorithms for how to do
that. What I cannot do (or do not want to do) is decide which of them
other people want to use.

Search case-insensitively for 'UP LOW' in ' ups upper lows lower'

with the following attributes:
'UP LOW' is a string in 2 languages.
 The word UP is in a language that defines "U" and "u" as different
letters (they do not merely differ by case; they differ the same way
"a" and "b" do).
 The word LOW is in a language where all letters have lower-case
equivalents (as in English).

'ups' and 'lows' are in a language which has no upper or lower case, so
even in a case-insensitive compare 'U' <> 'u' and 'L' <> 'l'.
'upper' and 'lower' are English.
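
To make the ambiguity concrete, here is a sketch (Python, for brevity)
of three equally defensible matching rules applied to that example.
Everything in it is an assumption made for illustration: the language
codes "xa" and "xc" are invented for the two non-English languages,
"en" stands in for the English-like one, each word is searched on its
own rather than as the phrase 'UP LOW', and spaces are arbitrarily
tagged "en".

```python
# Hypothetical table: does a language treat upper and lower case as the
# same letter? "xa" and "xc" are made-up codes for this example.
FOLDS_CASE = {"en": True, "xa": False, "xc": False}


def tagged(runs):
    """Expand [(lang, text), ...] into a list of (char, lang) pairs."""
    return [(ch, lang) for lang, text in runs for ch in text]


def eq_both(a, b):
    """Case is ignored only if BOTH languages fold case."""
    (ca, la), (cb, lb) = a, b
    if ca == cb:
        return True
    return FOLDS_CASE[la] and FOLDS_CASE[lb] and ca.lower() == cb.lower()


def eq_needle(a, b):
    """The language of the search term (first argument) alone decides."""
    (ca, la), (cb, _) = a, b
    return ca == cb or (FOLDS_CASE[la] and ca.lower() == cb.lower())


def eq_haystack(a, b):
    """The language of the searched text (second argument) alone decides."""
    (ca, _), (cb, lb) = a, b
    return ca == cb or (FOLDS_CASE[lb] and ca.lower() == cb.lower())


def find_all(needle, haystack, eq):
    """Naive search: every index where needle matches under rule eq."""
    n = len(needle)
    return [i for i in range(len(haystack) - n + 1)
            if all(eq(needle[k], haystack[i + k]) for k in range(n))]


needle_up = tagged([("xa", "UP")])
needle_low = tagged([("en", "LOW")])
haystack = tagged([("en", " "), ("xc", "ups"), ("en", " "), ("en", "upper"),
                   ("en", " "), ("xc", "lows"), ("en", " "), ("en", "lower")])

POLICIES = {"both": eq_both, "needle": eq_needle, "haystack": eq_haystack}
results = {name: (find_all(needle_up, haystack, eq),
                  find_all(needle_low, haystack, eq))
           for name, eq in POLICIES.items()}
for name, (up_hits, low_hits) in results.items():
    print(name, up_hits, low_hits)
# both [] [16]
# needle [] [11, 16]
# haystack [5] [16]
```

Three rules, three different answers for the same input, and none of
them is obviously the "right" one.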

How do you think a search / compare should act?

Or, if this is not what you have in mind, please clarify by example.

Martin