[fpc-devel] Unicodestring branch, please test and help fixing
Martin Friebe
fpc at mfriebe.de
Fri Sep 12 13:04:26 CEST 2008
Just to make sure: is all of this discussion based on the various
collations for European languages? Or shall we include Arabic, Chinese and
other languages? Those have their own characters, so they can be identified
without collation and do not need language info to be distinguished from
European text. (They may still have collations, just as a German text
could be handled with different collations.)
listmember wrote:
>> So maybe the design is quite well thought out?
>
> Adding a flag field is easy enough --if all you're doing is some sort
> of collation. In that sense, everything is well thought out.
>
> But..
>
> Life becomes very complicated when you begin to do things like FTS
> (full-text search) on multi-language text in a DB engine.
>
> Your options, in this case, are very limited:
> -- ignore the language issue, or
> -- store each language in a different field (that is, if you know how
> many there will be).
>
> Do you think this is a good solution --or a hack?
>
True, that would be hard to do (in a DB, in Pascal, or in most other
places). But again, this is a very special case, and that is why none of
the frameworks (DB, Pascal, ...) include it; you have to roll your own
solution. At no time did I say (nor, AFAIK, did anyone else) that you
cannot write your own object-based text-holding classes.
The questions were:
1) should FPC replace the string with an object (like Java)?
2) which additional attributes should be stored with a string (per string
/ per char)?
Actually, both of those questions can be taken out of the context of the
Unicode implementation, because both of them could also be applied to the
current (char = byte) based strings.
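As a minimal sketch of question 2 applied to today's char=byte strings
(this is nothing FPC provides; the record and field names are invented
purely for illustration):

program TaggedStringDemo;

{$mode objfpc}{$H+}

type
  { One per-string attribute stored next to the text; nothing in the RTL
    knows or cares about the Language field. }
  TTaggedString = record
    Text: AnsiString;   // the character data itself (char = byte)
    Language: string;   // e.g. 'de_DE'; purely user-defined
  end;

function MakeTagged(const AText, ALanguage: string): TTaggedString;
begin
  Result.Text := AText;
  Result.Language := ALanguage;
end;

var
  S: TTaggedString;
begin
  S := MakeTagged('Strasse', 'de_DE');
  WriteLn(S.Text, ' [', S.Language, ']');
end.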
I am going to leave the object question out for now; I said all I can
say in earlier mails. From your comments it also appears to be more a
question of whether collation is stored with the string, a substring, or
even each char.
As established in the last mail, there is currently no standard for
handling cross-collation in any string function (that is, in any string
function that could be collation-based).
1) IMHO only a few people would need this; for the majority it would be
unwanted overhead.
2) Within those few, there would be too many different expectations as to
what the "standard" should be. If FPC chose one such standard at will, it
would benefit almost no one.
The best FPC could do is provide storage for something that is not
handled or obeyed by any function working on the data. That doesn't sound
desirable to me. If anyone who needs it has to implement the functions
anyway, then they may as well add their own storage for it too.
Besides, instead of storing it per char, you can use otherwise unused
Unicode code points as start/stop markers, so it can be implemented on top
of a string that stores Unicode chars (and chars only, no attributes).
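A minimal sketch of that marker idea, assuming one simply reserves two
private-use code points for it; FPC defines no such convention, and the
names here are made up:

program LangMarkerDemo;

{$mode objfpc}{$H+}

const
  // Two otherwise unused (private-use) code points act as start/stop markers.
  LangStart: UnicodeString = #$E000;
  LangEnd:   UnicodeString = #$E001;

{ Embed a language tag in front of a span; the result is still a plain
  UnicodeString with no per-char attribute storage. }
function TagLanguage(const Tag, Span: UnicodeString): UnicodeString;
begin
  Result := LangStart + Tag + LangEnd + Span;
end;

var
  S: UnicodeString;
begin
  S := TagLanguage('tr', 'ILIK') + ' ' + TagLanguage('en', 'upper');
  // Only code that knows the convention looks for the markers;
  // everything else just sees ordinary characters.
  WriteLn(Pos(LangEnd, S));
end.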
>> As for storing info per string or per char (info could be anything:
>> collation, color, style, font, source-of-quote, author, creation-date,
>> file, ...): everyone would like their own, so again FPC shouldn't do it,
>> or everyone gets all the overhead of what all the others wanted.
> Collation is a function of language.
Right, but language is something you can apply to strings; you are not
forced to do so. Strings work very well without language, too.
Same as you saying "no GUI": strings work without display. Font/style is
a function of rendering, yet I may want to search a string while only
looking at chars marked as bold.
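Purely to illustrate that bold example, a small sketch with the style
info kept in a parallel array outside the string (all names invented):

program BoldSearchDemo;

{$mode objfpc}{$H+}

type
  TCharFlags = array of Boolean;   // True = this char is rendered bold

{ Return the first position of Ch in S that is marked as bold, 0 if none. }
function FindBoldChar(const S: string; const Bold: TCharFlags; Ch: Char): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if (S[i] = Ch) and Bold[i - 1] then
      Exit(i);
end;

var
  Src: string;
  Bold: TCharFlags;
  i: Integer;
begin
  Src := 'abcabc';
  SetLength(Bold, Length(Src));
  for i := 0 to High(Bold) do
    Bold[i] := i >= 3;                      // only the second 'abc' is bold
  WriteLn(FindBoldChar(Src, Bold, 'a'));    // prints 4, not 1
end.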
Language is an extension to a string in the same way that rendering info
or source info is. To you, language may matter a great deal; to others,
other attributes will matter.
> All the others are not an intrinsic part of a char at all --they
> vary by context.
Why is language intrinsic to the text? An "A" is an "A" in any language.
At best, language is intrinsic to sorting/comparing text (including case
handling in non-case-sensitive comparisons).
>> If Pascal doesn't suit the needs of a specific task, choose a different
>> tool instead of inventing a new Pascal.
>
> Thank you for the advice.
> But, instead of confining this discussion to the at best laterally
> relevant issue of collation, can I ask you to think for a moment:
> how on earth can you do a case-INsensitive search in *any* given
> string that contains multiple-language substrings?
>
> Please note the 'case-INsensitive' keyword there.
Well, I would need an actual example where case sensitivity differs by
language (assuming we talk about languages using the same charset, not
comparing Chinese with English).
In any case, I can write up several different algorithms for doing that.
What I cannot do (or do not want to do) is decide which of them other
people want to use.
Search case-insensitively for 'UP LOW' in ' ups upper lows lower',
with the following attributes:
-- 'UP LOW' is a string in 2 languages.
-- The word UP is in a language that defines "U" and "u" as different
letters (they do not merely differ by case; they differ the same way "a"
and "b" do).
-- The word LOW is in a language where all letters have lower-case
equivalents (as in English).
-- 'ups' and 'lows' are in a language which has no upper/lower case at
all, so even in a case-insensitive compare 'U' <> 'u' and 'L' <> 'l'.
-- 'upper' and 'lower' are English.
How do you think a search/compare should act in this case?
Or, if this is not what you have in mind, please clarify with an example.
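Just to illustrate, here is a minimal sketch of one of those possible
algorithms: case folding is delegated to each span's language, so
"insensitive" means something different per span. The types and folding
rules are invented for this example, not anything FPC offers:

program PerLanguageFoldDemo;

{$mode objfpc}{$H+}

type
  // Invented language rules, just enough for this example.
  TLangRule = (lrCaseless,    // no upper/lower pairs: 'U' stays <> 'u'
               lrEnglish);    // plain ASCII case folding

{ Fold one char according to the rule of the language span it belongs to. }
function FoldChar(Ch: Char; Rule: TLangRule): Char;
begin
  Result := Ch;
  if (Rule = lrEnglish) and (Ch in ['A'..'Z']) then
    Result := Chr(Ord(Ch) + 32);
end;

{ Compare two equally long spans, each under its own language rule. }
function SpanMatch(const A, B: string; RuleA, RuleB: TLangRule): Boolean;
var
  i: Integer;
begin
  Result := Length(A) = Length(B);
  if not Result then Exit;
  for i := 1 to Length(A) do
    if FoldChar(A[i], RuleA) <> FoldChar(B[i], RuleB) then
      Exit(False);
end;

begin
  // 'LOW' vs 'low': both English, so folding makes them equal.
  WriteLn(SpanMatch('LOW', 'low', lrEnglish, lrEnglish));   // TRUE
  // 'UP' vs 'up': the pattern's language has no case, so no match.
  WriteLn(SpanMatch('UP', 'up', lrCaseless, lrEnglish));    // FALSE
end.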
Martin