[fpc-devel] Unicode proceedings

Michael Schnell mschnell at lumino.de
Thu Nov 17 12:59:49 CET 2011


On 11/16/2011 05:24 PM, Marco van de Voort wrote:
>
> The original proposal was like (A) but only for base unicode encodings
> (utf8/16 and maybe 32), but went down due to either excess conversions and
> need for overloading.  The amount of overloading for the current 3-4
> stringtypes is already a bit much.  (short/ansi/wide/unicodestring)
...

This is exactly what I meant to say. (It's a viable definition, but...)

> (B) was a counter proposal floated by Florian. The cons were pretty much
> that you had to guard every encoding sensitive routine (e.g.  every API/OS
> call) to enforce the string contained the encoding you expected.  Combining
> one and two byte types also cast doubt on the [] operator's performance.
...

This is exactly what I meant to say. (It's a viable definition, but...)
> Then Yury proposed to combine A and B, in retrospect a bit like the current
> Delphi implementation but with one and two byte encodings in one type.
...

Yep. But IMHO the wording I proposed as (C), such as "object-alike", 
leads to a more "understandable" definition. In effect it provides 
identical (or at least very similar) results and may well lead to an 
identical implementation, as most of the differences might be considered 
"implementation dependent", i.e. not defined by the pure, documented 
definition of the behavior (such as what happens with "intersexual" 
variables).

> Note that the Delphi2009 definition is theoretically capable of combining one and
> two bytes in one type (like Yury's).
As I don't have such a Delphi version, please help me understand:

Is there a general type dedicated to holding any encoding (be it 
ANSIxyz, UTF-8 or UTF-16)?

Of course, when assigning something to a "strictly encoded" String (the 
type denotes the encoding), the definition of what is supposed to happen 
is clear and obvious. Whether the type name or the dynamic encoding of 
the target (even if Length=0) is used to decide about a conversion is an 
"implementation detail".

Is there a clear definition of what happens if the "general" string 
type is the target? Here, IMHO, it would be very hard to understand if 
the history of the target variable (i.e. whether a string of some 
encoding has been assigned to it before) decided about a conversion. 
IMHO a General string type needs to be handled as fully dynamically 
encoded and thus, as a target, always needs to take on the source's 
encoding.

Such "assignment" can happen with ":=" and with function calls. With 
function calls there are "value" and "var" parameters. All of these 
should behave identically; any other behavior would be very hard to 
understand.

And on top of this: what is the type "String"? Of course the general 
String type would be an obvious choice, but perhaps (depending on the 
implementation) this might result in worse performance in certain use 
cases, and thus some strict (specifically encoded) type could be chosen. 
(In fact I will never again use "String" in any project, but use a 
proprietary type defined in some central unit so that I can at any time 
make a central change to some specific string type.)

> Embarcadero kept the two types separate,
Making a decently clear definition of the behavior (from a user's view) 
rather complicated.
> - backwards compatibility (and thus the hurdle to upgrade)
This did not seem to have worked. Everybody I asked who migrated a 
large project to the new strings was very unhappy.
>
> Explain parent-child for explicitly this context. This kind of stuff is
> what I meant with self contained. Don't use terms that you don't fully
> describe elsewhere.
Sorry that I seemingly failed in my intention to clarify what I meant 
by pointing out the similarity to the objects' parent-child 
relationship.

I just meant that a "General" (or "Raw") string type needs to exist that 
can hold any encoding and needs no conversion when a strictly encoded 
variable is assigned to it (via ":=", value parameter or var parameter).
Similarly to a parent object, it "is" any strictly encoded string type, 
so that when it is used as a formal parameter of a function, it can 
- without conversion - take any strict string type (and of course the 
general type, too).
Similarly to an object's runtime type (such as via "is" and "as"), the 
encoding of a General string can be detected and handled when 
appropriate (e.g. when combining it with another (strict or general) 
string, or when assigning it to a strict string variable, a conversion 
might be required).
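
To make this concrete, here is a minimal language-neutral sketch in 
Python (it is not FPC or Delphi code). All names in it (GeneralString, 
strict, accept_general, encoding_of, needs_conversion) are hypothetical 
and only model the behavior described above: a string value carries an 
encoding tag, a routine declared with the General type takes any strict 
value unchanged, and the tag can be inspected at run time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneralString:
    """Hypothetical model: payload bytes plus a dynamic encoding tag."""
    data: bytes
    encoding: str  # e.g. "utf-8", "utf-16-le", "cp1252"


def strict(text: str, encoding: str) -> GeneralString:
    """Model a strictly encoded value; its 'type' fixes the encoding."""
    return GeneralString(text.encode(encoding), encoding)


def accept_general(s: GeneralString) -> GeneralString:
    """A routine with a General formal parameter: no conversion is done,
    the very same value (bytes and tag) is used."""
    return s


def encoding_of(s: GeneralString) -> str:
    """Run-time detection of the encoding, comparable to 'is' on objects."""
    return s.encoding


def needs_conversion(s: GeneralString, target_encoding: str) -> bool:
    """Assigning to a strict target converts only if the tags differ."""
    return s.encoding != target_encoding
```

For example, passing strict("abc", "utf-8") through accept_general 
yields the identical value, and needs_conversion reports True only when 
the target is a strict type with a different encoding.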
>> the RAW string type and the types supposed to hold a specific encoding.
> Explain RAW.
See above. "General" or "not Strict" would be more appropriate (I took 
the term "Raw" from other recent discussions on the issue.)
> Yes, I never really considered (B) a workable solution. It would break
> existing code, and the ways to deal with the other problems was hackish at
> best.
Yep. But I was told by unhappy coders that the new Delphi way breaks a 
lot of existing code, as well. So a new FPC way has a chance of being 
better. :) This might (or might not) be a way to do this.
> I think the A-B hybrid is better than either A or B. And that is what is
> being implemented.
Yep. Only the definition of its behavior of course is a lot more 
complex. In fact with "C" I tried (and failed) to find a proper basic 
definition of exactly this.
> Then describe how that should work. What should happen if I pass such marked
> raw string to a function that wants encoding<y>?
I hope I did this some lines up. But better see below.
>> So IMHO the Parent-Child (alike) relationship between RAW and any other
>> new string type is quite obvious.
> No it is not. And you don't make the situation any clearer by writing yet
> another message without a concrete description (either using specs or with
> examples), and not defining RAW and exactly how the parent-child relation
> works.
I hope I did this some lines up. But better see below.
>
> I've been doing OOP for 15 years now, but I've no idea whatsoever.
Obviously it was not as good an idea as I thought to liken the relation 
between a single "General" and multiple "strictly encoded" string types 
to that between a Parent object and multiple Child objects. But I am 
not at all against dropping this analogy and just using a 
"self-contained" definition. Moreover I think we agree upon dropping 
the term "Raw" for the general string type.

So the wording could be similar to:
  - There is a General String type that can hold any encoding (and any 
width of the code elements)
  - There are lots of Strict String types that are supposed to hold 
strings in a predefined encoding (somebody else might describe in detail 
how these are defined)
  - There are the appropriate single-character types corresponding to 
all of the above string types
  - A variable of the General string type or the General character type 
only has a defined encoding if it has previously been the target of an 
assignment of a non-empty string or a character with a defined encoding.
  - If only strict types are used, the conversion rules are obvious.
  - If assigning a value to a General string or character (via ":=", 
value or var parameter), no conversion is done.
  - There are means to detect the actual encoding of a General String or 
Character variable.
  - If combining any Strings/Chars with General Strings or Chars, the 
encoding of the General ones is fetched from their embedded dynamic 
encoding definition to decide upon conversion.
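
The rules above can be sketched as a small model in Python (again, not 
FPC or Delphi code). GeneralVar, StrictVar and concat are hypothetical 
names. The list does not say which encoding the result of a combination 
gets, so using the left operand's encoding in concat is purely an 
assumption of this sketch.

```python
from typing import Optional, Tuple

Value = Tuple[bytes, str]  # (payload bytes, encoding name)


class GeneralVar:
    """Hypothetical General string variable."""

    def __init__(self) -> None:
        self.value: Optional[Value] = None

    def assign(self, source: Value) -> None:
        # No conversion: take the source's bytes and encoding as they are,
        # regardless of what this variable held before.
        self.value = source

    @property
    def encoding(self) -> Optional[str]:
        # Defined only after a non-empty string has been assigned.
        if self.value is None or not self.value[0]:
            return None
        return self.value[1]


class StrictVar:
    """Hypothetical Strict string variable with a fixed encoding."""

    def __init__(self, encoding: str) -> None:
        self.encoding = encoding
        self.data = b""

    def assign(self, source: Value) -> None:
        data, enc = source
        if enc != self.encoding:
            # Convert only when the source's encoding differs.
            data = data.decode(enc).encode(self.encoding)
        self.data = data


def concat(a: Value, b: Value) -> Value:
    """Combine two strings, using their encoding tags to decide upon
    conversion. ASSUMPTION: the result uses the left operand's encoding."""
    data_b, enc_b = b
    if enc_b != a[1]:
        data_b = data_b.decode(enc_b).encode(a[1])
    return (a[0] + data_b, a[1])
```

In this model a GeneralVar that held UTF-16 data simply becomes UTF-8 
when a UTF-8 value is assigned to it next, while a StrictVar always ends 
up holding bytes in its own encoding.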

I hope this is more like what you'd like to see.

Note that the recent discussion about how variable passing with RAW / 
not RAW strings is implemented might be decided by such a definition.

Note that Delphi seemingly introduced the encoding types $0000 for 
"None"/"to be assigned" and $FFFF for "Raw". This allows for a string 
variable to be "General" or "Raw" with different meanings. How can this 
be used to implement proper handling of General variables that hold a 
certain encoding but still are strictly General, so that they get a 
different encoding with the next assignment? I don't see whether this 
helps in implementing the above definition, or whether it contradicts 
it and/or introduces nasty ambiguity.

> Of course you can try to create some object based stringtype like C++, but
> then you will have to deal with all its problems, and the fact that Pascal's
> object model is not the same as C++'s. Also stuff that we take for granted
> (like copy-on-write) would be hard.
Of course I agree.

Thanks,
-Michael



