[fpc-devel] Unicode proceedings
Michael Schnell
mschnell at lumino.de
Thu Nov 17 12:59:49 CET 2011
On 11/16/2011 05:24 PM, Marco van de Voort wrote:
>
> The original proposal was like (A) but only for base unicode encodings
> (utf8/16 and maybe 32), but went down due to either excess conversions and
> need for overloading. The amount of overloading for the current 3-4
> stringtypes is already a bit much. (short/ansi/wide/unicodestring)
...
This is exactly what I meant to say. (It's a viable definition, but...)
> (B) was a counter proposal floated by Florian. The cons were pretty much
> that you had to guard every encoding sensitive routine (e.g. every API/OS
> call) to enforce the string contained the encoding you expected. Combining
> one and two byte types also cast doubt on the [] operator's performance.
...
This is exactly what I meant to say. (It's a viable definition, but...)
> Then Yury proposed to combine A and B, in retrospect a bit like the current
> Delphi implementation but with one and two byte encodings in one type.
...
Yep. But IMHO the wording I proposed as (C), such as "object-alike",
leads to a more "understandable" definition. In effect it provides
identical (or at least very similar) results and probably a largely
identical implementation, as most of the differences might be
considered "implementation dependent" (not defined by the pure,
documented definition of the behavior), such as what happens with
"intersexual" variables.
> Note that the Delphi2009 definition is theoretically capable of combining one and
> two bytes in one type (like Yury's).
As I don't have such a Delphi please help me to understand:
Is there a general type dedicated for being able to hold any encoding ?
(be it ANSIxyz, UTF-8 or UTF-16) ?
Of course, when assigning something to a "strictly encoded" String (the
type denotes the encoding) the definition of what is supposed to happen
is clear and obvious. Whether the type name or the dynamic encoding of
the target (even if Length=0) is used to decide about a conversion is
an "implementation detail".
Is there a clear definition of what happens if the "general" string
type is the target? Here, IMHO, it would be very hard to understand if
the history of the target variable (i.e. whether a string of some
encoding has been assigned to it before) decided about a conversion.
IMHO a General string type needs to be handled as fully dynamically
encoded, and thus as a target it always needs to take on the source's
encoding.
Such an "assignment" can happen with ":=" and with function calls. With
function calls there are "value" and "var" parameters. All of these
should behave identically; any other behavior would be very hard to
understand.
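The rule argued for here can be modelled in a few lines. The sketch below
is in Python rather than Pascal purely so that it runs as is; Encoded,
GeneralVar, assign and fill_by_var_param are hypothetical names made up
for illustration, not FPC or Delphi API. It shows a general target taking
on the source's encoding with every assignment, no matter what it held
before, and a "var"-style call behaving exactly like ":=".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Encoded:
    """A string payload tagged with the encoding of its bytes."""
    data: bytes
    encoding: str


class GeneralVar:
    """Hypothetical model of a variable of the general string type."""

    def __init__(self):
        self.value = None  # no payload and no defined encoding yet

    def assign(self, source: Encoded) -> None:
        # No conversion: the target adopts payload and encoding of the
        # source, regardless of what was assigned to it before.
        self.value = source


def fill_by_var_param(target: GeneralVar, source: Encoded) -> None:
    # Stands in for passing the variable as a "var" parameter; it is
    # meant to behave exactly like a plain ":=".
    target.assign(source)


g = GeneralVar()
g.assign(Encoded("Grüße".encode("utf-8"), "utf-8"))
first = g.value.encoding
fill_by_var_param(g, Encoded("Grüße".encode("utf-16-le"), "utf-16-le"))
second = g.value.encoding
print(first, second)
```

The second assignment does not convert the UTF-16 payload to the UTF-8
encoding the variable happened to hold earlier; the earlier content has
no influence on the result.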
And on top of this: what is the type "String" ? Of course the general
String type would be an obvious choice, but perhaps (depending on the
implementation) this might result in worse performance in certain cases
of usage and thus some strict (specifically encoded) Type could be
chosen. (In fact I will never again use "String" in any project, but use
a proprietary type defined in some central unit, so that I can at any
time switch centrally to some specific string type.)
> Embarcadero kept the two types separate,
Making a decently clear definition of the behavior (from a user's view)
rather complicated.
> - backwards compatibility (and thus the hurdle to upgrade)
This does not seem to have worked. Everybody I asked who had migrated a
large project to the new strings was very unhappy.
>
> Explain parent-child for explicitely this context. This kind of stuff is
> what I meant with self contained. Don't use terms that you don't fully
> describe elsewhere.
Sorry that I seemingly failed in my attempt to explain what I meant by
pointing out the similarity to the parent-child relationship between
objects.
I just meant a "General" (or "Raw") string type needs to exist that can
hold any encoding and needs no conversion when a strictly encoded
variable is assigned to it (via ":=", value parameter or var parameter).
As with a parent object, it "is" any strictly encoded string type, so
that when it is used as a formal parameter of a function, it can take
any strict string type (and of course the general type, too) without
conversion.
As with an object's runtime type (checked via "is" and "as"), the
encoding of a General string can be detected and handled when
appropriate (e.g. when combining it with another (strict or general)
string, or when assigning it to a strict string variable, which might
require a conversion).
>> the RAW string type and the types supposed to hold a specific encoding.
> Explain RAW.
See above. "General" or "not Strict" would be more appropriate (I took
the term "Raw" from other recent discussions on the issue.)
> Yes, I never really considered (B) a workable solution. It would break
> existing code, and the ways to deal with the other problems was hackish at
> best.
Yep. But I was told by unhappy coders that the new Delphi way breaks a
lot of existing code, as well. So a new FPC way has a chance of being
better. :) This might (or might not) be a way to do this.
> I think the A-B hybrid is better than either A or B. And that is what is
> being implemented.
Yep. Only the definition of its behavior is of course a lot more
complex. In fact with "C" I tried (and failed) to find a proper basic
definition of exactly this.
> Then describe how that should work. What should happen if I pass such marked
> raw string to a function that wants encoding<y>?
I hope I did this some lines up. But better, see below.
>> So IMHO the Parent-Child (alike) relationship between RAW and any other
>> new string type is quite obvious.
> No it is not. And you don't make the situation any clearer by writing yet
> another message without a concrete description (either using specs or with
> examples), and not defining RAW and exactly how the parent-child relation
> works.
I hope I did this some lines up. But better, see below.
>
> I've been doing OOP for 15 years now, but I've no idea whatsoever.
Obviously it was not as good an idea as I thought to compare the
relation between a single "General" and multiple "strictly encoded"
string types with that between a Parent object and multiple Child
objects. But I am not at all against dropping this analogy and just
using a "self-contained" definition. Moreover, I think we agree on
dropping the term "Raw" for the general string type.
So the wording could be similar to:
- There is a General String type that can hold any encoding (and any
width of the code elements)
- There are lots of Strict String types that are supposed to hold
strings in a predefined encoding (somebody else might describe in detail
how these are defined)
- There are the appropriate single-character types corresponding to
all of the above string types
- A variable of the General string type or the General character type
only has a defined encoding if it has previously been the target of an
assignment of a non-empty string or a character with a defined encoding.
- If only strict types are used, the conversion rules are obvious.
- If assigning a value to a General string or character (via ":=",
value or var parameter), no conversion is done.
- There are means to detect the actual encoding of a General string or
character variable.
- If combining any strings/chars with General strings or chars, the
encoding of the General ones is fetched from their embedded dynamic
encoding definition to decide upon conversion.
I hope this is more like what you'd like to see.
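The rules listed above can be modelled as follows. This is a minimal
sketch in Python (chosen only so that it runs as is), not a proposal for
the actual FPC implementation. GeneralString, StrictString, assign and
concat are hypothetical names. The character types are left out, and two
points the list leaves open are filled in by assumption: assigning an
empty string leaves a general variable without a defined encoding, and
the result of a combination is a general string in the encoding of the
first operand that has one.

```python
class GeneralString:
    """Hypothetical general string: payload plus a dynamic encoding tag."""

    def __init__(self):
        self.data = b""
        self.encoding = None  # undefined until something non-empty is assigned

    def assign(self, source):
        # Rule: assigning to a general string never converts.
        # Assumption: an empty source leaves the encoding undefined.
        self.data = source.data
        self.encoding = source.encoding if source.data else None


class StrictString:
    """Hypothetical strict string: the encoding is fixed by the type."""

    def __init__(self, encoding, text=""):
        self.encoding = encoding
        self.data = text.encode(encoding)

    def assign(self, source):
        # Rule: a strict target converts whenever the encodings differ.
        if source.encoding is None or source.encoding == self.encoding:
            self.data = source.data
        else:
            self.data = source.data.decode(source.encoding).encode(self.encoding)


def concat(a, b):
    """Combine two strings; result type and encoding are assumptions."""
    # Rule: the encoding of a general operand is read from its dynamic tag.
    target = a.encoding or b.encoding
    result = GeneralString()
    if target is None:
        return result  # both operands empty and untagged
    result.encoding = target
    result.data = b"".join(
        s.data if s.encoding in (None, target)
        else s.data.decode(s.encoding).encode(target)
        for s in (a, b)
    )
    return result


utf8 = StrictString("utf-8", "Käse")
g = GeneralString()
g.assign(utf8)                    # no conversion, g is now tagged utf-8
wide = StrictString("utf-16-le")
wide.assign(g)                    # converted, because the tags differ
both = concat(g, wide)            # wide is converted back to utf-8
print(g.encoding, both.data.decode(both.encoding))
```

The only place where the dynamic tag is consulted is where a conversion
might be needed: a strict target, or a combination of two operands.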
Note that the recent discussion about how variable passing with RAW /
not RAW strings is implemented might be decided by such a definition.
Note that Delphi seemingly introduced the encoding types $0000 for
"None"/"to be assigned" and $FFFF for "Raw". This allows a string
variable to be "General" or "Raw" with different meanings. How can this
be used to implement proper handling of General variables that hold a
certain encoding but are still strictly General, so that they get a
different encoding with the next assignment? I don't see whether this
helps in implementing the above definition, or whether it contradicts
it and/or introduces nasty ambiguity.
> Of course you can try to create some object based stringtype like C++, but
> then you will have to deal with all its problems, and the fact that Pascal's
> object model is not the same as C++'s. Also stuff that we take for granted
> (like copy-on-write) would be hard.
Of course I agree.
Thanks,
-Michael