[fpc-devel] Unicode proceedings

Wed Nov 16 17:24:24 CET 2011

In our previous episode, Michael Schnell said:
> > Then there were fully dynamic encoding schemes proposed too (e.g. by
> > Florian).
> I do know this. It seems pros and cons have been discussed, but no 
> real decision has been made (i.e. a strict independent understandable 
> definition of what is supposed to be provided from the view of the user 
> - not from the view of the implementation).

What has not been decided is which of the implemented (D2009+) types will be
"string", and when (depending on mode and/or OS, multiple RTLs or not).

What the new stringtypes will be is not in question: they will be D2009+
compatible to a very high degree.

> > I'm not sure what you meant. I only saw what you wrote. You created three
> > classes of stringtypes, but in reality the implementation is hybrid of the
> > first two (the (C) bit was a bit a joke, since rawbytestring is so limited)
> Sorry if I have not been clear enough. I suggested three (mutually 
> exclusive) alternative ways to define a (supposedly clearly definable 
> and workable) family of String types (OK, "B" suggests a 
> one-member-family).

Well, since you say you read the original discussions, this is how they went
down in my memory:

The original proposal was like (A), but only for the base Unicode encodings
(UTF-8/16 and maybe 32). It went down due to both excess conversions and the
need for overloading.  The amount of overloading for the current 3-4
stringtypes (short/ansi/wide/unicodestring) is already a bit much.

The Delphi2009 system might actually reduce overloads in 2.8 compared to 2.6
because some overloaded ansistring and unicodestring versions can be
combined using rawbytestring.

Moreover, overloading only works for simple functions that take a string. It
doesn't solve other spots where the declaration matters, e.g.  as parameters
of virtual methods, VAR/out parameters, typecasting of stringtypes to
pointer types, the need to add additional, incompatible members to
VARIANT, etc.

(B) was a counter proposal floated by Florian. The cons were pretty much
that you had to guard every encoding sensitive routine (e.g. every API/OS
call) to ensure that the string had the encoding you expected.  Combining
one and two byte types also cast doubt on the [] operator's performance.
Some half-baked requests to let [] return codepoints were swiftly debated
and rejected.
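
To make the guarding problem concrete, here is a rough model of a (B)-style
string. It is written in Python only because that keeps it short and
runnable; DynString, ensure_encoding, os_call_utf16 and char_at are
hypothetical names made up for illustration, not FPC or Delphi types or
routines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DynString:
    """Hypothetical (B)-style string: raw payload plus a runtime encoding tag."""
    data: bytes
    encoding: str  # e.g. "utf-8", "utf-16-le", "cp1252"


def ensure_encoding(s: DynString, wanted: str) -> DynString:
    """The guard: transcode only if the runtime tag differs from what is wanted."""
    if s.encoding == wanted:
        return s
    text = s.data.decode(s.encoding)
    return DynString(text.encode(wanted), wanted)


def os_call_utf16(s: DynString) -> bytes:
    """Stand-in for an OS routine that only understands UTF-16LE.

    Every routine like this has to begin with the guard, because the
    declared type says nothing about the encoding actually held.
    """
    s = ensure_encoding(s, "utf-16-le")
    return s.data


def char_at(s: DynString, index: int) -> str:
    """Model of [] returning a code point: the payload must be decoded first."""
    return s.data.decode(s.encoding)[index]


name = DynString("café".encode("utf-8"), "utf-8")
print(len(name.data))            # 5 bytes in UTF-8
print(len(os_call_utf16(name)))  # 8 bytes in UTF-16LE
print(char_at(name, 3))          # é
```

Forgetting the guard in os_call_utf16 would hand UTF-8 bytes to a routine
that expects UTF-16, and nothing in the declaration would flag it. char_at
shows why a codepoint-returning [] is no longer a plain array access once
variable-width encodings are allowed.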

Some damage control measures wrt the guarding were proposed, mostly based on
directives, which I didn't consider very convincing.

Then Yury proposed to combine A and B, in retrospect a bit like the current
Delphi implementation but with one and two byte encodings in one type.

The discussion then died out, but not long after that, the embargo on Delphi
2009 was lifted, and more details crept out.

Note that the Delphi2009 definition is theoretically capable of combining one and
two bytes in one type (like Yury's). Afaik there is no consensus on why
Embarcadero kept the two types separate, though I can think of several
reasons:

- performance
- backwards compatibility (and thus the hurdle to upgrade)
- While normal code would probably work with a unified type, the large
  amounts of code that typecast strings, mess with temps or use strings as
  buffers would cause problems. Maybe they tried, and it was problematic.

> > As far as I know C has no object type.
> Correct. That is why I wrote "object-alike". This wording only 
> illustrates that there is a kind of Parent-Child relationship between 

Explain parent-child explicitly for this context. This kind of stuff is
what I meant by self-contained. Don't use terms that you don't fully
describe elsewhere. 

> the RAW string type and the types supposed to hold a specific encoding. 

Explain RAW.

> With this in mind, it is obvious that when a function uses a "RAW" as a 
> parameter, no conversion is done, as RAW "is" any specific String type.

> > What you want with the whole (C)
> > branch is a complete mystery to me.
> Obviously a single fully dynamically encoded string type (such as B) is 
> not what everybody (but Florian and myself) wants, (e.g. because EMB 
> decided implementing multiple types with different names).

Yes, I never really considered (B) a workable solution. It would break
existing code, and the ways to deal with the other problems were hackish at
best.

> Obviously a system of hard coded string types (such as A) is not what 
> everybody (but some) wants (e.g. as there would need a lot of such types 
> and because EMB decided implementing dynamic typing).

I think the A-B hybrid is better than either A or B. And that is what is
being implemented.

> Obviously there is a request for variables that don't require a 
> dedicated encoding (RAW) _and_ for variables that define their encoding 
> by the name of their type.

Then describe how that should work. What should happen if I pass such a
marked raw string to a function that wants encoding <y>? 

> So IMHO the Parent-Child (alike) relationship between RAW and any other 
> new string type is quite obvious.

No it is not. And you don't make the situation any clearer by writing yet
another message without a concrete description (either using specs or with
examples), and not defining RAW and exactly how the parent-child relation
works.

> Thus I feel that it is a good idea that - wherever sensible - the 
> behavior of the string types should be similar to that of object 
> regarding the Parent-Child relationship. This is especially true with 
> function parameters. (Recently there has been such a "fruitless" 
> discussion on how the different string types should behave when used as 
> formal/actual parameters)

This is total gibberish to me. 

> With this definition (that should be easily understandable for those who 
> are trained for working with object inheritance) you can use strict 
> typing with clearly defined effects (conversion when different type 
> names are combined), and (using RAW) you can take advantage of the 
> dynamic encoding  (conversion "whenever necessary") with clearly defined 
> effects as well.

I've been doing OOP for 15 years now, but I've no idea whatsoever.

Of course you can try to create some object based stringtype like C++, but
then you will have to deal with all its problems, and the fact that Pascal's
object model is not the same as C++'s. Also stuff that we take for granted
(like copy-on-write) would be hard.

I'm not against debating about that, but then there must be something to
actually debate about, not a mere suggestion that a bit of OOP will make all
pain go away....

> The definition of what happens when RAW and its "children" are combined 
> also is quite obvious:
> 
>   - function parameters:
>     - strict -> RAW: no conversion
>     - RAW -> Strict: conversion whenever necessary

If I don't have the encoding of raw, how can I convert it to strict?
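
The problem can be shown with a few bytes. A minimal sketch, in Python
purely as a model (raw_to_strict is a hypothetical helper, not an RTL
routine): the same untagged bytes give different text, or no valid text at
all, depending on which source encoding is assumed.

```python
def raw_to_strict(raw: bytes, assumed_encoding: str) -> str:
    """Hypothetical RAW -> strict conversion.

    It cannot work without being told the source encoding, so the caller
    has to supply (or guess) one.
    """
    return raw.decode(assumed_encoding)


raw = b"\xc3\xa9"  # two bytes, no encoding information attached

print(raw_to_strict(raw, "utf-8"))   # é
print(raw_to_strict(raw, "cp1252"))  # Ã©

# A byte sequence that is valid in one encoding need not be valid in another.
latin1_raw = b"\xe9"  # "é" in Latin-1 / cp1252
try:
    raw_to_strict(latin1_raw, "utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```

So either RAW carries its encoding along at runtime, or "conversion whenever
necessary" is a guess.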

..... At this point I gave up.