[fpc-devel] Unicode proceedings
Michael Schnell
mschnell at lumino.de
Tue Nov 15 12:41:06 CET 2011
There have been many long-winded and partly rather fruitless
discussions here on the implementation of the new Unicode-aware
string type(s).
IMHO, before deciding on any implementation details, there should be
a _very_explicit_ decision on the general functionality.
I would suggest first choosing between the (IMHO) three sensible,
mutually exclusive variants:
A)
Have the new string type(s) maintain the very strict typing paradigm
Pascal usually imposes. This IMHO calls for multiple explicit string
types such as ANSI_1252_String, UTF8_String, UTF16_String, etc., plus
the matching single-character types. This of course allows for
automatic conversion without ambiguity. As the types are distinct, a
function call will always perform an automatic conversion when the
argument type differs from the parameter type, unless an appropriate
overload is defined. These string types of course do not include
fields denoting their encoding and byte count per code element.
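To make variant A concrete, here is a minimal sketch of its semantics. It is
written in Python purely for illustration, not as proposed FPC syntax. The
class names, the shared helper base class and the functions convert() and
utf8_byte_count() are all hypothetical. The point it tries to show is that
the encoding belongs to the type, not to the string value, and that a
conversion happens at the call boundary.

```python
class TypedString:
    """Helper base for this sketch only (shared code, not a usable parent type)."""
    ENCODING = None  # fixed per concrete type; there is no per-value encoding field

    def __init__(self, data: bytes):
        self.data = data  # the payload is the only per-value state

    @classmethod
    def from_text(cls, text: str):
        return cls(text.encode(cls.ENCODING))

    def to_text(self) -> str:
        return self.data.decode(self.ENCODING)


class Ansi1252String(TypedString):
    ENCODING = "cp1252"


class Utf8String(TypedString):
    ENCODING = "utf-8"


class Utf16String(TypedString):
    ENCODING = "utf-16-le"


def convert(value, target_type):
    """Stand-in for the conversion a compiler would insert automatically."""
    if type(value) is target_type:
        return value  # same static type: nothing to do
    return target_type.from_text(value.to_text())


def utf8_byte_count(s):
    # The parameter is meant to be a Utf8String. A statically typed compiler
    # would convert at the call site; here it is done explicitly on entry.
    s = convert(s, Utf8String)
    return len(s.data)
```

Passing an Ansi1252String to utf8_byte_count() converts it first, while
passing a Utf8String does not. An overload taking Ansi1252String directly
would avoid the conversion, which is the "unless an appropriate overload is
defined" case.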
B)
Provide only a single string type that is properly dynamically typed.
These strings of course include fields denoting their encoding and
byte count per code element. Conversions happen when two differently
encoded strings are combined in some operation. An empty string is
handled as having no predefined encoding, so combining it with any
other string does not force a conversion. As there is only one string
type, function calls never trigger a conversion, and the encoding of
a function result is not predefined. Of course a single character
type is defined that can hold any encoding and presumably is itself
dynamically encoded as well. To enforce that a string is provided in
some definite encoding, an appropriate function (or compiler magic)
is available.
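Variant B could behave roughly like the following sketch, again in Python
for illustration only. DynString, require() and the CODE_UNIT_SIZE table are
hypothetical names. The rule that the left operand's encoding wins when two
differently encoded strings are combined is an assumption of this sketch,
since the description above does not say which side should be converted.

```python
from dataclasses import dataclass
from typing import Optional

# Bytes per code unit for the few encodings used in this sketch.
CODE_UNIT_SIZE = {"cp1252": 1, "utf-8": 1, "utf-16-le": 2}


@dataclass(frozen=True)
class DynString:
    data: bytes = b""
    encoding: Optional[str] = None  # None: empty string without an encoding

    @property
    def code_unit_size(self) -> Optional[int]:
        # Derived here for brevity; a real layout would likely store it.
        return CODE_UNIT_SIZE[self.encoding] if self.encoding else None

    @classmethod
    def from_text(cls, text: str, encoding: str):
        if not text:
            return cls()  # empty strings carry no encoding
        return cls(text.encode(encoding), encoding)

    def to_text(self) -> str:
        return self.data.decode(self.encoding) if self.encoding else ""

    def require(self, encoding: str):
        """Stand-in for the function (or compiler magic) forcing an encoding."""
        if self.encoding == encoding:
            return self
        return DynString(self.to_text().encode(encoding), encoding)

    def __add__(self, other):
        if other.encoding is None:
            return self   # empty operand: no conversion
        if self.encoding is None:
            return other  # empty operand: no conversion
        if self.encoding == other.encoding:
            return DynString(self.data + other.data, self.encoding)
        # Assumption of this sketch: the right operand is converted
        # to the encoding of the left operand.
        converted = other.require(self.encoding)
        return DynString(self.data + converted.data, self.encoding)
```

A function taking a DynString never converts its argument. Only operations
such as "+" or an explicit require() do. The sketch does not handle
characters that the target encoding cannot represent, in which case the
conversion raises an error.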
C)
Handle these strings similarly to Pascal objects that allow for
inheritance (and provide the appropriate operator overloading). So
there is a "Parent" string type (aka RAW) that has no predefined
encoding, and multiple "Child" types that each enforce a specific
encoding. As the parent type of course needs to include fields
denoting its encoding and byte count per code element, the child
types are implemented in the same way (which in theory allows for
"intersexual" strings whose data is encoded correctly but differently
from what the type denotes). For non-intersexual strings, conversion
can be handled unambiguously (using either the type, if it is not the
"Parent" type, or the dynamically given encoding), and a non-RAW
target type of ":=" might request yet another conversion. A function
definition can either use the Parent type (RAW), and so will not
trigger a conversion when called, or use one of the Child string
types and trigger the appropriate automatic conversion. Of course
single-character types matching the Parent and all Child string types
are necessary.
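The Parent/Child relationship of variant C might look like the sketch below,
in Python for illustration only. RawString, the DECLARED attribute, assign()
and is_consistent() are hypothetical names. assign() stands in for ":=" with
a typed target. Letting the conversion trust the dynamically stored encoding
rather than the declared type is an assumption of this sketch.

```python
class RawString:
    """Parent type: carries its encoding dynamically and enforces nothing."""
    DECLARED = None  # child types override this with their enforced encoding

    def __init__(self, data: bytes, encoding: str):
        self.data = data
        self.encoding = encoding  # the dynamic field every child inherits

    def to_text(self) -> str:
        return self.data.decode(self.encoding)

    def is_consistent(self) -> bool:
        # False for a string whose stored encoding differs from its type.
        return self.DECLARED is None or self.encoding == self.DECLARED


class Utf8String(RawString):
    DECLARED = "utf-8"


class Utf16String(RawString):
    DECLARED = "utf-16-le"


def assign(target_type, value):
    """Mimics 'target := value' for a target variable of type target_type."""
    if target_type.DECLARED is None:
        return RawString(value.data, value.encoding)  # RAW target: no conversion
    if value.encoding == target_type.DECLARED:
        return target_type(value.data, value.encoding)  # already matching
    # Assumption: convert based on the dynamically stored encoding.
    data = value.to_text().encode(target_type.DECLARED)
    return target_type(data, target_type.DECLARED)


def raw_byte_length(s):
    # Parent-typed parameter: any child is accepted without conversion.
    return len(s.data)


def utf8_byte_length(s):
    # Child-typed parameter: the argument is converted on the way in.
    return len(assign(Utf8String, s).data)
```

Constructing Utf8String(data, "utf-16-le") directly yields the inconsistent
case mentioned above. Because assign() reads the dynamic encoding, assigning
such a value to a Utf8String target repairs it.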
While neither A nor B is Delphi XE compatible in any way, C seems a
bit similar to what Embarcadero does. But AFAIK, Delphi does not
provide an unambiguous, well-defined and understandable paradigm
(such as an object-like Parent/Child relationship) for the features
of the different string types. So the FPC team should be free to come
up with a decent definition.
-Michael