[fpc-devel] Unicode proceedings

Michael Schnell mschnell at lumino.de
Tue Nov 15 12:41:06 CET 2011


There have been lots of long-winded and partly fruitless discussions
here on the implementation of the new Unicode-aware string type(s).

IMHO, before deciding on any implementation details, there should be a
_very_explicit_ decision on the general functionality.

I would suggest first deciding between the (IMHO) three sensible,
mutually exclusive variants:


A)
Have the new string type(s) maintain the very strict typing paradigm
Pascal usually imposes. This IMHO asks for multiple explicit string
types such as ANSI_1252_String, UTF8_String, UTF16_String, etc., plus
the appropriate single-character types. This of course would allow for
automatic conversion without ambiguity. As the types are mutually
exclusive, a function call will always perform an automatic conversion
unless an appropriately overloaded function is defined. These string
types of course need not include fields denoting their encoding and
byte count per code element, as both are already known from the type.
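
As a rough illustration of variant A, the following is a minimal
sketch, assuming an FPC version that provides UnicodeString. The names
TUtf8Str, TUtf16Str, Utf16Units and CodeUnitCount are hypothetical, and
records with overloaded ":=" operators merely stand in for what would
be compiler-provided types:

```pascal
program VariantASketch;
{$mode objfpc}{$H+}

type
  // Hypothetical distinct types; wrapping the data in records keeps
  // them mutually incompatible unless a conversion is declared.
  TUtf8Str = record
    Bytes: UTF8String;
  end;
  TUtf16Str = record
    Units: UnicodeString;
  end;

// The only paths between the two types are these conversions.
operator := (const S: TUtf8Str) R: TUtf16Str;
begin
  R.Units := UTF8Decode(S.Bytes);
end;

operator := (const S: TUtf16Str) R: TUtf8Str;
begin
  R.Bytes := UTF8Encode(S.Units);
end;

// No UTF-8 overload exists, so a TUtf8Str argument gets converted.
function Utf16Units(const S: TUtf16Str): Integer;
begin
  Result := Length(S.Units);
end;

// One overload per type, so no conversion is needed for either.
function CodeUnitCount(const S: TUtf8Str): Integer; overload;
begin
  Result := Length(S.Bytes);
end;

function CodeUnitCount(const S: TUtf16Str): Integer; overload;
begin
  Result := Length(S.Units);
end;

var
  Narrow: TUtf8Str;
  Wide: TUtf16Str;
begin
  Narrow.Bytes := 'hello';
  Wide := Narrow;                  // conversion on assignment
  WriteLn(Utf16Units(Narrow));     // conversion at the call
  WriteLn(CodeUnitCount(Narrow));  // UTF-8 overload, no conversion
  WriteLn(CodeUnitCount(Wide));    // UTF-16 overload, no conversion
end.
```

Note that the records hold no encoding tag at all: which conversion to
run is decided purely from the static types at compile time.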

B)
Provide only a single string type that is properly dynamically typed.
These strings of course will include fields denoting their encoding and
byte count per code element. Here conversions will happen when two
differently encoded strings are combined in some operation. An empty
string would be handled as having no predefined encoding, so that
combining it with any other string will not force a conversion. As there
is only one string type, function calls will never trigger a conversion
and the encoding of the function result is not predefined. Of course a
single character type is defined that can hold any encoding and
presumably will be done in a way that it in fact is dynamically encoded
as well. To enforce that a string is provided in some definite encoding,
an appropriate function (or compiler magic) is available.
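
Variant B could be sketched roughly as follows. This is only an
illustration under assumptions: TDynStr, TStrEncoding, AsEncoding and
the other names are hypothetical, only UTF-8 and UTF-16 are modelled,
and the choice to convert the right operand to the left operand's
encoding is an arbitrary one made for the sketch:

```pascal
program VariantBSketch;
{$mode objfpc}{$H+}

type
  // Hypothetical encoding tags; seNone marks the empty string.
  TStrEncoding = (seNone, seUtf8, seUtf16);

  // The single string type: the encoding travels with the value.
  TDynStr = record
    Encoding: TStrEncoding;
    ElemSize: Integer;         // bytes per code element: 0, 1 or 2
    Utf8Data: UTF8String;      // payload when Encoding = seUtf8
    Utf16Data: UnicodeString;  // payload when Encoding = seUtf16
  end;

function EmptyDynStr: TDynStr;
begin
  Result.Encoding := seNone;
  Result.ElemSize := 0;
  Result.Utf8Data := '';
  Result.Utf16Data := '';
end;

function FromUtf8(const S: UTF8String): TDynStr;
begin
  Result := EmptyDynStr;
  Result.Encoding := seUtf8;
  Result.ElemSize := 1;
  Result.Utf8Data := S;
end;

function FromUtf16(const S: UnicodeString): TDynStr;
begin
  Result := EmptyDynStr;
  Result.Encoding := seUtf16;
  Result.ElemSize := 2;
  Result.Utf16Data := S;
end;

// Explicitly enforce a definite encoding.
function AsEncoding(const S: TDynStr; Target: TStrEncoding): TDynStr;
begin
  if (S.Encoding = Target) or (S.Encoding = seNone)
     or (Target = seNone) then
    Exit(S);
  if Target = seUtf16 then
    Result := FromUtf16(UTF8Decode(S.Utf8Data))
  else
    Result := FromUtf8(UTF8Encode(S.Utf16Data));
end;

// Combining strings converts only when the encodings differ.
operator + (const A, B: TDynStr) R: TDynStr;
var
  Rhs: TDynStr;
begin
  if A.Encoding = seNone then begin R := B; Exit; end;
  if B.Encoding = seNone then begin R := A; Exit; end;
  Rhs := AsEncoding(B, A.Encoding);
  if A.Encoding = seUtf8 then
    R := FromUtf8(A.Utf8Data + Rhs.Utf8Data)
  else
    R := FromUtf16(A.Utf16Data + Rhs.Utf16Data);
end;

var
  Wide, Narrow, Joined: TDynStr;
begin
  Wide := FromUtf16('abc');
  Narrow := FromUtf8('def');
  Joined := EmptyDynStr + Narrow;       // empty operand: no conversion
  WriteLn(Joined.ElemSize);
  Joined := Wide + Narrow;              // mismatch: Narrow is converted
  WriteLn(Joined.ElemSize);
  Joined := AsEncoding(Joined, seUtf8); // enforce a definite encoding
  WriteLn(Joined.ElemSize);
end.
```

Because every function takes and returns the same TDynStr, a call by
itself never converts anything; only the "+" operator and the explicit
AsEncoding call do.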

C)
Handle these strings similarly to Pascal objects that allow for
inheritance (and provide the appropriate operator overloading). So there
is a "Parent" string type (aka RAW) that has no predefined encoding and
multiple "Child" types that each enforce a different encoding. As the
parent type of course needs to include fields denoting its encoding and
byte count per code element, the child types are implemented in the
same way (which in theory allows for "intersexual" strings that feature
data encoded correctly but differently than the type denotes). Here
(for non-intersexual strings) conversion can be handled in an
unambiguous way (using either the static type, if it is not the
"Parent" type, or the dynamically given encoding), and a non-RAW target
type of ":=" might require yet another conversion. A function
definition can either use the Parent type (RAW), and so will not
trigger a conversion when being called, or use one of the Child string
types and trigger an appropriate automatic conversion. Of course,
single-character types matching the Parent and all Child string types
are necessary.
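
The Parent/Child idea of variant C might look roughly like the sketch
below. It rests on several assumptions: TRawStr, TUtf8Str and the
function names are hypothetical, only one Child type is shown, and an
explicit CreateFrom constructor stands in for the automatic conversion
that operator overloading or compiler magic would provide:

```pascal
program VariantCSketch;
{$mode objfpc}{$H+}

type
  TStrEncoding = (seUtf8, seUtf16);  // hypothetical encoding tags

  // "Parent" (RAW) type: carries its encoding dynamically and
  // enforces none.
  TRawStr = class
  private
    FEncoding: TStrEncoding;
    FUtf8: UTF8String;
    FUtf16: UnicodeString;
  public
    constructor CreateUtf8(const S: UTF8String);
    constructor CreateUtf16(const S: UnicodeString);
    function ElemSize: Integer;  // bytes per code element
    property Encoding: TStrEncoding read FEncoding;
  end;

  // "Child" type: same fields, but CreateFrom enforces UTF-8.
  TUtf8Str = class(TRawStr)
  public
    constructor CreateFrom(Src: TRawStr);
  end;

constructor TRawStr.CreateUtf8(const S: UTF8String);
begin
  inherited Create;
  FEncoding := seUtf8;
  FUtf8 := S;
end;

constructor TRawStr.CreateUtf16(const S: UnicodeString);
begin
  inherited Create;
  FEncoding := seUtf16;
  FUtf16 := S;
end;

function TRawStr.ElemSize: Integer;
begin
  if FEncoding = seUtf16 then
    Result := 2
  else
    Result := 1;
end;

// Converts only if the source's dynamic encoding is not UTF-8.
constructor TUtf8Str.CreateFrom(Src: TRawStr);
begin
  if Src.Encoding = seUtf8 then
    inherited CreateUtf8(Src.FUtf8)
  else
    inherited CreateUtf8(UTF8Encode(Src.FUtf16));
end;

// Parent-typed parameter: any string is accepted unconverted.
function RawElemSize(S: TRawStr): Integer;
begin
  Result := S.ElemSize;
end;

// Child-typed parameter: the caller has to supply a UTF-8 string.
function Utf8ByteCount(S: TUtf8Str): Integer;
begin
  Result := Length(S.FUtf8);
end;

var
  Wide: TRawStr;
  Narrow: TUtf8Str;
begin
  Wide := TRawStr.CreateUtf16('hello');
  Narrow := TUtf8Str.CreateFrom(Wide);  // the conversion happens here
  WriteLn(RawElemSize(Wide));      // prints 2, passed unconverted
  WriteLn(RawElemSize(Narrow));    // prints 1, a Child is also a Parent
  WriteLn(Utf8ByteCount(Narrow));  // prints 5
  Narrow.Free;
  Wide.Free;
end.
```

Since TUtf8Str also inherits CreateUtf16, the sketch can produce exactly
the mismatch between declared type and actual data mentioned above; a
real design would have to rule that out or define its behaviour.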

While neither A nor B is Delphi XE compatible in any way, C seems a bit
similar to what Embarcadero does. But AFAIK, Delphi does not provide an
unambiguous, well-defined and understandable paradigm (such as an
object-like Parent/Child relationship) for the features of the different
string types. So the FPC team should be free to come up with a decent
definition.

-Michael


