[fpc-devel] Unicode proceedings

Michael Schnell mschnell at lumino.de
Fri Nov 18 11:11:17 CET 2011


On 11/17/2011 02:55 PM, Sven Barth wrote:
> Am 17.11.2011 12:59, schrieb Michael Schnell:
>>> Note that the Delphi2009 definition is theoretically capable of
>>> combining one and
>>> two bytes in one type (like Yury's).
>> As I don't have such a Delphi please help me to understand:
>>
>> Is there a general type dedicated for being able to hold any encoding ?
>> (be it ANSIxyz, UTF-8 or UTF-16) ?
>
> In theory the AnsiString type (which is now the code page aware string 
> type) should be capable of holding UTF-8 and UTF-16 data,
Why should a type that is capable of holding multiple different UTF 
encodings be called "ANSIString"? IMHO this is very counter-intuitive. I 
think FPC should establish a better name (such as "GeneralString" or 
similar). This would not harm Delphi compatibility, as a type alias 
could be provided for it.

> but either the direct unconverted storage of 2 byte data (UTF-16) is 
> forbidden or undefined (don't remember which one it is in Delphi).
What do you mean by "unconverted"? What I mean is a type that can simply 
"be" any of the "strict" types and thus provides fully dynamic encoding 
for code (e.g. functions) that wants to handle any encoding with the 
same code sequence (accepting the appropriate conversion performance 
hit when combining differently encoded strings).
>> Such "assignment" can happen with ":=", and with function calls. With
>> function calls there is "value" and "var" parameters. All this should
>> behave identical, any other behavior would be very hard to understand.
>
> Don't forget about "out". As it sets the string to empty I don't know 
> by myself what Delphi does here (e.g. what codepage the string will 
> contain).
Of course we need a decent definition for this case. As I have never 
intentionally used "out" parameters, I am not aware of the exact 
implications, but I am sure that a reasonably compatible definition can 
be found.
>
> In Delphi the type "String" is an alias to "UnicodeString", thus a 
> 2-byte string.
IMHO, predefining a type named UnicodeString to be encoded as UTF-16 is 
counter-intuitive. I think FPC should establish better naming (such as 
UTF16String for something that is predefined to be encoded that way, if 
it in fact makes sense to define such a type in the language itself). 
For Delphi compatibility, type aliasing could be used.
> In FPC there is no final decision yet and thus currently "String" is 
> an "AnsiString" set to a specific codepage (though I honestly don't 
> know which one it is...).
So I hope this discussion might help to promote a string type 
functionality and naming system that is better than what Delphi 
currently provides.


I feel that, given the current state of the discussion, the following 
types should be defined (I don't intend to fix the exact names by this, 
nor to make any assumption on how to implement them):

  - GeneralString (fully dynamic encoding: can hold any encoding with 1, 
2 and 4 byte code words; no conversion when used as the target of an 
assignment; automatic conversion whenever necessary)

  - RawByteString (one byte code words, never doing a conversion, 
supposedly triggering an exception when combined with a variable that 
requires a dedicated encoding)

  - RawWordString (two byte code words, working like RawByteString)

  - RawDWordString (four byte code words, working like RawByteString)

  - UTF8String (one byte code words, behavior is obvious)

  - UTF16String (two byte code words, behavior is obvious)

  - UTF32String (and/or UCS4String) (four byte code words, behavior is 
obvious)

  - ANSIString(n) (strictly encoded according to an ANSI code page, one 
byte code words, behavior is obvious)

  - ANSIString (and/or LocaleString) = ANSIString(n), with n defined by 
the current locale. This one should work very much like the plain old 
"String", even though the implementation is different.

And as a bonus, these could be implemented later:

  - RawByteFIFOString (behaving exactly as RawByteString, but 
implemented in a way that deleting from position 1 is much faster, while 
any other operation might be slower)
  - RawWordFIFOString (obvious)
  - RawDWordFIFOString (obvious)
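
One possible way to get the fast "delete from position 1" behavior of
such a FIFO string is to keep a start offset into the buffer instead of
moving the remaining bytes on every delete. The following is only a
sketch under that assumption (again in Python as a neutral illustration;
the class and method names are hypothetical). Compaction is done only
when more than half of the buffer is dead space.

```python
class RawByteFifo:
    """Byte buffer whose front can be dropped without moving the rest."""

    def __init__(self, data=b""):
        self._buf = bytearray(data)
        self._start = 0  # index of the first live byte

    def __len__(self):
        return len(self._buf) - self._start

    def append(self, data):
        self._buf += data

    def delete_front(self, count):
        """Drop up to `count` bytes from the front by advancing the offset."""
        count = max(0, min(count, len(self)))
        self._start += count
        # Compact only when more than half of the buffer is dead space.
        if self._start > len(self._buf) // 2:
            del self._buf[:self._start]
            self._start = 0

    def to_bytes(self):
        return bytes(self._buf[self._start:])


fifo = RawByteFifo(b"hello world")
fifo.delete_front(6)   # drops "hello "
fifo.append(b"!")
```

The price is the extra offset handling in every other operation, which
is why such a type might be slower for anything but front deletion.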

Moreover I feel that for some or all of these string types corresponding 
character types should be provided. Otherwise I don't see how consistent 
programming could be enabled. This obviously includes a dynamically 
typed character type.

Moreover, IMHO, the meaning of the position-aware functions 
(MyString[i], pos(), copy(), delete(), ...) should be reconsidered, to 
allow the user to state explicitly whether they want to work on 
code positions (fast) or on visual-character positions (meaningful, 
user-friendly).
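
To illustrate how far code positions and visual-character positions can
drift apart, here is a small sketch (in Python, purely as a neutral
illustration). The helper names are hypothetical, and the
"visual character" grouping is a rough approximation that only attaches
combining marks to the preceding base character; a real implementation
would have to follow the Unicode grapheme cluster rules.

```python
import unicodedata


def utf8_units(s):
    """Number of one byte code words when encoded as UTF-8."""
    return len(s.encode("utf-8"))


def utf16_units(s):
    """Number of two byte code words when encoded as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


def visual_chars(s):
    """Rough approximation: glue combining marks onto the preceding character."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# 'e' followed by COMBINING ACUTE ACCENT, then MUSICAL SYMBOL G CLEF
text = "e\u0301\U0001D11E"

counts = (utf8_units(text), utf16_units(text), len(text), len(visual_chars(text)))
# counts == (7, 4, 3, 2): UTF-8 units, UTF-16 units, code points, visual characters
```

So for the same two visible characters, an index-based function would
see 7, 4, 3 or 2 positions depending on what it is defined to count.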

(I don't think this greatly contradicts what is already implemented in 
svn.)

-Michael


