[fpc-devel] Unicode proceedings
Michael Schnell
mschnell at lumino.de
Fri Nov 18 11:11:17 CET 2011
On 11/17/2011 02:55 PM, Sven Barth wrote:
> Am 17.11.2011 12:59, schrieb Michael Schnell:
>>> Note that the Delphi2009 definition is theoretically capable of
>>> combining one and
>>> two bytes in one type (like Yury's).
>> As I don't have such a Delphi please help me to understand:
>>
>> Is there a general type dedicated for being able to hold any encoding ?
>> (be it ANSIxyz, UTF-8 or UTF-16) ?
>
> In theory the AnsiString type (which is now the code page aware string
> type) should be capable of holding UTF-8 and UTF-16 data,
Why should a type that is capable of holding multiple different UTF
encodings be called "ANSIString"? IMHO this is very counter-intuitive. I
think FPC should establish a better name (such as "GeneralString" or
similar). This would not harm Delphi compatibility, as there could be a
type alias for this.
> but either the direct unconverted storage of 2 byte data (UTF-16) is
> forbidden or undefined (don't remember which one it is in Delphi).
What do you mean by "unconverted"? What I mean is a type that is simply
able to "be" any of the "strict" types and thus provides fully dynamic
encoding for applications (functions) that want to handle any encoding
with the same code sequence (being aware that they take the
corresponding conversion performance hit when combining differently
encoded strings).
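To illustrate the idea, here is a minimal sketch in Python (not Pascal, and not a proposal for how FPC should implement it). The class name GeneralString is borrowed from the suggestion above; the methods from_text, converted and text, and the rule that the left operand decides the encoding of a concatenation, are assumptions made only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneralString:
    """Hypothetical dynamically encoded string: raw bytes plus an encoding tag."""
    data: bytes
    encoding: str  # e.g. "utf-8", "utf-16-le", "cp1252"

    @classmethod
    def from_text(cls, text: str, encoding: str) -> "GeneralString":
        return cls(text.encode(encoding), encoding)

    def converted(self, encoding: str) -> "GeneralString":
        # No work at all when the encoding already matches.
        if encoding == self.encoding:
            return self
        text = self.data.decode(self.encoding)
        return GeneralString(text.encode(encoding), encoding)

    def __add__(self, other: "GeneralString") -> "GeneralString":
        # Combining differently encoded strings pays the conversion cost here.
        # The result takes the encoding of the left operand (an arbitrary choice).
        return GeneralString(
            self.data + other.converted(self.encoding).data, self.encoding
        )

    def text(self) -> str:
        return self.data.decode(self.encoding)


a = GeneralString.from_text("Grüße", "cp1252")
b = GeneralString.from_text(" world", "utf-16-le")
c = b        # plain assignment: the encoding is kept, nothing is converted
d = a + b    # mixed encodings: b is converted to cp1252 before appending
```

Here c still carries "utf-16-le", while d is a cp1252 string holding both parts. Only the mixed concatenation triggers a conversion.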
>> Such "assignment" can happen with ":=", and with function calls. With
>> function calls there is "value" and "var" parameters. All this should
>> behave identical, any other behavior would be very hard to understand.
>
> Don't forget about "out". As it sets the string to empty I don't know
> by myself what Delphi does here (e.g. what codepage the string will
> contain).
Of course we need a decent definition for this case. As I have never
intentionally used "out" parameters, I am not aware of the exact
implications, but I am sure that there is a way to arrive at a
reasonably compatible definition.
>
> In Delphi the type "String" is an alias to "UnicodeString", thus a
> 2-byte string.
IMHO, predefining a type named UnicodeString to be encoded as UTF-16 is
counter-intuitive. I think FPC should establish better naming (such as
UTF16String for something that is predefined to be encoded that way, if
it in fact makes sense to define such a type in the language itself).
For Delphi compatibility, type aliasing could be used.
> In FPC there is no final decision yet and thus currently "String" is
> an "AnsiString" set to a specific codepage (though I honestly don't
> know which one it is...).
So I hope this discussion might help to promote a string type
functionality and naming system that is better than what Delphi
currently provides.
I feel that - regarding the current state of the discussion - such types
should be defined (I don't intend to fix the exact names by this, nor
to make any assumptions on how to implement this):
- GeneralString (fully dynamic encoding, can hold any encoding, 1, 2
  and 4 byte code words, no conversion when used as the target of an
  assignment, automatic conversion whenever necessary)
- RawByteString (one-byte code words, never doing a conversion,
  supposedly triggering an exception when combined with a variable that
  requires a dedicated encoding)
- RawWordString (two-byte code words, working like RawByteString)
- RawDWordString (four-byte code words, working like RawByteString)
- UTF8String (one-byte code words, behavior is obvious)
- UTF16String (two-byte code words, behavior is obvious)
- UTF32String (and/or UCS4String) (four-byte code words, behavior is
  obvious)
- ANSIString(n) (strictly encoded according to an ANSI code page, one-
  byte code words, behavior is obvious)
- ANSIString (and/or LocaleString) = ANSIString(n), with n defined by
  the current locale. This thingy should work very much like the plain
  old "String", even though the implementation is different.
and as a goody this could be implemented later:
- RawByteFIFOString (behaving exactly as RawByteString, but
implemented in a way that deleting from position 1 is much faster, while
any other operation might be slower)
- RawWordFIFOString (obvious)
- RawDWordFIFOString (obvious)
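One possible way to get the fast "delete from position 1" behavior is a buffer with a moving start offset. The following Python sketch only illustrates that trade-off; the method names (append, delete_front, to_bytes) and the compaction threshold of 64 are hypothetical and not part of any existing FPC or Delphi API.

```python
class RawByteFIFOString:
    """Hypothetical byte string with cheap deletion from the front."""

    def __init__(self, data: bytes = b"") -> None:
        self._buf = bytearray(data)
        self._start = 0  # index of the first live byte in _buf

    def __len__(self) -> int:
        return len(self._buf) - self._start

    def append(self, data: bytes) -> None:
        self._buf += data

    def delete_front(self, count: int) -> bytes:
        """Remove and return up to `count` bytes from the front."""
        count = max(0, min(count, len(self)))
        removed = bytes(self._buf[self._start:self._start + count])
        self._start += count  # the remaining bytes are not moved here
        # Compact only once the dead prefix dominates the buffer.
        if self._start > 64 and self._start * 2 > len(self._buf):
            del self._buf[:self._start]
            self._start = 0
        return removed

    def to_bytes(self) -> bytes:
        return bytes(self._buf[self._start:])


fifo = RawByteFIFOString(b"HEADER:payload")
head = fifo.delete_front(7)   # takes b"HEADER:" off the front
fifo.append(b"!")             # the string now holds b"payload!"
```

The price is the one named above: every other access has to add the start offset, and the occasional compaction step makes some deletions more expensive than others.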
Moreover I feel that for some or all of these string types corresponding
character types should be provided. Otherwise I don't see how consistent
programming could be enabled. This obviously includes a dynamically
typed character type.
Moreover, IMHO, the meaning of the position-aware functions
(MyString[i], pos(), copy(), delete(), ... ) should be reconsidered, to
allow the user to somehow state whether he wants to work on
code positions (fast) or on visual character positions (meaningful,
user-friendly).
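The difference between the two kinds of positions can be shown with a short Python sketch. The helper names utf16_units and visual_chars are made up for this example, and the grouping rule (a base code point plus the combining marks that follow it) is a deliberate simplification; complete segmentation into user-perceived characters is defined by Unicode UAX #29.

```python
import unicodedata


def utf16_units(text: str) -> list[bytes]:
    """Split text into its UTF-16 code units (2 bytes each)."""
    raw = text.encode("utf-16-le")
    return [raw[i:i + 2] for i in range(0, len(raw), 2)]


def visual_chars(text: str) -> list[str]:
    """Group each base code point with the combining marks that follow it."""
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # attach the mark to the previous base
        else:
            clusters.append(ch)  # start a new visual character
    return clusters


s = "e\u0301\U0001D11E"   # 'e' + combining acute accent, then a musical G clef
units = utf16_units(s)    # code positions: 4 UTF-16 code units
chars = visual_chars(s)   # visual positions: 2 characters
```

Indexing by code position is a plain array access, but position 2 here lands on a lone accent and position 3 on half of a surrogate pair. Indexing by visual position needs a scan of the string, which is why the choice should be explicit.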
(I don't think this is a great contradiction to what already is
implemented in the svn.)
-Michael