[fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
Hans-Peter Diettrich
DrDiettrich1 at aol.com
Thu Nov 27 19:29:55 CET 2014
Michael Schnell wrote:
> On 11/26/2014 06:37 PM, Hans-Peter Diettrich wrote:
>>
>> An AnsiString consists of AnsiChars. The *meaning* of these chars
>> (bytes) depends on their encoding, regardless of whether the used
>> encoding is or is not stored with the string.
> I understand that the implementation (in Delphi) seems to be driven more
> by the wording ("ANSI") than by the logical paradigm the language syntax
> suggests. The language syntax and the string header fields suggest that
> both the element size and the code-ID number need to be adhered to (be it
> statically or dynamically - depending on the usage instance). E.g. there
> are at least two "code pages" for UTF-16 ("LE" and "BE") that would
> be worth supporting.
You are confusing codepages and encodings :-(
UTF-7, UTF-8, UTF-16 and UTF-16BE describe different representations of
the same values (Unicode codepoints). And I agree, all commonly used
encodings should be implemented, at least for data import/export.
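To make the difference concrete, here is a minimal FPC sketch (assuming
FPC 3.x; program and variable names are mine): the euro sign U+20AC is
one value with several representations - one UTF-16 code unit, three
UTF-8 code units, and the LE/BE variants differ only in byte order.

  program EncodingDemo;
  {$mode objfpc}{$H+}
  uses SysUtils;
  var
    u: UnicodeString;   // UTF-16 in FPC
    b: UTF8String;
    i: Integer;
  begin
    u := WideChar($20AC);     // U+20AC EURO SIGN: one UTF-16 code unit
    b := UTF8Encode(u);       // same codepoint: three UTF-8 code units
    WriteLn('UTF-16 code units: ', Length(u));   // 1
    WriteLn('UTF-8 code units:  ', Length(b));   // 3
    for i := 1 to Length(b) do
      Write(IntToHex(Ord(b[i]), 2), ' ');        // E2 82 AC
    WriteLn;   // UTF-16LE stores AC 20, UTF-16BE stores 20 AC
  end.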
>> It's essential to distinguish between low-level (physical) AnsiChar
>> values, and *logical* characters possibly consisting of multiple
>> AnsiChars.
> I now do see that the implementation is done following this concept. But
> the language syntax and the string header field suggest a more versatile
> paradigm, providing a universal reference-counted "element string" type.
See it as a multi-level protocol for text processing. The bottom
(physical) level deals with physical storage items (AnsiChar,
WideChar...) and how they are stored in memory or files. Just as it
makes no sense to deal with individual bytes of real numbers in
computations, it makes no sense to deal with individual bytes
(AnsiChars) of logical characters - except in type/encoding conversions.

Higher levels deal with logical values, which can consist of multiple
physical items and may need different interpretations (in the case of
Ansi codepages). This level is partially covered now by AnsiString
encodings and UTF-16 surrogate pairs, which allow mapping the values
onto full Unicode (UCS-4) codepoints. But even these codepoints are not
sufficient for a correct interpretation and manipulation of logical
characters, which in turn can consist of multiple codepoints (decomposed
umlauts, ligatures...). At the next level another (mostly
language-specific) interpretation may be required, e.g. which logical
characters have to be treated together (ligatures, non-breaking
characters...).

Some natural languages (Hebrew, Arabic...) require further special
handling of (mixed) LTR/RTL reading and of "paths" influencing the
graphical representation of character sequences; but that's nothing an
application or library writer should have to deal with - such
functionality should be provided by the target platform.
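To illustrate the codepoint/logical-character gap, a small sketch
(assuming FPC 3.x; names are mine): 'ä' as the precomposed codepoint
U+00E4 versus the decomposed pair U+0061 U+0308 - one and the same
logical character, but plain string comparison sees two different
codepoint sequences.

  program LogicalChars;
  {$mode objfpc}{$H+}
  var
    Precomposed, Decomposed: UnicodeString;
  begin
    Precomposed := WideChar($00E4);              // 'ä' as one codepoint
    Decomposed := WideChar($0061);               // 'a' ...
    Decomposed := Decomposed + WideChar($0308);  // ... plus COMBINING DIAERESIS
    WriteLn(Length(Precomposed));     // 1 code unit, 1 codepoint
    WriteLn(Length(Decomposed));      // 2 code units, 2 codepoints
    WriteLn(Precomposed = Decomposed);  // FALSE: one logical character,
                                        // two codepoint sequences
  end.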
There must be a boundary between the standard (RTL) handling of the
physical items and encodings, and higher text processing levels, up to
language-specific processing (how to break words, when to apply
capitalization, syntax checks...), so that such special handling can be
implemented in dedicated extensions (libraries, classes), by developers
familiar with the rules and conventions of the natural languages.
For now we are talking only about the handling up to individual Unicode
codepoints, and related string manipulation. For this, at least one
string representation must exist that covers the full Unicode range of
codepoints (UTF-8 or UTF-16 for now). When such an implementation claims
"undefined" behaviour, this can only mean implementation flaws,
resulting in something different from what can be expected from proper
Unicode handling. This includes invalid parameter values in subroutine
calls, which should result in proper (defined) runtime error reporting
(AV, error result...).
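As a concrete instance, a sketch of the surrogate-pair mapping with
defined error behaviour (the function name and the choice of
EConvertError are mine, not RTL API):

  program SurrogateDemo;
  {$mode objfpc}{$H+}
  uses SysUtils;

  // Map a UTF-16 surrogate pair onto a Unicode codepoint beyond the BMP.
  // Invalid input raises a defined error instead of producing garbage.
  function DecodeSurrogatePair(HiUnit, LoUnit: Word): Cardinal;
  begin
    if (HiUnit < $D800) or (HiUnit > $DBFF) or
       (LoUnit < $DC00) or (LoUnit > $DFFF) then
      raise EConvertError.Create('invalid UTF-16 surrogate pair');
    Result := $10000 + ((HiUnit - $D800) shl 10) + (LoUnit - $DC00);
  end;

  begin
    // U+1F600 is stored in UTF-16 as the pair D83D DE00.
    WriteLn(IntToHex(DecodeSurrogatePair($D83D, $DE00), 5));  // 1F600
  end.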
WRT AnsiString encodings, the only acceptable (expected) differences
can result from lossy conversions, when converting proper Unicode into a
non-UTF encoding. Even then the results should be consistent, even if
the concrete results depend on some external (platform...) convention or
setting.
IMO.
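A minimal sketch of the lossy case (assuming FPC 3.x codepage-aware
AnsiStrings; the concrete replacement character depends on the platform
conversion, as said above):

  program LossyDemo;
  {$mode objfpc}{$H+}
  type
    Latin1String = type AnsiString(28591);  // ISO 8859-1: has no euro sign
  var
    u: UnicodeString;
    l: Latin1String;
  begin
    u := WideChar($20AC);  // the euro sign, representable in Unicode
    l := u;                // implicit conversion to the static encoding: lossy
    WriteLn(Ord(l[1]));    // typically 63 ('?'): a consistent replacement,
                           // whatever the platform convention supplies
  end.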
>> That's why I wonder *when* exactly the result of such an expression
>> *is* converted (implicitly) into the static encoding of the target
>> variable, and when *not*.
> I understand that the idea is to use the static encoding information
> provided by the type definition whenever possible.
Right, but here "whenever possible" depends on the correspondence of
static and dynamic encoding. If the dynamic encoding can *ever* be
different from the static encoding, except for RawByteString, I consider
it NOT possible to derive the need for a conversion from the static
encoding. In the handling of floating-point values we may have to expect
invalid operations (division by zero, overflow...) or values (NaN...),
but NOT that a Double variable ever contains two Integer values - unless
forced by dirty hacks outside compiler control. Why should this be
different and acceptable with string types?
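The Double analogy can even be reproduced in code: FPC's SetCodePage
allows relabelling a string's dynamic codepage without converting the
bytes - exactly such a dirty hack outside compiler control (a sketch,
assuming FPC 3.x and its StringCodePage/SetCodePage RTL routines):

  program DynamicVsStatic;
  {$mode objfpc}{$H+}
  var
    s: UTF8String;  // static encoding: CP_UTF8 (65001)
  begin
    s := 'abc';
    WriteLn(StringCodePage(s));  // 65001: dynamic matches static
    // Relabel the bytes without conversion - the dirty hack:
    SetCodePage(RawByteString(s), 1252, False);
    WriteLn(StringCodePage(s));  // 1252: dynamic no longer matches static
  end.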
> In Delphi the use of the dynamic encoding information seems to be very
> rare (and the implementation does not make much sense to me).
It's known that the Delphi AnsiString implementation is flawed, with
possibly different results when the same expression is based on
AnsiString or UnicodeString operands. But the same behaviour is IMO
unacceptable in FPC, *unless* the user has a choice between a proper and
safe (maybe slower) and an error-prone and dangerous (maybe faster)
string expression evaluation.
> My hope was that fpc might be able to correct this error of the Delphi
> compiler coders. But of course, for Delphi compatibility, the type name
> RawByteString and the code-ID number $FFFF can't be used any more;
> a new name and ID number would need to be invented. IMHO this in fact
> is possible and viable (see the wiki page for details).
I see no problem in using the same names and values. The Delphi
documentation clearly states:
>>
RawByteString should only be used as a parameter type, and only in
routines which otherwise would need multiple overloads for AnsiStrings
with different codepages. Such routines need to be written with care for
the actual codepage of the string at run time.
In general, it is recommended that string processing routines should
simply use "string" as the string type. Declaring variables or fields of
type RawByteString should rarely, if ever, be done, because this
practice can lead to undefined behavior and potential data loss.
<<
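Taken literally, this documented pattern looks like the following sketch
- one routine instead of one overload per codepage, branching on the
dynamic encoding at run time (the routine name is mine; CP_UTF8 and
CP_NONE are the FPC System unit constants):

  program RawParamDemo;
  {$mode objfpc}{$H+}

  // One routine instead of one overload per codepage; the dynamic
  // encoding is inspected at run time, as the docs demand.
  procedure DumpEncoding(const S: RawByteString);
  begin
    case StringCodePage(S) of
      CP_UTF8: WriteLn('UTF-8, ', Length(S), ' bytes');
      CP_NONE: WriteLn('no encoding attached ($FFFF)');
    else
      WriteLn('codepage ', StringCodePage(S), ', ', Length(S), ' bytes');
    end;
  end;

  var
    u: UTF8String;
  begin
    u := 'abc';
    DumpEncoding(u);  // passed without conversion; reports UTF-8, 3 bytes
  end.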
Where is it specified that no conversion occurs when a RawByteString is
assigned *to* a variable of a different encoding?
DoDi