[fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"

Hans-Peter Diettrich DrDiettrich1 at aol.com
Thu Nov 27 19:29:55 CET 2014

Michael Schnell schrieb:
> On 11/26/2014 06:37 PM, Hans-Peter Diettrich wrote:
>> An AnsiString consists of AnsiChars. The *meaning* of these chars 
>> (bytes) depends on their encoding, regardless of whether the used 
>> encoding is or is not stored with the string.
> I understand that the implementation (in Delphi) seems to be driven more 
> by the Wording ("ANSI") than by the logical paradigm the language syntax 
> suggests. The language syntax and the string header fields suggest that 
> both the element size and the code-ID number need to be adhered to (be it 
> statically or dynamically - depending on the usage instance). E.g. there 
> are at least two "code pages" for UTF-16 ("LE" and "BE") that would 
> be worth supporting.

You are confusing codepages and encodings :-(

UTF-7, UTF-8, UTF-16 and UTF-16BE describe different representations of 
the same values (Unicode codepoints). And I agree, all commonly used 
encodings should be implemented, at least for data import/export.
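As a concrete illustration (a Python sketch, not FPC code): one and the same codepoint, U+20AC (EURO SIGN), has a different byte representation in each encoding, yet decodes back to the same value.

```python
# Illustration (Python, not FPC): one Unicode codepoint, several encodings.
# UTF-8, UTF-16LE and UTF-16BE are merely different byte-level
# representations of the same logical value.
s = "\u20ac"  # EURO SIGN

print(s.encode("utf-8"))      # b'\xe2\x82\xac'  (3 bytes)
print(s.encode("utf-16-le"))  # b'\xac\x20'      (2 bytes, little-endian)
print(s.encode("utf-16-be"))  # b'\x20\xac'      (2 bytes, big-endian)

# Decoding any of them yields the same codepoint again:
assert s.encode("utf-16-be").decode("utf-16-be") == s
```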

>> It's essential to distinguish between low-level (physical) AnsiChar 
>> values, and *logical* characters possibly consisting of multiple 
>> AnsiChars.
> I now do see that the implementation is done following this concept. But 
> the language syntax and the string header field suggest a more versatile 
> paradigm, providing a universal reference-counted "element string" type.

See it as a multi-level protocol for text processing. The bottom 
(physical) level deals with physical storage items (AnsiChar, 
WideChar...) and how they are stored in memory or files. Just as it 
makes no sense to deal with individual bytes of real numbers in 
computations, it makes no sense to deal with individual bytes 
(AnsiChars) of logical characters - except in type/encoding conversions. 
Higher levels deal with logical values, which can consist of multiple 
physical items and may need different interpretations (in the case of 
Ansi codepages). This level is partially covered now by AnsiString 
encodings and UTF-16 surrogate pairs, which allow mapping the values 
into full Unicode (UCS-4) codepoints. But even these codepoints are not 
sufficient for a correct interpretation and manipulation of logical 
characters, which in turn can consist of multiple codepoints (decomposed 
umlauts, ligatures...). At the next level another (mostly language 
specific) interpretation may be required, e.g. which logical characters 
have to be treated together (ligatures, non-breaking characters...). 
Some natural languages (Hebrew, Arabic...) require further special 
handling of (mixed) LTR/RTL reading, and of "paths", influencing the 
graphical representation of character sequences; but that's nothing an 
application or library writer should have to deal with - such 
functionality should be provided by the target platform.
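Both levels can be made tangible with a small Python sketch (an illustration of the concepts, not of any FPC implementation): a codepoint outside the BMP occupies two 16-bit items in UTF-16, and a decomposed umlaut is one logical character made of two codepoints.

```python
import unicodedata

# Physical items vs. logical values: U+1D11E (MUSICAL SYMBOL G CLEF)
# lies outside the BMP, so UTF-16 stores it as a surrogate pair --
# two 16-bit storage items for a single codepoint.
clef = "\U0001d11e"
utf16 = clef.encode("utf-16-be")
assert len(utf16) == 4                    # two 16-bit code units
assert utf16 == b"\xd8\x34\xdd\x1e"       # high 0xD834, low 0xDD1E

# Codepoints vs. logical characters: a decomposed umlaut is one
# logical character consisting of two codepoints.
a_umlaut = "\u00e4"                            # 'ä', precomposed
decomposed = unicodedata.normalize("NFD", a_umlaut)
assert len(decomposed) == 2                    # 'a' + COMBINING DIAERESIS
assert unicodedata.normalize("NFC", decomposed) == a_umlaut
```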

There must be a boundary between the standard (RTL) handling of the 
physical items and encodings, and higher text-processing levels, up to 
language-specific processing (how to break words, when to apply 
capitalization, syntax checks...), so that such special handling can be 
implemented in dedicated extensions (libraries, classes) by developers 
familiar with the rules and conventions of the natural languages.

For now we are talking only about the handling up to individual Unicode 
codepoints, and related string manipulation. For this, at least one 
string representation must exist that covers the full Unicode range of 
codepoints (UTF-8 or UTF-16 for now). When such an implementation claims 
"undefined" behaviour, this can only mean implementation flaws, 
resulting in something different from what can be expected of proper 
Unicode handling. This includes invalid parameter values in subroutine 
calls, which should result in proper (defined) runtime error reporting 
(AV, error result...).
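The expectation of defined errors can be sketched in Python (an illustration of the principle, not of the FPC RTL): invalid input produces a well-defined error, and lossy behaviour happens only when explicitly requested.

```python
# Illustration (Python): invalid input should yield a defined error,
# never silently "undefined" results. These bytes are not valid UTF-8.
bad = b"\xff\xfe\xfa"

try:
    bad.decode("utf-8")
except UnicodeDecodeError as e:
    print("defined error:", e.reason)     # a proper, reportable error

# Lossy handling must be an explicit choice, with a consistent result:
print(bad.decode("utf-8", errors="replace"))  # three U+FFFD characters
```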

WRT AnsiString encodings, the only acceptable (expected) differences 
can result from lossy conversions, when converting proper Unicode into a 
non-UTF encoding. Even then the results should be consistent, even if 
the concrete results depend on some external (platform...) convention or 

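Such a lossy conversion can be sketched in Python (illustrating the principle, not FPC behaviour): the euro sign has no representation in Latin-1, so information is necessarily lost - but consistently, under an explicitly chosen policy.

```python
# Illustration (Python): converting proper Unicode into a non-UTF
# ("Ansi") encoding that cannot represent all its characters.
s = "Preis: 5 \u20ac"          # contains the euro sign

try:
    s.encode("latin-1")        # strict mode: a defined error is raised
except UnicodeEncodeError:
    print("defined error: euro sign not representable in Latin-1")

# The lossy result is consistent and explicitly requested:
lossy = s.encode("latin-1", errors="replace")
assert lossy == b"Preis: 5 ?"  # the euro sign became '?'
```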

>> That's why I wonder *when* exactly the result of such an expression 
>> *is* converted (implicitly) into the static encoding of the target 
>> variable, and when *not*.
> I understand that the idea is, to use the static encoding information 
> provided by the type definition whenever possible.

Right, but here "whenever possible" depends on the correspondence of 
static and dynamic encoding. If the dynamic encoding can *ever* be 
different from the static encoding - except for RawByteString - I consider 
it NOT possible to derive the need for a conversion from the static 
encoding alone. In the handling of floating-point values we may have to 
expect invalid operations (division by zero, overflow...) or values 
(NaN...), but NOT that a Double variable ever contains two Integer 
values - unless forced by dirty hacks out of compiler control. Why should 
this be different and acceptable with string types?
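To make the point concrete, here is a deliberately simplified toy model in Python - all names are hypothetical, this is not the FPC RTL - of why an assignment must consult the dynamic codepage at run time whenever it can differ from the static one:

```python
from dataclasses import dataclass

# Toy model (hypothetical): an "AnsiString" carrying its dynamic
# codepage in a header field, as the string header fields suggest.
@dataclass
class AnsiStr:
    data: bytes
    codepage: str      # dynamic encoding, stored with the string

def assign(value: AnsiStr, static_codepage: str) -> AnsiStr:
    """Assign to a variable whose *static* type declares static_codepage.

    Whether a conversion is needed cannot be decided statically;
    the dynamic field must be inspected at run time.
    """
    if value.codepage == static_codepage:
        return value                               # no conversion needed
    # dynamic != static: convert via Unicode codepoints
    text = value.data.decode(value.codepage)
    return AnsiStr(text.encode(static_codepage), static_codepage)

src = AnsiStr("häßlich".encode("cp1252"), "cp1252")
dst = assign(src, "utf-8")
assert dst.codepage == "utf-8"
assert dst.data.decode("utf-8") == "häßlich"
```

The analogy to Double holds: the model stays safe only because the dynamic field is always trusted and checked; if it could silently disagree with the bytes, every operation would be undefined.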

> In Delphi the use of the dynamic encoding information seems to be very 
> rare (and the implementation does not make much sense to me).

It's known that the Delphi AnsiString implementation is flawed, with 
possibly different results when the same expression is based on 
AnsiString or UnicodeString operands. IMO the same is unacceptable in 
FPC, *unless* the user is given a choice between a proper and safe 
(maybe slower) and an error-prone and dangerous (maybe faster) string 
expression evaluation.

> My hope was, that fpc might be able to correct this error of the Delphi 
> compiler coders. But of course for Delphi compatibility the type name 
> RawByteString and the code-ID number $FFFF can't be used any more, but 
> a new naming and ID number would need to be invented. IMHO this in fact 
> is possible and viable (see wiki page for details).

I see no problem in using the same names and values. Delphi documents 
clearly state:

RawByteString should only be used as a parameter type, and only in 
routines which otherwise would need multiple overloads for AnsiStrings 
with different codepages. Such routines need to be written with care for 
the actual codepage of the string at run time.

In general, it is recommended that string processing routines should 
simply use "string" as the string type. Declaring variables or fields of 
type RawByteString should rarely, if ever, be done, because this 
practice can lead to undefined behavior and potential data loss.

Where is it specified that no conversion occurs when a RawByteString is 
assigned *to* a variable of a different encoding?

