[fpc-pascal] Unicode chars losing information
Tomas Hajny
XHajT03 at hajny.biz
Mon Mar 8 23:26:24 CET 2021
On 2021-03-08 21:36, Martin Frb via fpc-pascal wrote:
.
.
> In the example the index access should have returned a single
> codeunit, which was known to be a complete codepoint.
> As far as I understand the unexpected part was, that the unicode
> string did not contain the content of the string constant, because the
> assignment had caused an encoding conversion to be inserted.
> That conversion caused the need for a widestring manager.
>
> Maybe to help the search when/where and whatfor notes/warnings
> should/could be produced, those implicit conversions can be broken
> down into groups.
> I can think of 2 groups already.
> 1) Conversion due to explicit declared different encoding.
> AnAnsiString := SomeWideString;
> AnAsciiString := AnUtf8String; // declared as "type
> AnsiString(CP_ASCII);" and "type AnsiString(CP_UTF8);"
Do you mean a compile-time warning? The trouble is that the compiler
wouldn't know whether a real widestring manager would get included in
the final binary when such conversions are encountered. And remember
that the final binary may be compiled at a different time from the
moment when the unit containing such conversions is compiled. In other
words, compile-time warnings would be rather difficult to implement. It
might be possible to raise an error at runtime when such conversions are
invoked, but note that technically the conversion may not lead to
incorrect results if the string doesn't contain characters beyond
US-ASCII. In other words, a run-time error might be appropriate only if
the conversion encounters a character it cannot handle. However, adding
such a check would probably slow down processing even for cases when the
strings don't contain any problematic characters.
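
To illustrate the kind of per-character check meant here, a minimal
sketch follows. IsPureAscii is a hypothetical helper written for this
example only; it is not part of the RTL, and a real check would have to
live inside the conversion routine itself. The loop over every code
unit is exactly the extra work that would also be paid for strings
containing nothing problematic.

```pascal
program AsciiCheck;
{$mode objfpc}{$H+}

{ Hypothetical helper (not an RTL routine): returns True when every
  UTF-16 code unit of S lies within the US-ASCII range 0..127. }
function IsPureAscii(const S: UnicodeString): Boolean;
var
  I: SizeInt;
begin
  for I := 1 to Length(S) do
    if Ord(S[I]) > 127 then
      Exit(False);   { first non-ASCII code unit found, stop scanning }
  Result := True;
end;

var
  U: UnicodeString;
begin
  U := 'plain text';
  WriteLn(IsPureAscii(U));     { only US-ASCII characters so far }
  U := U + WideChar($010D);    { append U+010D, which is outside US-ASCII }
  WriteLn(IsPureAscii(U));
end.
```
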
> 2) Conversion where at least one string is not explicitly declared for
> a certain codepage.
> This should include indirection via $codepage
No, this is not the case. $codepage defines the source file encoding.
The compiler translates the string constants declared this way to a
UTF-16 constant stored within the compiled binary. Specifying $codepage
has no implications on runtime conversions by itself.
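
A small sketch of what this means in practice, assuming the source file
really is saved as UTF-8 (the program name and the literal are made up
for the example). The constant is translated when the unit is compiled,
so assigning it to a UnicodeString should not involve any conversion at
runtime.

```pascal
program CodepageDemo;
{$mode objfpc}{$H+}
{$codepage utf8}   { declares the encoding of this source file }

var
  U: UnicodeString;
begin
  { The literal below is translated by the compiler at compile time;
    the assignment itself needs no runtime encoding conversion. }
  U := 'čaj';
  WriteLn(Length(U));    { number of UTF-16 code units in U }
  WriteLn(Ord(U[1]));    { numeric value of the first code unit }
end.
```
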
> Then maybe as a first step, a note/warning could be given, if a
> constant string is assigned to a variable, and a change of encoding is
> needed for this.
> - "constant string" here would be any string that does not have a
> direct explicit declared encoding.
> - This could be given, even if the presence/absence of a widestring
> manager is not known. Because
Because what?
> Obviously knowing the presence/absence of a widestring manager allows
> to refine warnings.
> But I guess that comes at a higher price, as each unit when compiled
> could only set flags in the ppu (including forwarding flags from used
> units).
> And the compiling the final program would read which warning flags are
> present, and if any unit flagged the inclusion of a widestring
> manager.
Yes, this would indeed be the only possibility.
Tomas