[fpc-pascal] Unicode chars losing information

Tomas Hajny XHajT03 at hajny.biz
Mon Mar 8 23:26:24 CET 2021


On 2021-03-08 21:36, Martin Frb via fpc-pascal wrote:
  .
  .
> In the example the index access should have returned a single
> codeunit, which was known to be a complete codepoint.
> As far as I understand the unexpected part was, that the unicode
> string did not contain the content of the string constant, because the
> assignment had caused an encoding conversion to be inserted.
> That conversion caused the need for a widestring manager.
> 
> Maybe to help the search when/where and whatfor notes/warnings
> should/could be produced, those implicit conversions can be broken
> down into groups.
> I can think of 2 groups already.
> 1) Conversion due to explicit declared different encoding.
>    AnAnsiString := SomeWideString;
>   AnAsciiString := AnUtf8String; // declared as "type
> AnsiString(CP_ASCII);" and "type AnsiString(CP_UTF8);"

Do you mean a compile-time warning? The trouble is that the compiler 
wouldn't know whether a real widestring manager would get included in 
the final binary when such conversions are encountered. And remember 
that the final binary may be compiled at a different time from the 
moment when the unit containing such conversions is compiled. In other 
words, compile-time warnings would be rather difficult to implement. It 
might be possible to raise an error at run time when such conversions are 
invoked, but note that technically the conversion may not lead to 
incorrect results if the string doesn't contain characters beyond 
US-ASCII. In other words, a run-time error might be appropriate only if 
the conversion encounters a character it cannot handle. However, adding 
such a check would probably slow down processing even in cases where the 
strings don't contain any problematic characters.
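
To make the kind of conversion under discussion concrete, here is a minimal sketch. It assumes a Unix-like target, where the cwstring unit is the usual way to get a widestring manager into the binary; the program and variable names are only illustrative.

```pascal
program ConvDemo;
{$mode objfpc}{$H+}

{$ifdef unix}
uses
  cwstring;  // assumption: Unix-like target; this unit installs a widestring manager
{$endif}

var
  U: UnicodeString;
  A: AnsiString;
begin
  U := 'abc';   // US-ASCII only
  A := U;       // implicit UnicodeString -> AnsiString conversion, done at run time
  WriteLn(A);
end.
```

With only US-ASCII content the result should be the same with or without the uses clause, which is why an unconditional run-time error on every such conversion would be too strict.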


> 2) Conversion where at least one string is not explicitly declared for
> a certain codepage.
>    This should include indirection via $codepage

No, this is not the case. $codepage only defines the source file encoding. 
The compiler translates string constants in such a source file to 
UTF-16 constants stored within the compiled binary. Specifying $codepage 
has no implications for run-time conversions by itself.
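
As a small illustration of this point, a sketch along the following lines could be used. It assumes the source file really is saved as UTF-8; the program name and the literal are made up for the example.

```pascal
program CodepageDemo;
{$mode objfpc}{$H+}
{$codepage utf8}  // assumption: this source file is saved as UTF-8

var
  U: UnicodeString;
begin
  // The literal is translated by the compiler and stored as UTF-16,
  // so no run-time conversion is expected for this assignment.
  U := 'é';
  WriteLn(Length(U));   // number of UTF-16 code units
  WriteLn(Ord(U[1]));   // value of the first code unit
end.
```

A run-time conversion would only come into play once U is assigned to a string type with a different encoding.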


> Then maybe as a first step, a note/warning could be given, if a
> constant string is assigned to a variable, and a change of encoding is
> needed for this.
> - "constant string" here would be any string that does not have a
> direct explicit declared encoding.
> - This could be given, even if the presence/absence of a widestring
> manager is not known. Because

Because what?


> Obviously knowing the presence/absence of a widestring manager allows
> to refine warnings.
> But I guess that comes at a higher price, as each unit when compiled
> could only set flags in the ppu (including forwarding flags from used
> units).
> And the compiling the final program would read which warning flags are
> present, and if any unit flagged the inclusion of a widestring
> manager.

Yes, this would indeed be the only possibility.

Tomas

