[fpc-pascal] Unicode chars losing information

Tue Mar 9 03:08:03 CET 2021

On 3/9/21 2:18 AM, Graeme Geldenhuys via fpc-pascal wrote:
> On 08/03/2021 7:49 pm, Jonas Maebe via fpc-pascal wrote:
>> It's not possible to safely use unicodestring without
>> knowing how 16bit unicode works. The compiler can't solve that.
> I disagree. Java does just that! The issue is the assumption of using
> array indexing into a string. I guess developers should stop doing
> that.
>
> The important point is:
> But developers should be able to use Unicode strings without needing
> to know the ins and outs of Unicode and UTF-16 encoding. At least
> that's what's possible with Java and other languages.

Yes, you absolutely need to know the ins and outs of Unicode in order to 
know how to extract the first character of a string. First of all, what 
is a character? A UTF-16 code unit, a Unicode code point or an extended 
grapheme cluster? Your Java code only does the expected thing for a 
certain subset of characters. If you write your code like that, you're 
going to think your code works, but it would fail on strings with either 
non-BMP characters (if you use charAt) or strings with combining 
characters (if you use codePointAt). To split the string into
user-perceived characters (extended grapheme clusters), you need to do
this in FPC trunk:

uses
   graphemebreakproperty, fpwidestring;

var
   EGC, S: UnicodeString;

begin
   S := '💩Хей, помисли́ си!';
   for EGC in TUnicodeStringExtendedGraphemeClustersEnumerator.Create(S) do
     Writeln(EGC);
end.

Can Java do that? No, it appears it can't:

https://stackoverflow.com/questions/40878804/how-to-count-grapheme-clusters-or-perceived-emoji-characters-in-java

Neither charAt nor codePointAt will work for the 'и́'. charAt will also
fail on '💩'. Please correct me if I'm wrong; I didn't test this in Java.
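
For anyone who wants to check the two failure modes, here is a minimal Java sketch. It only uses String.charAt, String.codePointAt, String.codePointCount, Character.toChars and Character.isHighSurrogate from the standard library. The class name FirstCharDemo and the two helper methods are made up for illustration, and the test strings are written as \u escapes so the result does not depend on the source file encoding.

```java
final class FirstCharDemo {
    // U+1F4A9 is outside the BMP, so UTF-16 stores it as a surrogate pair.
    static final String POO = "\uD83D\uDCA9";
    // Cyrillic small letter i (U+0438) followed by a combining acute accent (U+0301).
    static final String I_ACUTE = "\u0438\u0301";

    /** The first UTF-16 code unit of s, i.e. what charAt(0) returns. */
    static String firstCodeUnit(String s) {
        return String.valueOf(s.charAt(0));
    }

    /** The first Unicode code point of s, i.e. what codePointAt(0) returns. */
    static String firstCodePoint(String s) {
        return new String(Character.toChars(s.codePointAt(0)));
    }

    public static void main(String[] args) {
        // charAt(0) on a non-BMP character yields a lone high surrogate.
        System.out.println(Character.isHighSurrogate(POO.charAt(0)));      // true
        // codePointAt(0) recovers the whole non-BMP character...
        System.out.println(firstCodePoint(POO).equals(POO));               // true
        // ...but it drops the combining accent from the accented letter.
        System.out.println(firstCodePoint(I_ACUTE).equals(I_ACUTE));       // false
        // One user-perceived character, two code points.
        System.out.println(I_ACUTE.codePointCount(0, I_ACUTE.length()));   // 2
    }
}
```

So charAt breaks on the surrogate pair, and codePointAt gets past that but still returns only the base letter of a combining sequence.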

> FPC (and Delphi) really need to get with the times.

If by "get with the times" you mean always include the fpwidestring unit 
and still produce less bloat than the JVM, then sure, we can do that, 
but some people appreciate the flexibility of choosing your own wide 
string manager or not including it for programs that don't need it.

And for things like splitting a string into characters, you really need
to know what you're doing anyway, since a Unicode code point does not
always correspond to what users perceive as a character.

Nikolay
