[fpc-pascal] Unicode chars losing information
Nikolay Nikolov
nickysn at gmail.com
Tue Mar 9 03:08:03 CET 2021
On 3/9/21 2:18 AM, Graeme Geldenhuys via fpc-pascal wrote:
> On 08/03/2021 7:49 pm, Jonas Maebe via fpc-pascal wrote:
>> It's not possible to safely use unicodestring without
>> knowing how 16bit unicode works. The compiler can't solve that.
> I disagree. Java does just that! The issue is the assumption of using
> array indexing into a string. I guess developers should stop doing
> that.
>
> The important point is:
> But developers should be able to use Unicode strings without needing
> to know the ins and outs of Unicode and UTF-16 encoding. At least
> that's what's possible with Java and other languages.
Yes, you absolutely need to know the ins and outs of Unicode in order to
know how to extract the first character of a string. First of all, what
is a character? A UTF-16 code unit, a Unicode code point or an extended
grapheme cluster? Your Java code only does the expected thing for a
certain subset of characters. If you write your code like that, you're
going to think your code works, but it will fail on strings that contain
either non-BMP characters (if you use charAt) or combining
characters (if you use codePointAt). To split the string into
user-perceived characters, you need to do this in FPC trunk:
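{$mode objfpc}{$H+}
{$codepage utf8}  // assumes this source file is saved as UTF-8

uses
  graphemebreakproperty, fpwidestring;

var
  EGC, S: UnicodeString;
begin
  S := '💩Хей, помисли́ си!';
  // Each iteration yields one extended grapheme cluster
  for EGC in TUnicodeStringExtendedGraphemeClustersEnumerator.Create(S) do
    Writeln(EGC);
end.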
Can Java do that? No, it appears it can't:
https://stackoverflow.com/questions/40878804/how-to-count-grapheme-clusters-or-perceived-emoji-characters-in-java
Neither charAt nor codePointAt will work for the 'и́'. charAt will also
fail at '💩'. Please correct me if I'm wrong; I didn't test this in Java.
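The same two failure modes can be reproduced on the FPC side with plain
string indexing. What follows is only a rough sketch: the helpers
IsSurrogatePairAt and CountCodePoints are hypothetical names written for
this illustration, not RTL routines, and the strings are built from
explicit UTF-16 code units so the result does not depend on the source
file encoding.

```pascal
program CodeUnitDemo;
{$mode objfpc}{$H+}

// Hypothetical helper (not part of the RTL): True if the UTF-16 code
// unit at index I is a high surrogate followed by a low surrogate.
function IsSurrogatePairAt(const S: UnicodeString; I: SizeInt): Boolean;
begin
  Result := (I < Length(S)) and
            (Ord(S[I]) >= $D800) and (Ord(S[I]) <= $DBFF) and
            (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF);
end;

// Hypothetical helper: counts code points by treating every valid
// surrogate pair as a single code point.
function CountCodePoints(const S: UnicodeString): SizeInt;
var
  I: SizeInt;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if IsSurrogatePairAt(S, I) then
      Inc(I, 2)
    else
      Inc(I);
    Inc(Result);
  end;
end;

var
  Poo, IAcute: UnicodeString;
begin
  // U+1F4A9 (the '💩' above) as its UTF-16 surrogate pair.
  SetLength(Poo, 2);
  Poo[1] := WideChar($D83D);
  Poo[2] := WideChar($DCA9);

  // Cyrillic 'и' (U+0438) followed by a combining acute accent (U+0301).
  SetLength(IAcute, 2);
  IAcute[1] := WideChar($0438);
  IAcute[2] := WideChar($0301);

  // Indexing returns a code unit: Poo[1] is only the high surrogate.
  Writeln('Poo code units:     ', Length(Poo));            // 2
  Writeln('Poo code points:    ', CountCodePoints(Poo));   // 1
  Writeln('Poo[1] as hex:      ', HexStr(LongInt(Ord(Poo[1])), 4));

  // Counting code points fixes the surrogate case but still splits the
  // accented letter, which a user perceives as one character.
  Writeln('IAcute code units:  ', Length(IAcute));          // 2
  Writeln('IAcute code points: ', CountCodePoints(IAcute)); // 2
end.
```

Indexing by code unit breaks the first string, and counting code points
still breaks the second one. Only grapheme cluster segmentation treats
both as a single user-perceived character.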
> FPC (and Delphi) really need to get with the times.
If by "get with the times" you mean always include the fpwidestring unit
and still produce less bloat than the JVM, then sure, we can do that,
but some people appreciate the flexibility of choosing their own wide
string manager, or of leaving it out of programs that don't need it.
And for things like splitting a string into characters, you really need
to know what you're doing anyway, since a Unicode code point does not
always correspond to what users perceive as a character.
Nikolay