<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p><br>

    </p>

    <div class="moz-cite-prefix">On 3/9/21 2:18 AM, Graeme Geldenhuys

      via fpc-pascal wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:39700fe4-7808-48da-fa34-4a8b11f8a67c@geldenhuys.co.uk">

      <pre class="moz-quote-pre" wrap="">

On 08/03/2021 7:49 pm, Jonas Maebe via fpc-pascal wrote:

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">It's not possible to safely use unicodestring without

knowing how 16bit unicode works. The compiler can't solve that.

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">

I disagree. Java does just that! The issue is the assumption of using

array indexing into a string. I guess developers should stop doing

that.

The important point is:

But developers should be able to use Unicode strings without needing

to know the ins and outs of Unicode and UTF-16 encoding. At least

that's what's possible with Java and other languages.</pre>

    </blockquote>

    <p>Yes, you absolutely need to know the ins and outs of Unicode in

      order to know how to extract the first character of a string.

      First of all, what is a character? A UTF-16 code unit, a Unicode

      code point or an extended grapheme cluster? Your Java code only

      does the expected thing for a certain subset of characters. If you

      write your code like that, you will think it works, but it will

      fail on strings with non-BMP characters (if you use charAt) or

      with combining characters (if you use codePointAt). To split the

      string into user-perceived characters, you need to do this in FPC

      trunk:</p>

    <pre>uses
  graphemebreakproperty, fpwidestring;

var
  EGC, S: UnicodeString;

begin
  S := '💩Хей, помисли́ си!';
  // Each EGC is one extended grapheme cluster (user-perceived character)
  for EGC in TUnicodeStringExtendedGraphemeClustersEnumerator.Create(S) do
    Writeln(EGC);
end.</pre>

    <p><br>

    </p>

    <p>Can Java do that? No, it appears it can't:</p>

    <p><a class="moz-txt-link-freetext" href="https://stackoverflow.com/questions/40878804/how-to-count-grapheme-clusters-or-perceived-emoji-characters-in-java">https://stackoverflow.com/questions/40878804/how-to-count-grapheme-clusters-or-perceived-emoji-characters-in-java</a><br>

    </p>

    <p><br>

    </p>

    <p>Neither charAt nor codePointAt will work for the 'и́'. charAt

      will also fail at '💩'.

      Please correct me if I'm wrong, I didn't test this in Java.</p>
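    <p>For what it's worth, both failure modes can be checked with a
      small Java sketch. It is only an illustration: the class and method
      names are made up here, and the sample is shortened to the emoji
      followed by 'и' plus a combining acute accent (U+0301), written
      with escapes so the source file encoding doesn't matter.</p>

```java
// Illustrative sketch only; FirstCharacterDemo and its methods are hypothetical names.
final class FirstCharacterDemo {
    // U+1F4A9 (a surrogate pair in UTF-16), then U+0438 followed by combining U+0301.
    static final String SAMPLE = "\uD83D\uDCA9\u0438\u0301";

    // What charAt sees: a single UTF-16 code unit.
    static String firstCodeUnit(String s) {
        return String.valueOf(s.charAt(0));
    }

    // What codePointAt sees: one whole code point, without any combining marks after it.
    static String codePointStartingAt(String s, int index) {
        return new String(Character.toChars(s.codePointAt(index)));
    }

    public static void main(String[] args) {
        // Two user-perceived characters, but 4 code units and 3 code points.
        System.out.println(SAMPLE.length());                            // 4
        System.out.println(SAMPLE.codePointCount(0, SAMPLE.length()));  // 3
        // charAt(0) is only the high surrogate of the emoji.
        System.out.println(Character.isHighSurrogate(SAMPLE.charAt(0))); // true
        // codePointAt(0) recovers the whole emoji.
        System.out.println(Integer.toHexString(SAMPLE.codePointAt(0)));  // 1f4a9
        // codePointAt(2) returns the base letter and drops the accent.
        System.out.println(Integer.toHexString(SAMPLE.codePointAt(2)));  // 438
    }
}
```

    <p>So charAt(0) yields half of the emoji, and codePointAt(2) yields
      a bare 'и' without its accent; neither call returns the
      user-perceived character.</p>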

    <p><br>

    </p>

    <blockquote type="cite"

      cite="mid:39700fe4-7808-48da-fa34-4a8b11f8a67c@geldenhuys.co.uk">

      <pre class="moz-quote-pre" wrap="">FPC (and Delphi) really need to get with the times.</pre>

    </blockquote>

    <p>If by "get with the times" you mean always include the

      fpwidestring unit and still produce less bloat than the JVM, then

      sure, we can do that, but some people appreciate the flexibility

      of choosing their own wide string manager, or of leaving it out of

      programs that don't need it.</p>

    <p>And for things like splitting a string into characters, you

      really need to know what you're doing anyway, since a single

      Unicode code point does not always correspond to what users

      perceive as a character.</p>

    <p>Nikolay</p>

    <br>

  </body>

</html>