<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p><br>
</p>
<div class="moz-cite-prefix">On 3/9/21 2:18 AM, Graeme Geldenhuys
via fpc-pascal wrote:<br>
</div>
<blockquote type="cite"
cite="mid:39700fe4-7808-48da-fa34-4a8b11f8a67c@geldenhuys.co.uk">
<pre class="moz-quote-pre" wrap="">
On 08/03/2021 7:49 pm, Jonas Maebe via fpc-pascal wrote:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">It's not possible to safely use unicodestring without
knowing how 16-bit Unicode works. The compiler can't solve that.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
I disagree. Java does just that! The issue is the assumption of using
array indexing into a string. I guess developers should stop doing
that.
The important point is:
But developers should be able to use Unicode strings without needing
to know the ins and outs of Unicode and UTF-16 encoding. At least
that's what's possible with Java and other languages.</pre>
</blockquote>
<p>Yes, you absolutely need to know the ins and outs of Unicode just
to extract the first character of a string. First of all, what is a
character: a UTF-16 code unit, a Unicode code point or an extended
grapheme cluster? Your Java code only does the expected thing for a
subset of characters. Code written like that appears to work, but
it fails on strings with non-BMP characters (if you use charAt) or
on strings with combining characters (if you use codePointAt). To
split the string into user-perceived characters, you need to do
this in FPC trunk:</p>
<pre>uses
  graphemebreakproperty, fpwidestring;
var
  EGC, S: UnicodeString;
begin
  S := '💩Хей, помисли́ си!';
  for EGC in TUnicodeStringExtendedGraphemeClustersEnumerator.Create(S) do
    Writeln(EGC);
end.</pre>
<p><br>
</p>
<p>Can Java do that? No, it appears it can't:</p>
<p><a class="moz-txt-link-freetext" href="https://stackoverflow.com/questions/40878804/how-to-count-grapheme-clusters-or-perceived-emoji-characters-in-java">https://stackoverflow.com/questions/40878804/how-to-count-grapheme-clusters-or-perceived-emoji-characters-in-java</a><br>
</p>
<p><br>
</p>
<p>Neither charAt nor codePointAt will work for 'и́', and charAt
will also fail at '💩'.
Please correct me if I'm wrong; I didn't test this in Java.</p>
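<p>A minimal Java sketch of that claim, using only standard
java.lang calls. The class and method names are made up for
illustration, and the two test strings are written with escapes so
the source encoding does not matter:</p>
<pre>
```java
final class FirstCharDemo {
    // U+1F4A9 is outside the BMP, so UTF-16 stores it as a surrogate pair.
    static final String PILE_OF_POO = "\uD83D\uDCA9";
    // Cyrillic U+0438 followed by the combining acute accent U+0301.
    static final String I_WITH_ACUTE = "\u0438\u0301";

    // "First character" taken as a single UTF-16 code unit.
    static String firstCodeUnit(String s) {
        return String.valueOf(s.charAt(0));
    }

    // "First character" taken as a single Unicode code point.
    static String firstCodePoint(String s) {
        return new String(Character.toChars(s.codePointAt(0)));
    }

    public static void main(String[] args) {
        // charAt yields only the high surrogate of the emoji.
        System.out.println(Character.isHighSurrogate(firstCodeUnit(PILE_OF_POO).charAt(0)));
        // codePointAt recovers the whole emoji...
        System.out.println(firstCodePoint(PILE_OF_POO).equals(PILE_OF_POO));
        // ...but returns the base letter without its accent.
        System.out.println(firstCodePoint(I_WITH_ACUTE).equals(I_WITH_ACUTE));
    }
}
```
</pre>
<p>Run as written, this should print true, true and false: the code
point API handles the emoji, but the accented letter is still cut
in two.</p>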
<p><br>
</p>
<blockquote type="cite"
cite="mid:39700fe4-7808-48da-fa34-4a8b11f8a67c@geldenhuys.co.uk">
<pre class="moz-quote-pre" wrap="">FPC (and Delphi) really need to get with the times.</pre>
</blockquote>
<p>If by "get with the times" you mean always include the
fpwidestring unit and still produce less bloat than the JVM, then
sure, we can do that, but some people appreciate the flexibility
of choosing their own wide string manager, or of leaving it out of
programs that don't need it.</p>
<p>And for things like splitting a string into characters, you
really need to know what you're doing anyway, since a Unicode
code point does not reliably correspond to what users perceive as
a character.</p>
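<p>To make the three possible counts concrete, here is a small Java
sketch that splits a string per code point. The class and method
names are hypothetical; only String.codePoints and Character.toChars
come from the standard library:</p>
<pre>
```java
final class CodePointSplit {
    // Splits a string into one String per Unicode code point.
    static String[] splitByCodePoint(String s) {
        return s.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // Emoji U+1F4A9, then Cyrillic U+0438 with combining acute U+0301:
        // two user-perceived characters.
        String s = "\uD83D\uDCA9\u0438\u0301";
        System.out.println(s.length());                  // UTF-16 code units: 4
        System.out.println(splitByCodePoint(s).length);  // code points: 3
    }
}
```
</pre>
<p>The same two user-perceived characters count as four code units
or three code points, so neither number matches what the user sees.</p>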
<p>Nikolay</p>
<br>
</body>
</html>