<html><body><div class="gmail_quote">

    <div dir="ltr" class="gmail_attr">On Dec 1, 2024 at 2:23:08 PM, Nikolay Nikolov via fpc-pascal &lt;<a href="mailto:fpc-pascal@lists.freepascal.org">fpc-pascal@lists.freepascal.org</a>&gt; wrote:<br></div>

    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" type="cite">

Here's how Free Pascal types map to Unicode terminology:<br>
<br>
WideChar = UTF-16 code unit<br>
UnicodeString = UTF-16 encoded string<br>
WideString = UTF-16 encoded string. On Windows it's not reference counted; it's used for COM compatibility. On other platforms, it's the same as UnicodeString.<br>
UTF8String = UTF-8 encoded string. Defined as UTF8String = type AnsiString(CP_UTF8).<br>
UTF16String = alias for UnicodeString<br>
<br>
Hope this clears things up.<br>
<br>
Another thing: For conversions between different encodings to work (e.g. between UTF-8 and UTF-16), you need to install a widestring manager. Some platforms (like Windows) always include one by default, but other platforms (e.g. Linux) don't, in order to reduce bloat for programs that don't need it. For these, you may need to include unit cwstring or something like that.

    </blockquote>

</div>

<div dir="ltr"><br></div>
<div dir="ltr">Including that unit is sneaky; it seems you need it any time you deal with Unicode. I'm not sure how it even knows to change the meaning of those character constants.</div>
<div dir="ltr"><br></div>
<div dir="ltr">Using the term “char” was maybe a mistake. It misleads people into thinking it’s a “character” as they perceive it, but a WideChar is really just a UTF-16 code unit. Why isn’t there a “UnicodeChar” type that is 4 bytes and holds a full Unicode code point? That’s probably what most people expect when they think “unicode” and “character”. There are still compound characters that appear as one but are actually several code points overlaid, but getting at the component parts is still useful.</div>
<div dir="ltr"><br></div>
<div dir="ltr">Choosing UTF-16 for UnicodeString was probably a mistake too. It’s my understanding that nearly all websites are UTF-8, which means this encoding will dominate everything. I think UTF-8 is by far the most used, right?</div>
<div dir="ltr"><br></div>
<div dir="ltr">As a user I would expect that assigning a string constant to a UnicodeString would let me iterate over it as UnicodeChar values. That’s logical, right? Maybe this is just left undone as of now. I don’t know.</div>
<div dir="ltr"><br></div>
<div dir="ltr">var</div>
<div dir="ltr"><div dir="ltr">  u: UnicodeChar;</div><div dir="ltr">  s: UnicodeString;</div><div dir="ltr">begin</div><div dir="ltr">  s := 'Hello, 🌎!';</div><div dir="ltr">  for u in s do</div><div dir="ltr">    writeln(u);</div><div dir="ltr">end.</div><div><br></div></div><div dir="ltr">
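<div dir="ltr">For comparison, here is a minimal sketch of how the loop above could be written by hand today, walking the UTF-16 code units and joining surrogate pairs into full code points. It assumes the source file is saved as UTF-8 (hence the codepage directive); the program and variable names are only illustrative, and it prints code point numbers rather than glyphs so that no console encoding setup is needed.</div>
<pre>
program CodePointSketch;  // hypothetical name, for illustration only
{$mode objfpc}{$H+}
{$codepage utf8}          // assumption: this source file is saved as UTF-8

uses
  SysUtils;               // for IntToHex

var
  s: UnicodeString;
  i: Integer;
  cp: LongInt;            // wide enough for any code point (max $10FFFF)
begin
  s := 'Hello, 🌎!';
  i := 1;
  while i &lt;= Length(s) do
  begin
    cp := Ord(s[i]);
    // A high surrogate followed by a low surrogate encodes
    // a single code point above U+FFFF.
    if (cp &gt;= $D800) and (cp &lt;= $DBFF) and (i &lt; Length(s)) and
       (Ord(s[i + 1]) &gt;= $DC00) and (Ord(s[i + 1]) &lt;= $DFFF) then
    begin
      cp := $10000 + ((cp - $D800) shl 10) + (Ord(s[i + 1]) - $DC00);
      Inc(i);             // skip the low surrogate we just consumed
    end;
    WriteLn('U+', IntToHex(cp, 4));
    Inc(i);
  end;
end.
</pre>
<div dir="ltr">With this approach the globe is reported as one code point instead of two surrogate halves, which is roughly what I would expect a 4-byte character type to give me directly. It still does not group combining sequences into what a user sees as one character.</div>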

    <br clear="all"><div><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">Regards,</div>    Ryan Joseph</div></div><br>

</div></body></html>