<!DOCTYPE html>

<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <div class="moz-cite-prefix">On 01.12.2024 at 14:37, Hairy Pixels
      via fpc-pascal wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:CAGsUGtnogukyZZrN149eh+jYHM5UhQufYj9_=hxFrTAEDgiG1w@mail.gmail.com">

      <meta http-equiv="content-type" content="text/html; charset=UTF-8">

      <div class="gmail_quote">

        <div dir="ltr" class="gmail_attr">On Dec 1, 2024 at 2:23:08 PM,

          Nikolay Nikolov via fpc-pascal <<a

            href="mailto:fpc-pascal@lists.freepascal.org"

            moz-do-not-send="true" class="moz-txt-link-freetext">fpc-pascal@lists.freepascal.org</a>>

          wrote:<br>

        </div>

        <blockquote class="gmail_quote"

style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"

          type="cite"> Here's how Free Pascal types map to Unicode

          terminology:<br>

          <br>

          WideChar = UTF-16 code unit<br>

          <br>

          UnicodeString = UTF-16 encoded string<br>

          <br>

          WideString = UTF-16 encoded string. On Windows it's not

          reference<br>

          counted - used for COM compatibility. On other platforms, it's

          the same<br>

          as UnicodeString.<br>

          <br>

          UTF8String = UTF-8 encoded string. Defined as UTF8String=type<br>

          AnsiString(CP_UTF8).<br>

          <br>

          UTF16String = alias for UnicodeString<br>

          <br>

          Hope this clears things up.<br>

          <br>

          <br>

          Another thing:<br>

          <br>

          For conversions between different encodings to work (e.g.

          between UTF-8<br>

          and UTF-16), you need to install a widestring manager. Some

          platforms<br>

          (like Windows) always include one by default, but other

          platforms (e.g.<br>

          Linux) don't, in order to reduce bloat, for programs that

          don't need it.<br>

          For these, you may need to include unit cwstring or something

          like that. </blockquote>

      </div>

      <div dir="ltr"><br>

      </div>

      <div dir="ltr">Including that unit is sneaky, seems you need it

        anytime dealing with unicode. Not sure how it even knows to

        change the meaning of those character constants.</div>

    </blockquote>

    <br>

    There is nothing sneaky about this. FPC simply does not install a
    widestring manager by default on these platforms, so that programs
    which don't need one avoid linking against the C library (in the
    case of cwstring) or pulling in quite a load of Unicode data (in the
    case of fpwidestring). It is the same situation as having to use
    unit cthreads on *nix systems to install the threading manager.<br>
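    <br>
    As a minimal sketch (the program name, variable names and sample
    string are made up for illustration), installing the manager on a
    *nix system is just a matter of adding the unit to the uses clause
    of the main program:<br>
    <br>
    === code begin ===<br>
    <br>
    program twsmgr;<br>
    <br>
    {$codepage utf8}<br>
    {$mode objfpc}{$H+}<br>
    <br>
    {$ifdef unix}<br>
    uses<br>
      cwstring; // installs the C-library based widestring manager<br>
    {$endif}<br>
    <br>
    var<br>
      u8: UTF8String;<br>
      u16: UnicodeString;<br>
    begin<br>
      u8 := 'Grüße';<br>
      // converting UTF-8 to UTF-16 at runtime needs a widestring manager<br>
      u16 := u8;<br>
      Writeln(Length(u8), ' UTF-8 code units, ', Length(u16), ' UTF-16 code units');<br>
    end.<br>
    <br>
    === code end ===<br>
    <br>
    On Windows the uses clause is skipped, because a widestring manager
    is always available there.<br>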

    <br>

    <blockquote type="cite"

cite="mid:CAGsUGtnogukyZZrN149eh+jYHM5UhQufYj9_=hxFrTAEDgiG1w@mail.gmail.com">

      <div dir="ltr">Using the term “char” was maybe a mistake. This

        misleads people into thinking it’s a “character” as they

        perceive it but really it’s just a code point.</div>

    </blockquote>

    <br>

    There isn't much choice here, because that type name has existed
    since the old Pascal days and will not change (well, okay, it will
    change insofar as Char will be WideChar instead of AnsiChar, as it
    is now, once the Unicode RTL is enabled).<br>

    <br>

    <blockquote type="cite"

cite="mid:CAGsUGtnogukyZZrN149eh+jYHM5UhQufYj9_=hxFrTAEDgiG1w@mail.gmail.com">

      <div dir="ltr">Why isn’t there a “UnicodeChar” type which is 4

        bytes and hold a full UTF-8 character?</div>

    </blockquote>

    <br>

    There is, it's called UCS4Char. Also it's not a "full UTF-8

    character", but simply a "Unicode code point".<br>
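    <br>
    A small sketch to illustrate (program name and sample string are
    only for the example): UCS4Char is 4 bytes wide, and as far as I
    know UnicodeStringToUCS4String from the System unit combines
    surrogate pairs, so each array element holds one code point. Note
    that UCS4String is a dynamic array with a terminating 0 element,
    which the loop skips:<br>
    <br>
    === code begin ===<br>
    <br>
    program tucs4;<br>
    <br>
    {$codepage utf8}<br>
    {$mode objfpc}{$H+}<br>
    <br>
    var<br>
      s: UnicodeString;<br>
      u: UCS4String;<br>
      i: SizeInt;<br>
    begin<br>
      s := 'a🌎b';<br>
      Writeln('SizeOf(UCS4Char) = ', SizeOf(UCS4Char));<br>
      // one array element per code point, plus a terminating 0<br>
      u := UnicodeStringToUCS4String(s);<br>
      for i := 0 to Length(u) - 2 do<br>
        Writeln(HexStr(Ord(u[i]), 8));<br>
    end.<br>
    <br>
    === code end ===<br>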

    <br>

    <blockquote type="cite"

cite="mid:CAGsUGtnogukyZZrN149eh+jYHM5UhQufYj9_=hxFrTAEDgiG1w@mail.gmail.com">

      <div dir="ltr">Choosing UTF-16 for UnicodeString was probably a

        mistake too. </div>

    </blockquote>

    <br>

    Take that up with Borland, because they named it "UnicodeString".
    That is mainly because they only had to deal with Windows
    compatibility, where there were either the single-byte encodings or
    the UTF-16 encoding.<br>

    <br>

    <blockquote type="cite"

cite="mid:CAGsUGtnogukyZZrN149eh+jYHM5UhQufYj9_=hxFrTAEDgiG1w@mail.gmail.com">

      <div dir="ltr">It’s my understanding all websites are UTF-8 which

        means this encoding will dominate everything. I think UTF-8  is

        by far the most used right?</div>

    </blockquote>

    <br>

    UTF-8 is usually used for storing and exchanging text, because for
    text that is mostly ASCII it is the most memory-dense Unicode
    encoding. However, many languages and runtimes, including
    JavaScript, Java's JVM, the .NET CLR, Windows, Qt, UEFI and Delphi
    >= 2009, use UTF-16 internally.<br>

    <br>

    <blockquote type="cite"

cite="mid:CAGsUGtnogukyZZrN149eh+jYHM5UhQufYj9_=hxFrTAEDgiG1w@mail.gmail.com">

      <div dir="ltr">As a user I would expect to take a string constant

        and assigning it to a UnicodeString would let me iterate over

        UnicodeChar. That’s logical right?  Maybe this is just left

        undone as of now. I don’t know.</div>

      <div dir="ltr"><br>

      </div>

      <div dir="ltr">var</div>

      <div dir="ltr">

        <div dir="ltr">  u: UnicodeChar;</div>

        <div dir="ltr">  s: UnicodeString;</div>

        <div dir="ltr">begin</div>

        <div dir="ltr">  s := 'Hello, 🌎!';</div>

        <div dir="ltr">  for u in s do</div>

        <div dir="ltr">    writeln(u);</div>

      </div>

    </blockquote>

    <br>

    Here you go:<br>

    <br>

    === code begin ===<br>

    <br>

    program tstrenum;<br>

    <br>

    {$codepage utf8}<br>

    {$mode objfpc}{$H+}<br>

    {$modeswitch advancedrecords}<br>

    <br>

    type<br>

      TUCS4CharUnicodeStrEnumerator = record<br>

      private<br>

        fStr: UnicodeString;<br>

        fIndex: SizeInt;<br>

        fCurrent: UCS4Char;<br>

      public<br>

        constructor Create(const aStr: UnicodeString);<br>

        function MoveNext: Boolean;<br>

        property Current: UCS4Char read fCurrent;<br>

      end;<br>

    <br>

    constructor TUCS4CharUnicodeStrEnumerator.Create(const aStr:

    UnicodeString);<br>

    begin<br>

      fStr := aStr;<br>

      fIndex := 0;<br>

      fCurrent := 0;<br>

    end;<br>

    <br>

    function TUCS4CharUnicodeStrEnumerator.MoveNext: Boolean;<br>

    begin<br>

      Inc(fIndex);<br>

      if fIndex > Length(fStr) then<br>

        Exit(False);<br>

      if (Ord(fStr[fIndex]) >= $D800) and (Ord(fStr[fIndex]) <=

    $DBFF) then begin<br>

        if fIndex < Length(fStr) then begin<br>

          if (Ord(fStr[fIndex + 1]) >= $DC00) and (Ord(fStr[fIndex +

    1]) <= $DFFF) then begin<br>

            fCurrent := UCS4Char(Ord(fStr[fIndex]) - $D800) shl 10 +

    UCS4Char(Ord(fStr[fIndex + 1])) - $DC00 + $10000;<br>

            Inc(fIndex);<br>

          end else<br>

            fCurrent := Ord(fStr[fIndex]);<br>

        end else<br>

          fCurrent := Ord(fStr[fIndex]);<br>

      end else<br>

        fCurrent := Ord(fStr[fIndex]);<br>

      Result := True;<br>

    end;<br>

    <br>

    operator Enumerator(const aStr: UnicodeString):

    TUCS4CharUnicodeStrEnumerator;<br>

    begin<br>

      Result := TUCS4CharUnicodeStrEnumerator.Create(aStr);<br>

    end;<br>

    <br>

    var<br>

      s: UnicodeString;<br>

      u: UCS4Char;<br>

    begin<br>

      s := 'Hello, 🌎!';<br>

      for u in s do<br>

        Writeln(HexStr(Ord(u), 8));<br>

    end.<br>

    <br>

    === code end ===<br>

    <br>

    Regards,<br>

    Sven<br>

  </body>

</html>