<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">Am 01.12.2024 um 14:37 schrieb Hairy
Pixels via fpc-pascal:<br>
</div>
<blockquote type="cite"
cite="mid:CAGsUGtnogukyZZrN149eh+jYHM5UhQufYj9_=hxFrTAEDgiG1w@mail.gmail.com">
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Dec 1, 2024 at 2:23:08 PM,
Nikolay Nikolov via fpc-pascal &lt;<a
href="mailto:fpc-pascal@lists.freepascal.org"
moz-do-not-send="true" class="moz-txt-link-freetext">fpc-pascal@lists.freepascal.org</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"
type="cite"> Here's how Free Pascal types map to Unicode
terminology:<br>
<br>
WideChar = UTF-16 code unit<br>
<br>
UnicodeString = UTF-16 encoded string<br>
<br>
WideString = UTF-16 encoded string. On Windows it's not
reference<br>
counted - used for COM compatibility. On other platforms, it's
the same<br>
as UnicodeString.<br>
<br>
UTF8String = UTF-8 encoded string. Defined as UTF8String=type<br>
AnsiString(CP_UTF8).<br>
<br>
UTF16String = alias for UnicodeString<br>
<br>
Hope this clears things up.<br>
<br>
<br>
Another thing:<br>
<br>
For conversions between different encodings to work (e.g.
between UTF-8<br>
and UTF-16), you need to install a widestring manager. Some
platforms<br>
(like Windows) always include one by default, but other
platforms (e.g.<br>
Linux) don't, in order to reduce bloat, for programs that
don't need it.<br>
For these, you may need to include unit cwstring or something
like that. </blockquote>
</div>
<div dir="ltr"><br>
</div>
<div dir="ltr">Including that unit is sneaky, seems you need it
anytime dealing with unicode. Not sure how it even knows to
change the meaning of those character constants.</div>
</blockquote>
<br>
There is nothing sneaky about this. FPC does not install a widestring
manager by default on these platforms, so that programs which don't
need one avoid linking against the C library (as cwstring does) or
pulling in quite a load of Unicode data (as fpwidestring does). It is
the same reason you need to use unit cthreads on *nix systems to
install the threading manager.<br>
<br>
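As a minimal sketch (the program name and variables are made up for
illustration), installing the widestring manager on a *nix target only
takes adding cwstring to the uses clause, ideally as one of the first
units:<br>
<br>
```pascal
program wsmdemo;

{$mode objfpc}{$H+}
{$codepage utf8}

{$ifdef unix}
uses
  cwstring; // installs a libc-based widestring manager on *nix targets
{$endif}

var
  u: UnicodeString;
  a: AnsiString;
begin
  u := 'Grüße';
  // Implicit UTF-16 -> DefaultSystemCodePage conversion; this is the kind
  // of conversion that goes through the installed widestring manager.
  a := u;
  Writeln(Length(u), ' UTF-16 code units');
  Writeln(Length(a), ' bytes in code page ', StringCodePage(a));
end.
```
<br>
The second line depends on the system code page, so no expected output
is given for it. On Windows the uses clause is skipped entirely, because
a widestring manager is always present there.<br>
<br>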
<blockquote type="cite"
cite="mid:CAGsUGtnogukyZZrN149eh+jYHM5UhQufYj9_=hxFrTAEDgiG1w@mail.gmail.com">
<div dir="ltr">Using the term “char” was maybe a mistake. This
misleads people into thinking it’s a “character” as they
perceive it but really it’s just a code point.</div>
</blockquote>
<br>
There isn't much choice here, because that type name dates back to old
Pascal times and will not change. What will change is what it maps to:
once the Unicode RTL is enabled, it will be Char = WideChar instead of
Char = AnsiChar as it is now.<br>
<br>
<blockquote type="cite"
cite="mid:CAGsUGtnogukyZZrN149eh+jYHM5UhQufYj9_=hxFrTAEDgiG1w@mail.gmail.com">
<div dir="ltr">Why isn’t there a “UnicodeChar” type which is 4
bytes and hold a full UTF-8 character?</div>
</blockquote>
<br>
There is: it's called UCS4Char. Note that it doesn't hold a "full UTF-8
character" (UTF-8 is a byte encoding that uses one to four bytes per
code point), but simply a "Unicode code point".<br>
<br>
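A short sketch of how the types relate, assuming the RTL's
UnicodeStringToUCS4String combines surrogate pairs and appends a
terminating zero element to the resulting UCS4String (which is my
understanding of current FPC releases); the program and variable names
are only for illustration:<br>
<br>
```pascal
program ucs4demo;

{$mode objfpc}{$H+}
{$codepage utf8}

var
  s: UnicodeString;
  u4: UCS4String;
  i: SizeInt;
begin
  s := 'Hi🌎';
  // Convert the UTF-16 string into a dynamic array of code points.
  u4 := UnicodeStringToUCS4String(s);

  Writeln('SizeOf(WideChar) = ', SizeOf(WideChar));
  Writeln('SizeOf(UCS4Char) = ', SizeOf(UCS4Char));
  Writeln('UTF-16 code units: ', Length(s));
  // Assumption: the last array element is a terminating zero.
  Writeln('code points:       ', Length(u4) - 1);

  for i := 0 to Length(u4) - 2 do
    Writeln('U+', HexStr(Ord(u4[i]), 6));
end.
```
<br>
The globe takes two WideChar code units in s, but a single UCS4Char
element in u4.<br>
<br>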
<blockquote type="cite"
cite="mid:CAGsUGtnogukyZZrN149eh+jYHM5UhQufYj9_=hxFrTAEDgiG1w@mail.gmail.com">
<div dir="ltr">Choosing UTF-16 for UnicodeString was probably a
mistake too. </div>
</blockquote>
<br>
Take that up with Borland, because they are the ones who named it
"UnicodeString". That is mainly because they only had to deal with
Windows compatibility, where strings were either in one of the
single-byte encodings or in UTF-16.<br>
<br>
<blockquote type="cite"
cite="mid:CAGsUGtnogukyZZrN149eh+jYHM5UhQufYj9_=hxFrTAEDgiG1w@mail.gmail.com">
<div dir="ltr">It’s my understanding all websites are UTF-8 which
means this encoding will dominate everything. I think UTF-8 is
by far the most used right?</div>
</blockquote>
<br>
UTF-8 is usually used for storing and exchanging text, because for
mostly-ASCII content it is the most compact Unicode encoding. However,
many languages and runtimes, including JavaScript, Java's JVM, the .NET
CLR, Windows, Qt, UEFI and Delphi >= 2009, use UTF-16 internally.<br>
<br>
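To get a feel for the size difference, here is a small sketch that
compares the encoded size of a few sample strings. Utf8Bytes,
Utf16Bytes and Report are hypothetical helpers written for this
example; UTF8Encode is the RTL function from the system unit:<br>
<br>
```pascal
program encsize;

{$mode objfpc}{$H+}
{$codepage utf8}

// Number of bytes the text occupies when encoded as UTF-8.
function Utf8Bytes(const aText: UnicodeString): SizeInt;
begin
  Result := Length(UTF8Encode(aText));
end;

// Number of bytes the text occupies when encoded as UTF-16.
function Utf16Bytes(const aText: UnicodeString): SizeInt;
begin
  Result := Length(aText) * SizeOf(WideChar);
end;

procedure Report(const aName: AnsiString; const aText: UnicodeString);
begin
  Writeln(aName, ': UTF-8 = ', Utf8Bytes(aText),
    ' bytes, UTF-16 = ', Utf16Bytes(aText), ' bytes');
end;

begin
  Report('ASCII', 'Hello, world');
  Report('German', 'Grüße');
  Report('Japanese', '日本語');
  Report('Emoji', '🌎');
end.
```
<br>
With the source file saved as UTF-8, the ASCII and German samples
should come out smaller in UTF-8, the Japanese sample smaller in
UTF-16 (three bytes versus two per character), and the emoji takes
four bytes in both.<br>
<br>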
<blockquote type="cite"
cite="mid:CAGsUGtnogukyZZrN149eh+jYHM5UhQufYj9_=hxFrTAEDgiG1w@mail.gmail.com">
<div dir="ltr">As a user I would expect to take a string constant
and assigning it to a UnicodeString would let me iterate over
UnicodeChar. That’s logical right? Maybe this is just left
undone as of now. I don’t know.</div>
<div dir="ltr"><br>
</div>
<div dir="ltr">var</div>
<div dir="ltr">
<div dir="ltr"> u: UnicodeChar;</div>
<div dir="ltr"> s: UnicodeString;</div>
<div dir="ltr">begin</div>
<div dir="ltr"> s := 'Hello, 🌎!';</div>
<div dir="ltr"> for u in s do</div>
<div dir="ltr"> writeln(u);</div>
</div>
</blockquote>
<br>
Here you go:<br>
<br>
=== code begin ===<br>
<br>
program tstrenum;<br>
<br>
{$codepage utf8}<br>
{$mode objfpc}{$H+}<br>
{$modeswitch advancedrecords}<br>
<br>
type<br>
TUCS4CharUnicodeStrEnumerator = record<br>
private<br>
fStr: UnicodeString;<br>
fIndex: SizeInt;<br>
fCurrent: UCS4Char;<br>
public<br>
constructor Create(const aStr: UnicodeString);<br>
function MoveNext: Boolean;<br>
property Current: UCS4Char read fCurrent;<br>
end;<br>
<br>
constructor TUCS4CharUnicodeStrEnumerator.Create(const aStr:
UnicodeString);<br>
begin<br>
fStr := aStr;<br>
fIndex := 0; // strings are 1-based; MoveNext increments before reading<br>
fCurrent := 0;<br>
end;<br>
<br>
function TUCS4CharUnicodeStrEnumerator.MoveNext: Boolean;<br>
begin<br>
Inc(fIndex);<br>
if fIndex > Length(fStr) then<br>
Exit(False);<br>
if (Ord(fStr[fIndex]) >= $D800) and (Ord(fStr[fIndex]) <=
$DBFF) then begin<br>
if fIndex < High(fStr) then begin<br>
if (Ord(fStr[fIndex + 1]) >= $DC00) and (Ord(fStr[fIndex +
1]) <= $DFFF) then begin<br>
fCurrent := UCS4Char(Ord(fStr[fIndex]) - $D800) shl 10 +
UCS4Char(Ord(fStr[fIndex + 1])) - $DC00 + $10000;<br>
Inc(fIndex);<br>
end else<br>
fCurrent := Ord(fStr[fIndex]);<br>
end else<br>
fCurrent := Ord(fStr[fIndex]);<br>
end else<br>
fCurrent := Ord(fStr[fIndex]);<br>
Result := True;<br>
end;<br>
<br>
operator Enumerator(const aStr: UnicodeString):
TUCS4CharUnicodeStrEnumerator;<br>
begin<br>
Result := TUCS4CharUnicodeStrEnumerator.Create(aStr);<br>
end;<br>
<br>
var<br>
s: UnicodeString;<br>
u: UCS4Char;<br>
begin<br>
s := 'Hello, 🌎!';<br>
for u in s do<br>
Writeln(HexStr(Ord(u), 8));<br>
end.<br>
<br>
=== code end ===<br>
<br>
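The surrogate pair arithmetic used in MoveNext can also be checked in
isolation. In this sketch CombineSurrogates is a hypothetical helper
(it is not an RTL function), and the input is the UTF-16 pair D83C DF0E
that encodes U+1F30E, the globe from the example string:<br>
<br>
```pascal
program surrogates;

{$mode objfpc}{$H+}

// Hypothetical helper: combines a high and a low surrogate into one
// code point, using the same formula as MoveNext.
function CombineSurrogates(aHigh, aLow: Word): UCS4Char;
begin
  Result := UCS4Char(aHigh - $D800) shl 10
          + UCS4Char(aLow - $DC00)
          + $10000;
end;

begin
  // ($3C shl 10) + $30E + $10000 = $1F30E
  Writeln(HexStr(Ord(CombineSurrogates($D83C, $DF0E)), 8)); // prints 0001F30E
end.
```
<br>
The helper does no validation; the caller is expected to have checked
that the two values lie in the $D800..$DBFF and $DC00..$DFFF ranges, as
MoveNext does.<br>
<br>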
Regards,<br>
Sven<br>
</body>
</html>