[fpc-pascal] Printing unicode characters

Sun Dec 1 19:47:04 CET 2024

On 01.12.2024 at 14:37, Hairy Pixels via fpc-pascal wrote:
> On Dec 1, 2024 at 2:23:08 PM, Nikolay Nikolov via fpc-pascal 
> <fpc-pascal at lists.freepascal.org> wrote:
>> Here's how Free Pascal types map to Unicode terminology:
>>
>> WideChar = UTF-16 code unit
>>
>> UnicodeString = UTF-16 encoded string
>>
>> WideString = UTF-16 encoded string. On Windows it's not reference
>> counted - used for COM compatibility. On other platforms, it's the same
>> as UnicodeString.
>>
>> UTF8String = UTF-8 encoded string. Defined as UTF8String=type
>> AnsiString(CP_UTF8).
>>
>> UTF16String = alias for UnicodeString
>>
>> Hope this clears things up.
>>
>>
>> Another thing:
>>
>> For conversions between different encodings to work (e.g. between UTF-8
>> and UTF-16), you need to install a widestring manager. Some platforms
>> (like Windows) always include one by default, but other platforms (e.g.
>> Linux) don't, in order to reduce bloat, for programs that don't need it.
>> For these, you may need to include unit cwstring or something like that. 
>
> Including that unit is sneaky; it seems you need it any time you deal 
> with Unicode. Not sure how it even knows to change the meaning of those 
> character constants.

There is nothing sneaky about this. This is simply how FPC works: the 
widestring manager is opt-in, so that code which does not need it avoids 
linking against the C library (with cwstring) or pulling in quite a load 
of Unicode data (with fpwidestring). It is the same reason you need unit 
cthreads on *nix systems to install the threading manager.
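
As a minimal sketch of what that looks like in practice (the program and 
variable names are made up for illustration, and the source file is 
assumed to be saved as UTF-8): the unit is only pulled in on Unix-like 
targets, and the assignment from UnicodeString to UTF8String is the 
conversion that goes through the installed widestring manager.

```pascal
program twsmanager;

{$codepage utf8}
{$mode objfpc}{$H+}

{$ifdef unix}
uses
  cwstring; // installs a widestring manager backed by the C library
{$endif}

var
  u: UnicodeString;
  s: UTF8String;
begin
  u := 'Grüße';
  // Implicit UTF-16 -> UTF-8 conversion; this is the step that relies
  // on a widestring manager being installed.
  s := u;
  Writeln('UTF-16 code units: ', Length(u));
  Writeln('UTF-8 bytes:       ', Length(s));
end.
```

On Windows the uses clause is skipped entirely, because a widestring 
manager is already present there.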

> Using the term “char” was maybe a mistake. This misleads people into 
> thinking it’s a “character” as they perceive it but really it’s just a 
> code point.

There isn't much choice here, because that type name has existed since 
the old Pascal days and will not change. (Well, okay, it will change in 
so far as Char = WideChar instead of Char = AnsiChar, as it is now, once 
the Unicode RTL is enabled.)

> Why isn’t there a “UnicodeChar” type which is 4 bytes and holds a full 
> UTF-8 character?

There is, it's called UCS4Char. Also it's not a "full UTF-8 character", 
but simply a "Unicode code point".
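
To illustrate the difference between a UTF-16 code unit (WideChar) and a 
code point (UCS4Char), here is a small sketch. It assumes the source file 
is saved as UTF-8; the program name is arbitrary.

```pascal
program tcodeunits;

{$codepage utf8}
{$mode objfpc}{$H+}

var
  s: UnicodeString;
  i: SizeInt;
begin
  s := '🌎'; // U+1F30E, outside the Basic Multilingual Plane
  // Length counts UTF-16 code units, not code points.
  Writeln('Code units: ', Length(s));
  // Print each code unit of the string as four hex digits.
  for i := 1 to Length(s) do
    Writeln(HexStr(Ord(s[i]), 4));
  Writeln('SizeOf(WideChar) = ', SizeOf(WideChar));
  Writeln('SizeOf(UCS4Char) = ', SizeOf(UCS4Char));
end.
```

The globe should be stored as the surrogate pair D83C DF0E, so a single 
code point occupies two WideChar elements of the string.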

> Choosing UTF-16 for UnicodeString was probably a mistake too.

Take that up with Borland, because they named it "UnicodeString". That 
is mainly because they only had to deal with Windows compatibility, 
where there were either the single-byte encodings or UTF-16.

> It’s my understanding all websites are UTF-8, which means this encoding 
> will dominate everything. I think UTF-8 is by far the most used, right?

UTF-8 is usually used for storing and exchanging text, because it is 
typically the most memory-dense Unicode encoding (at least for 
ASCII-heavy text). However, many languages and runtimes, including 
JavaScript, Java's JVM, the .NET CLR, Windows, Qt, UEFI and 
Delphi >= 2009, use UTF-16 internally.

> As a user I would expect that taking a string constant and assigning it 
> to a UnicodeString would let me iterate over UnicodeChar. That’s 
> logical, right? Maybe this is just left undone as of now. I don’t know.
>
> var
>   u: UnicodeChar;
>   s: UnicodeString;
> begin
>   s := 'Hello, 🌎!';
>   for u in s do
>     writeln(u);

Here you go:

=== code begin ===

program tstrenum;

{$codepage utf8}
{$mode objfpc}{$H+}
{$modeswitch advancedrecords}

type
   TUCS4CharUnicodeStrEnumerator = record
   private
     fStr: UnicodeString;
     fIndex: SizeInt;
     fCurrent: UCS4Char;
   public
     constructor Create(const aStr: UnicodeString);
     function MoveNext: Boolean;
     property Current: UCS4Char read fCurrent;
   end;

constructor TUCS4CharUnicodeStrEnumerator.Create(const aStr: UnicodeString);
begin
   fStr := aStr;
   fIndex := 0; // strings are 1-based and MoveNext increments before reading
   fCurrent := 0;
end;

function TUCS4CharUnicodeStrEnumerator.MoveNext: Boolean;
begin
   Inc(fIndex);
   if fIndex > Length(fStr) then
     Exit(False);
   // A high surrogate followed by a low surrogate encodes one code point.
   if (Ord(fStr[fIndex]) >= $D800) and (Ord(fStr[fIndex]) <= $DBFF) then begin
     if fIndex < High(fStr) then begin
       if (Ord(fStr[fIndex + 1]) >= $DC00) and (Ord(fStr[fIndex + 1]) <= $DFFF) then begin
         fCurrent := UCS4Char(Ord(fStr[fIndex]) - $D800) shl 10 +
           UCS4Char(Ord(fStr[fIndex + 1])) - $DC00 + $10000;
         Inc(fIndex); // skip the low surrogate that was just consumed
       end else
         fCurrent := Ord(fStr[fIndex]); // unpaired high surrogate
     end else
       fCurrent := Ord(fStr[fIndex]); // high surrogate at end of string
   end else
     fCurrent := Ord(fStr[fIndex]);
   Result := True;
end;

operator Enumerator(const aStr: UnicodeString): TUCS4CharUnicodeStrEnumerator;
begin
   Result := TUCS4CharUnicodeStrEnumerator.Create(aStr);
end;

var
   s: UnicodeString;
   u: UCS4Char;
begin
   s := 'Hello, 🌎!';
   for u in s do
     Writeln(HexStr(Ord(u), 8));
end.

=== code end ===
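
If you only need the code points and not an enumerator, the RTL's 
UnicodeStringToUCS4String can do the surrogate decoding for you. A 
minimal sketch follows; the program name is arbitrary, and the loop 
assumes the returned array carries a trailing zero terminator, so it 
simply stops at the first zero element.

```pascal
program tucs4conv;

{$codepage utf8}
{$mode objfpc}{$H+}

var
  s: UnicodeString;
  u4: UCS4String;
  i: SizeInt;
begin
  s := 'Hello, 🌎!';
  // Converts UTF-16 code units to code points, merging surrogate pairs.
  u4 := UnicodeStringToUCS4String(s);
  for i := 0 to High(u4) do begin
    // Assumption: the array ends with a zero terminator; stop there.
    if u4[i] = 0 then
      Break;
    Writeln(HexStr(Ord(u4[i]), 8));
  end;
end.
```

This trades the lazy, allocation-free iteration of the enumerator for a 
one-off copy into a dynamic array.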

Regards,
Sven