[fpc-pascal] Printing unicode characters
Sven Barth
pascaldragon at googlemail.com
Sun Dec 1 19:47:04 CET 2024
On 01.12.2024 at 14:37, Hairy Pixels via fpc-pascal wrote:
> On Dec 1, 2024 at 2:23:08 PM, Nikolay Nikolov via fpc-pascal
> <fpc-pascal at lists.freepascal.org> wrote:
>> Here's how Free Pascal types map to Unicode terminology:
>>
>> WideChar = UTF-16 code unit
>>
>> UnicodeString = UTF-16 encoded string
>>
>> WideString = UTF-16 encoded string. On Windows it's not reference
>> counted - used for COM compatibility. On other platforms, it's the same
>> as UnicodeString.
>>
>> UTF8String = UTF-8 encoded string. Defined as UTF8String=type
>> AnsiString(CP_UTF8).
>>
>> UTF16String = alias for UnicodeString
>>
>> Hope this clears things up.
>>
>>
>> Another thing:
>>
>> For conversions between different encodings to work (e.g. between UTF-8
>> and UTF-16), you need to install a widestring manager. Some platforms
>> (like Windows) always include one by default, but other platforms (e.g.
>> Linux) don't, in order to reduce bloat, for programs that don't need it.
>> For these, you may need to include unit cwstring or something like that.
>
> Including that unit is sneaky, seems you need it anytime dealing with
> unicode. Not sure how it even knows to change the meaning of those
> character constants.
There is nothing sneaky about this. It is simply how FPC avoids linking
against the C library (or, with fpwidestring instead of cwstring,
pulling in quite a load of Unicode data) when much code doesn't need it,
just like the need to use unit cthreads on *nix systems to install the
threading manager.
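A minimal sketch of what that looks like in practice, assuming a
Unix-like target where unit cwstring is available (the program name and
the string content are made up for illustration). The implicit
UTF8String to UnicodeString assignment is the kind of conversion that
goes through the installed widestring manager:

```pascal
program tconv;
{$mode objfpc}{$H+}
{$codepage utf8}

{$ifdef unix}
uses
  cwstring;  { installs the libc-based widestring manager at startup }
{$endif}

var
  u8: UTF8String;
  u16: UnicodeString;
begin
  u8 := 'Grüße';
  { Implicit UTF-8 -> UTF-16 conversion, done by the widestring manager. }
  u16 := u8;
  Writeln('UTF-8 bytes:       ', Length(u8));
  Writeln('UTF-16 code units: ', Length(u16));
end.
```

Without the unit the program still compiles on such targets, but the
conversion of non-ASCII content is not guaranteed to be correct.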
> Using the term “char” was maybe a mistake. This misleads people into
> thinking it’s a “character” as they perceive it but really it’s just a
> code point.
There isn't much choice here, because that type name dates from old
Pascal times and will not change (well, okay, it will change insofar as
Char = WideChar once the Unicode RTL is enabled, instead of
Char = AnsiChar as it is now).
> Why isn’t there a “UnicodeChar” type which is 4 bytes and hold a full
> UTF-8 character?
There is, it's called UCS4Char. Also it's not a "full UTF-8 character",
but simply a "Unicode code point".
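As a small sketch of the type in use (not the only way to do it): the
System unit's UnicodeStringToUCS4String converts a UTF-16 string into a
UCS4String, i.e. a dynamic array of UCS4Char that ends with a zero
terminator. The program name is made up for illustration.

```pascal
program tucs4;
{$mode objfpc}{$H+}
{$codepage utf8}

var
  s: UnicodeString;
  u4: UCS4String;
  i: SizeInt;
begin
  s := 'Hello, 🌎!';
  u4 := UnicodeStringToUCS4String(s);
  Writeln('SizeOf(UCS4Char) = ', SizeOf(UCS4Char));
  { The last element is the zero terminator, so stop one short of it. }
  for i := 0 to High(u4) - 1 do
    Writeln(HexStr(Ord(u4[i]), 8));
end.
```

Each element holds one code point, so the globe occupies a single
UCS4Char even though it needs two WideChars in the UnicodeString.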
> Choosing UTF-16 for UnicodeString was probably a mistake too.
Take that up with Borland, because they named it "UnicodeString". That
is mainly because they only had to deal with Windows compatibility,
where there were either the single-byte encodings or the UTF-16 encoding.
> It’s my understanding all websites are UTF-8 which means this encoding
> will dominate everything. I think UTF-8 is by far the most used right?
UTF-8 is usually used for storing and exchanging text, because it is the
most compact Unicode encoding for ASCII-heavy content. However, many
languages and runtimes, including JavaScript, Java's JVM, the .NET CLR,
Windows, Qt, UEFI and Delphi >= 2009, use UTF-16 internally.
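Which encoding is smaller depends on the text. A rough sketch that
compares the byte counts for an ASCII string and a CJK string; the
helper names Utf8ByteCount, Utf16ByteCount and Report are hypothetical
and exist only for this example:

```pascal
program tdensity;
{$mode objfpc}{$H+}
{$codepage utf8}

{ Hypothetical helpers, named here for illustration only. }
function Utf8ByteCount(const aStr: UnicodeString): SizeInt;
begin
  Result := Length(UTF8Encode(aStr));
end;

function Utf16ByteCount(const aStr: UnicodeString): SizeInt;
begin
  Result := Length(aStr) * SizeOf(WideChar);
end;

procedure Report(const aLabel: AnsiString; const aStr: UnicodeString);
begin
  Writeln(aLabel, ': UTF-8 = ', Utf8ByteCount(aStr),
    ' bytes, UTF-16 = ', Utf16ByteCount(aStr), ' bytes');
end;

begin
  Report('ASCII', 'Hello');  { 5 bytes in UTF-8, 10 bytes in UTF-16 }
  Report('CJK', '日本語');   { 9 bytes in UTF-8, 6 bytes in UTF-16 }
end.
```

ASCII characters take one byte in UTF-8 and two in UTF-16, while these
CJK characters take three bytes in UTF-8 and two in UTF-16.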
> As a user I would expect to take a string constant and assigning it to
> a UnicodeString would let me iterate over UnicodeChar. That’s logical
> right? Maybe this is just left undone as of now. I don’t know.
>
> var
>   u: UnicodeChar;
>   s: UnicodeString;
> begin
>   s := 'Hello, 🌎!';
>   for u in s do
>     writeln(u);
Here you go:
=== code begin ===
program tstrenum;

{$codepage utf8}
{$mode objfpc}{$H+}
{$modeswitch advancedrecords}

type
  { Enumerates a UTF-16 UnicodeString by code point, not by code unit. }
  TUCS4CharUnicodeStrEnumerator = record
  private
    fStr: UnicodeString;
    fIndex: SizeInt;
    fCurrent: UCS4Char;
  public
    constructor Create(const aStr: UnicodeString);
    function MoveNext: Boolean;
    property Current: UCS4Char read fCurrent;
  end;

constructor TUCS4CharUnicodeStrEnumerator.Create(const aStr: UnicodeString);
begin
  fStr := aStr;
  { Strings are 1-based and MoveNext increments first, so start at 0. }
  fIndex := 0;
  fCurrent := 0;
end;

function TUCS4CharUnicodeStrEnumerator.MoveNext: Boolean;
begin
  Inc(fIndex);
  if fIndex > Length(fStr) then
    Exit(False);
  if (Ord(fStr[fIndex]) >= $D800) and (Ord(fStr[fIndex]) <= $DBFF) then
  begin
    { High surrogate: combine it with a following low surrogate, if any. }
    if fIndex < Length(fStr) then
    begin
      if (Ord(fStr[fIndex + 1]) >= $DC00) and (Ord(fStr[fIndex + 1]) <= $DFFF) then
      begin
        fCurrent := (UCS4Char(Ord(fStr[fIndex]) - $D800) shl 10) +
          UCS4Char(Ord(fStr[fIndex + 1])) - $DC00 + $10000;
        { Skip the low surrogate that was just consumed. }
        Inc(fIndex);
      end
      else
        { Unpaired high surrogate: return it unchanged. }
        fCurrent := Ord(fStr[fIndex]);
    end
    else
      { High surrogate at the very end of the string. }
      fCurrent := Ord(fStr[fIndex]);
  end
  else
    fCurrent := Ord(fStr[fIndex]);
  Result := True;
end;

{ Makes "for u in s" over a UnicodeString yield UCS4Char values. }
operator Enumerator(const aStr: UnicodeString): TUCS4CharUnicodeStrEnumerator;
begin
  Result := TUCS4CharUnicodeStrEnumerator.Create(aStr);
end;

var
  s: UnicodeString;
  u: UCS4Char;
begin
  s := 'Hello, 🌎!';
  for u in s do
    Writeln(HexStr(Ord(u), 8));
end.
=== code end ===
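To make the surrogate arithmetic in MoveNext concrete, here is the same
formula applied to the globe, U+1F30E, which UTF-16 stores as the pair
D83C DF0E. This is only a sketch; DecodeSurrogatePair is a hypothetical
helper name, not an RTL function.

```pascal
program tsurrogate;
{$mode objfpc}{$H+}

{ Hypothetical helper: combines a high and a low surrogate into one
  code point. The caller must pass a valid pair; nothing is checked. }
function DecodeSurrogatePair(aHigh, aLow: WideChar): UCS4Char;
begin
  Result := ((Ord(aHigh) - $D800) shl 10) + (Ord(aLow) - $DC00) + $10000;
end;

begin
  { ($D83C - $D800) shl 10 = $F000
    $DF0E - $DC00          = $030E
    $F000 + $030E + $10000 = $1F30E }
  Writeln(HexStr(Ord(DecodeSurrogatePair(WideChar($D83C), WideChar($DF0E))), 8));
  { prints 0001F30E }
end.
```

The other characters in 'Hello, 🌎!' are single code units, so the
enumerator passes them through unchanged.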
Regards,
Sven