[fpc-pascal] Parse unicode scalar
Nikolay Nikolov
nickysn at gmail.com
Tue Jul 4 04:58:44 CEST 2023
On 7/4/23 04:03, Hairy Pixels via fpc-pascal wrote:
>
>> On Jul 4, 2023, at 1:15 AM, Mattias Gaertner via fpc-pascal <fpc-pascal at lists.freepascal.org> wrote:
>>
>> function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;
>> // returns the number of codepoints
>> var
>>   CodePointLen: longint;
>>   CodePoint: longword;
>> begin
>>   Result:=0;
>>   while (ByteCount>0) do begin
>>     inc(Result);
>>     CodePoint:=UTF8CodepointToUnicode(p,CodePointLen);
>>     ...do something with the CodePoint...
>>     inc(p,CodePointLen);
>>     dec(ByteCount,CodePointLen);
>>   end;
>> end;
> Thanks, this looks right. I guess this is how we need to iterate over unicode now.
>
> Btw, why isn't there a for-loop we can use over unicode strings? Seems like that should be supported out of the box. I had this same problem in Swift also, where it's extremely confusing to merely iterate over a string and look at each character. Replacing characters will be tricky also, so we need some good library functions.

You're still confusing the Unicode terms. The above code iterates over
Unicode Code Points, not "characters" in a UTF-8 encoded string. A
Unicode Code Point is not a "character":

https://unicode.org/glossary/#character
https://unicode.org/glossary/#code_point

There are also graphemes, grapheme clusters and extended grapheme
clusters - these terms can also be perceived as "characters".

https://unicode.org/glossary/#grapheme
https://unicode.org/glossary/#grapheme_cluster
https://unicode.org/glossary/#extended_grapheme_cluster

If you want to iterate over extended grapheme clusters, for example,
there's an iterator (written by me) in the unit graphemebreakproperty.pp
in the rtl-unicode package.

If you use the 'char' type in Pascal to iterate over a UTF-8 encoded
string, you're iterating over Unicode code units (units! not code
points! https://unicode.org/glossary/#code_unit).

If you use the 'widechar' type in Pascal to iterate over a UnicodeString
(which is a UTF-16 encoded string), you're also iterating over Unicode
code units, however this time in UTF-16 encoding.
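
To make the code unit / code point distinction concrete, here is a small
sketch. It is only an illustration: the sample text and the names
Utf8Bytes, S and U16 are made up for this example. It stores the same
three code points (U+0061, U+00E9 and U+1F600) once as UTF-8 and once as
UTF-16, and prints the code units that a 'char' or 'widechar' loop
would see.

```pascal
program CodeUnits;
{$mode objfpc}{$H+}

const
  // UTF-8 encoding of U+0061, U+00E9, U+1F600, given as raw bytes so the
  // example does not depend on the encoding of the source file.
  Utf8Bytes: array[0..6] of Byte = ($61, $C3, $A9, $F0, $9F, $98, $80);

var
  S: RawByteString;    // UTF-8: one code unit = one byte ('char')
  U16: UnicodeString;  // UTF-16: one code unit = 16 bits ('widechar')
  i: Integer;

begin
  SetLength(S, Length(Utf8Bytes));
  Move(Utf8Bytes[0], S[1], Length(Utf8Bytes));

  // The same three code points in UTF-16; U+1F600 needs a surrogate pair.
  SetLength(U16, 4);
  U16[1] := WideChar($0061);
  U16[2] := WideChar($00E9);
  U16[3] := WideChar($D83D);  // high surrogate
  U16[4] := WideChar($DE00);  // low surrogate

  // Indexing with 'char' walks UTF-8 code units (bytes).
  Write('UTF-8 code units (', Length(S), '):');
  for i := 1 to Length(S) do
    Write(' ', HexStr(Ord(S[i]), 2));
  WriteLn;

  // Indexing with 'widechar' walks UTF-16 code units.
  Write('UTF-16 code units (', Length(U16), '):');
  for i := 1 to Length(U16) do
    Write(' ', HexStr(Ord(U16[i]), 4));
  WriteLn;
end.
```

Both strings hold three code points, but Length reports 7 for the UTF-8
string and 4 for the UTF-16 one, because Length counts code units.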

If you want to iterate over Unicode code points (not units! not
characters! not graphemes!) in a UTF-8 string, you need something like
the ReadUTF8 function above. If you want to iterate over Unicode code
points in a UTF-16 string, you need different code, because code points
above U+FFFF are stored there as a surrogate pair, i.e. two code units.
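
For the UTF-16 case, a minimal sketch of such code could look like the
following. It is not taken from the RTL; NextCodePoint and
CountCodePoints are hypothetical helper names, and the handling of
unpaired surrogates (returned unchanged, as one unit) is just one
possible choice.

```pascal
program Utf16CodePoints;
{$mode objfpc}{$H+}

// Decodes the code point starting at 1-based index i of s and reports in
// UnitLen how many UTF-16 code units it occupies (1 or 2). An unpaired
// surrogate is returned as-is with UnitLen = 1; real code may prefer to
// substitute U+FFFD instead.
function NextCodePoint(const s: UnicodeString; i: SizeInt;
  out UnitLen: SizeInt): LongWord;
var
  HighUnit, LowUnit: LongWord;
begin
  HighUnit := Ord(s[i]);
  Result := HighUnit;
  UnitLen := 1;
  if (HighUnit >= $D800) and (HighUnit <= $DBFF) and (i < Length(s)) then
  begin
    LowUnit := Ord(s[i + 1]);
    if (LowUnit >= $DC00) and (LowUnit <= $DFFF) then
    begin
      // Combine the surrogate pair into one supplementary code point.
      Result := $10000 + ((HighUnit - $D800) shl 10) + (LowUnit - $DC00);
      UnitLen := 2;
    end;
  end;
end;

// Returns the number of code points (not code units) in s.
function CountCodePoints(const s: UnicodeString): SizeInt;
var
  i, UnitLen: SizeInt;
begin
  Result := 0;
  i := 1;
  while i <= Length(s) do
  begin
    NextCodePoint(s, i, UnitLen);
    // ...do something with the code point here...
    Inc(i, UnitLen);
    Inc(Result);
  end;
end;

var
  U16: UnicodeString;
  UnitLen: SizeInt;

begin
  // U+0061, U+00E9, U+1F600 (the last one as a surrogate pair).
  SetLength(U16, 4);
  U16[1] := WideChar($0061);
  U16[2] := WideChar($00E9);
  U16[3] := WideChar($D83D);
  U16[4] := WideChar($DE00);

  WriteLn('code units:  ', Length(U16));
  WriteLn('code points: ', CountCodePoints(U16));
  WriteLn('third code point: U+', HexStr(NextCodePoint(U16, 3, UnitLen), 5));
end.
```

Note that this still iterates over code points only. It says nothing
about graphemes or grapheme clusters, which need the Unicode
segmentation rules on top of it.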

You need to understand all these terms and know exactly what you need to
do. E.g. are you dealing with keyboard input, are you dealing with the
low level parts of text display, are you searching for something in the
text, are you just passing strings around and letting the GUI deal with
it? These are all different use cases, and they require a careful
understanding of which Unicode entity you need to iterate over.

Nikolay