[fpc-pascal] Parse unicode scalar

Tue Jul 4 04:58:44 CEST 2023

On 7/4/23 04:03, Hairy Pixels via fpc-pascal wrote:
>
>> On Jul 4, 2023, at 1:15 AM, Mattias Gaertner via fpc-pascal <fpc-pascal at lists.freepascal.org> wrote:
>>
>> function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;
>> // returns the number of codepoints
>> var
>>   CodePointLen: longint;
>>   CodePoint: longword;
>> begin
>>   Result:=0;
>>   while (ByteCount>0) do begin
>>     inc(Result);
>>     CodePoint:=UTF8CodepointToUnicode(p,CodePointLen);
>>     // ...do something with the CodePoint...
>>     inc(p,CodePointLen);
>>     dec(ByteCount,CodePointLen);
>>   end;
>> end;
> Thanks, this looks right. I guess this is how we need to iterate over unicode now.
>
> Btw, why isn't there a for-loop we can use over unicode strings? seems like that should be supported out of the box. I had this same problem in Swift also where it's extremely confusing to merely iterate over a string and look at each character. Replacing characters will be tricky also so we need some good library functions.

You're still confusing the Unicode terms. The above code iterates over 
Unicode Code Points, not "characters" in a UTF-8 encoded string. A 
Unicode Code Point is not a "character":

https://unicode.org/glossary/#character

https://unicode.org/glossary/#code_point

There are also graphemes, grapheme clusters and extended grapheme
clusters. Any of these can also be what a user perceives as a "character".

https://unicode.org/glossary/#grapheme

https://unicode.org/glossary/#grapheme_cluster

https://unicode.org/glossary/#extended_grapheme_cluster

If you want to iterate over extended grapheme clusters, for example, 
there's an iterator (written by me) in the unit graphemebreakproperty.pp 
in the rtl-unicode package.

If you use the 'char' type in Pascal to iterate over a UTF-8 encoded
string, you're iterating over Unicode code units (units! not code
points! https://unicode.org/glossary/#code_unit).

If you use the 'widechar' type in Pascal to iterate over a UnicodeString 
(which is a UTF-16 encoded string), you're also iterating over Unicode 
code units, however this time in UTF-16 encoding.
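
A minimal sketch of these two cases, assuming {$mode objfpc}{$H+} so that
'char' is the 8-bit AnsiChar. The strings are built from explicit code
unit values so the result does not depend on the source file encoding.
The program and variable names are only illustrative.

```pascal
program CodeUnitsDemo;
{$mode objfpc}{$H+}

var
  U8: RawByteString;   // holds UTF-8 code units (bytes)
  U16: UnicodeString;  // holds UTF-16 code units (16-bit words)
  C: Char;
  W: WideChar;
  N8, N16: Integer;
begin
  // The single code point U+1F600 encoded in UTF-8: F0 9F 98 80
  SetLength(U8, 4);
  U8[1] := Chr($F0);
  U8[2] := Chr($9F);
  U8[3] := Chr($98);
  U8[4] := Chr($80);

  // The same code point in UTF-16: surrogate pair D83D DE00
  SetLength(U16, 2);
  U16[1] := WideChar($D83D);
  U16[2] := WideChar($DE00);

  // Iterating with 'char' visits each UTF-8 code unit
  N8 := 0;
  for C in U8 do
  begin
    Write(HexStr(Ord(C), 2), ' ');
    Inc(N8);
  end;
  WriteLn('-> ', N8, ' UTF-8 code units');

  // Iterating with 'widechar' visits each UTF-16 code unit
  N16 := 0;
  for W in U16 do
  begin
    Write(HexStr(Ord(W), 4), ' ');
    Inc(N16);
  end;
  WriteLn('-> ', N16, ' UTF-16 code units');
end.
```

Both loops walk over the same single code point, yet the first one runs
4 times and the second one runs 2 times, and neither of them ever sees
the value $1F600.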

If you want to iterate over Unicode code points (not units! not 
characters! not graphemes!) in a UTF-8 string, you need something like 
the ReadUTF8 function above. If you want to iterate over Unicode code 
points in a UTF-16 string, you need different code.
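
For the UTF-16 case, a rough sketch of what such "different code" could
look like is given here. NextCodePointUTF16 is a hypothetical helper
written for this example, not an RTL function. It combines a high and a
low surrogate into one code point and, as an assumption of this sketch,
passes unpaired surrogates through unchanged instead of reporting an
error.

```pascal
program Utf16CodePoints;
{$mode objfpc}{$H+}

// Hypothetical helper: returns the code point that starts at S[Index]
// and advances Index past the code units it consumed.
function NextCodePointUTF16(const S: UnicodeString;
  var Index: SizeInt): LongWord;
var
  HighUnit, LowUnit: Word;
begin
  HighUnit := Ord(S[Index]);
  Inc(Index);
  // A high surrogate followed by a low surrogate forms one code point
  if (HighUnit >= $D800) and (HighUnit <= $DBFF) and
     (Index <= Length(S)) then
  begin
    LowUnit := Ord(S[Index]);
    if (LowUnit >= $DC00) and (LowUnit <= $DFFF) then
    begin
      Inc(Index);
      Exit($10000 + ((LongWord(HighUnit) - $D800) shl 10) +
        (LongWord(LowUnit) - $DC00));
    end;
  end;
  // BMP code point, or an unpaired surrogate passed through as-is
  Result := HighUnit;
end;

var
  S: UnicodeString;
  I: SizeInt;
  Count: Integer;
  CP: LongWord;
begin
  // 'a' (U+0061) followed by U+1F600 as the surrogate pair D83D DE00
  SetLength(S, 3);
  S[1] := WideChar($0061);
  S[2] := WideChar($D83D);
  S[3] := WideChar($DE00);

  I := 1;
  Count := 0;
  while I <= Length(S) do
  begin
    CP := NextCodePointUTF16(S, I);
    WriteLn('U+', HexStr(LongInt(CP), 6));
    Inc(Count);
  end;
  WriteLn(Length(S), ' code units, ', Count, ' code points');
end.
```

The last line prints "3 code units, 2 code points". Note that this still
iterates over code points only, not over graphemes or grapheme clusters.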

You need to understand all these terms and know exactly what you need to
do. E.g. are you dealing with keyboard input, are you dealing with the
low level parts of text display, are you searching for something in the
text, or are you just passing strings around and letting the GUI deal
with it? These are all different use cases, and each requires a careful
understanding of which Unicode element you need to iterate over.

Nikolay