[fpc-pascal] Parse unicode scalar

Nikolay Nikolov nickysn at gmail.com
Tue Jul 4 07:08:26 CEST 2023


On 7/4/23 07:56, Hairy Pixels via fpc-pascal wrote:
>
>> On Jul 4, 2023, at 11:50 AM, Hairy Pixels <genericptr at gmail.com> wrote:
>>
>> You know you're right, with properly enclosed patterns you can capture everything inside and it works. You won't know if you had unicode in your string or not though but that depends on what's being parsed and if you care or not (I'm doing a TOML parser).
> Sorry I'm still curious even though it's not my current problem :)
>
> How can I make this program output the expected results:
>
> var
>   w: widechar;
>   a: array of widechar;
> begin
>   for w in 'abc🐻' do
>     a += [w];
>   // Outputs 7 instead of 4
>   writeln(length(a));
> end;
>
> The user doesn't know about unicode they just want to get an array of characters and not worry about all these little details. What can FPC do to solve this problem?

That depends on what you need, but I suppose in this case you want to 
count the number of extended grapheme clusters (a.k.a. "user-perceived 
characters": how many character-like things are displayed on the 
screen). You might be tempted to count the number of Unicode code 
points instead, but that's not the same thing, due to the existence of 
combining characters:

https://en.wikipedia.org/wiki/Combining_character
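
To make the distinction concrete, here is a minimal sketch that counts 
code points in a UTF-16 UnicodeString by folding surrogate pairs. The 
function name CountCodePoints and the test strings are illustrative 
assumptions, not part of the RTL, and the strings are built from 
explicit code units so the result does not depend on the source file's 
code page. It does not do grapheme cluster segmentation.

```pascal
program CountCodePointsDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: counts Unicode code points in a UTF-16 string by
// treating each high+low surrogate pair as a single code point.
function CountCodePoints(const S: UnicodeString): SizeInt;
var
  I: SizeInt;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if (Ord(S[I]) >= $D800) and (Ord(S[I]) <= $DBFF) and
       (I < Length(S)) and
       (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
      Inc(I, 2)   // surrogate pair: two code units, one code point
    else
      Inc(I);     // BMP character (or an unpaired surrogate)
    Inc(Result);
  end;
end;

var
  Bear, Accented: UnicodeString;
begin
  // 'abc' followed by U+1F43B (bear face) as the surrogate pair D83D DC3B.
  Bear := 'abc';
  Bear := Bear + WideChar($D83D) + WideChar($DC3B);

  // 'e' followed by U+0301 (combining acute accent): displayed as one
  // character, but made of two code points.
  Accented := 'e';
  Accented := Accented + WideChar($0301);

  WriteLn(Length(Bear), ' code units, ', CountCodePoints(Bear), ' code points');
  WriteLn(Length(Accented), ' code units, ', CountCodePoints(Accented), ' code points');
end.
```

The first string should report 5 code units and 4 code points, which 
matches the expected 4. The second should report 2 and 2, even though 
only one character is displayed, and that gap is why code point 
counting is not enough.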

For extended grapheme clusters, there's an iterator in the 
graphemebreakproperty unit. I implemented this for the Unicode KVM and 
FreeVision, where it's needed for figuring out how many character cells 
in the console are needed to display a given string. For the console, 
or for GUIs that use fixed-width fonts, there's also the East Asian 
Width property: some characters (East Asian: Chinese, Japanese, Korean) 
take double the space. So, to figure out where to move the cursor, you 
need to take East Asian Width into account as well.
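
As a rough illustration of the width calculation, the sketch below sums 
console columns over a list of code points. Everything in it is an 
assumption made for the example: IsWide and ColumnWidth are 
hypothetical names, the wide ranges are a hand-picked approximation 
rather than the real East Asian Width table, and only the basic 
combining diacritical marks block is treated as zero width. It does not 
use the graphemebreakproperty unit.

```pascal
program ColumnWidthDemo;
{$mode objfpc}{$H+}

// Rough approximation only: a few well-known wide ranges, NOT the full
// East Asian Width table from the Unicode Character Database.
function IsWide(CP: Cardinal): Boolean;
begin
  Result :=
    ((CP >= $1100) and (CP <= $115F)) or      // Hangul Jamo initial consonants
    ((CP >= $2E80) and (CP <= $A4CF) and
     (CP <> $303F)) or                        // CJK radicals through Yi
    ((CP >= $AC00) and (CP <= $D7A3)) or      // Hangul syllables
    ((CP >= $F900) and (CP <= $FAFF)) or      // CJK compatibility ideographs
    ((CP >= $FF00) and (CP <= $FF60)) or      // fullwidth forms
    ((CP >= $FFE0) and (CP <= $FFE6)) or      // fullwidth signs
    ((CP >= $20000) and (CP <= $3FFFD));      // supplementary ideographs
end;

// Hypothetical helper: number of console columns for a code point sequence.
function ColumnWidth(const CodePoints: array of Cardinal): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := Low(CodePoints) to High(CodePoints) do
  begin
    // Combining diacritical marks (U+0300..U+036F) attach to the
    // previous character and take no column of their own.
    if (CodePoints[I] >= $0300) and (CodePoints[I] <= $036F) then
      Continue;
    if IsWide(CodePoints[I]) then
      Inc(Result, 2)
    else
      Inc(Result);
  end;
end;

begin
  // 'a' followed by U+65E5 and U+672C (two CJK ideographs)
  WriteLn(ColumnWidth([$61, $65E5, $672C]));
  // 'e' followed by U+0301 (combining acute accent)
  WriteLn(ColumnWidth([$65, $0301]));
end.
```

With these inputs the program should print 5 and then 1. A real 
implementation would look up the actual Unicode property data instead 
of hard-coded ranges.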

Nikolay


