[fpc-pascal] Parse unicode scalar

Tue Jul 4 06:28:47 CEST 2023

On 7/4/23 07:17, Hairy Pixels via fpc-pascal wrote:
>
>> On Jul 4, 2023, at 9:58 AM, Nikolay Nikolov via fpc-pascal <fpc-pascal at lists.freepascal.org> wrote:
>>
>> You need to understand all these terms and know exactly what you need to do. E.g. are you dealing with keyboard input, are you dealing with the low level parts of text display, are you searching for something in the text, are you just passing strings around and letting the GUI deal with it? These are all different use cases, and they require a careful understanding of which Unicode unit you need to iterate over.
> Thanks for trying to help but this is more complicated than I thought and I don't have the patience for a deep dive right now :)
>
> Unicode is complicated under the hood, but we should have some libraries to help, right? I mean, the user thinks of these things as "characters", be it "A" or the Unicode symbol 👍, so we should be able to operate on that basis as well. Something like an iterator that returns the character (wide char) and byte offset or writing would be a nice place to start.
>
> I have a parser/tokenizer I want to update so I'm trying to find tokens by advancing one character at a time. That's why I have a requirement to know which character is next in the file and probably the byte offset also so it can be referenced later.

For what grammar? What characters are allowed in a token? For example, 
Free Pascal also has a parser/tokenizer, but since Pascal keywords are 
ASCII only, it doesn't need to understand Unicode characters, so it 
works on the byte (Pascal's char type) level (for UTF-8 files, this 
means UTF-8 Unicode code units). That's because UTF-8 has two nice 
properties:

1) ASCII characters are encoded as themselves, each as a single byte in 
the range #0..#127

2) non-ASCII characters always use a sequence of bytes that are all in 
the range #128..#255 (they have their highest bit set), so they will 
never be misinterpreted as ASCII.

So, the tokenizer simply treats UTF-8 like any other 8-bit code page.
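
To illustrate, here is a minimal sketch of a byte-level scanner that finds 
ASCII identifiers in a UTF-8 buffer and reports their byte offsets. It is 
not taken from the Free Pascal sources; NextIdentifier, IdentStart and 
IdentChars are hypothetical names made up for this example, and the 
grammar (identifiers only) is an assumption. The point is only that bytes 
in #128..#255 can be skipped one at a time without ever being mistaken 
for an ASCII letter.

```pascal
program ByteTokenizerSketch;
{$mode objfpc}{$H+}

const
  { Hypothetical token grammar: ASCII-only identifiers. }
  IdentStart = ['A'..'Z', 'a'..'z', '_'];
  IdentChars = IdentStart + ['0'..'9'];

{ Finds the next identifier at or after byte Idx (1-based).
  On success, Start is the byte offset of its first char and Idx points
  just past its last char. Hypothetical helper, for illustration only. }
function NextIdentifier(const Src: RawByteString; var Idx: SizeInt;
  out Start: SizeInt): Boolean;
begin
  Start := 0;
  { Skip every byte that cannot start an identifier. This includes the
    bytes #128..#255 of multi-byte UTF-8 sequences, which are never in
    the ASCII range and so never match IdentStart. }
  while (Idx <= Length(Src)) and not (Src[Idx] in IdentStart) do
    Inc(Idx);
  Result := Idx <= Length(Src);
  if not Result then
    Exit;
  Start := Idx;
  while (Idx <= Length(Src)) and (Src[Idx] in IdentChars) do
    Inc(Idx);
end;

var
  Src: RawByteString;
  Idx, Start: SizeInt;
begin
  { x := '<U+1F44D>' + foo;  with the emoji given as its 4 UTF-8 bytes }
  Src := 'x := '''#$F0#$9F#$91#$8D''' + foo;';
  Idx := 1;
  while NextIdentifier(Src, Idx, Start) do
    WriteLn(Start, ': ', Copy(Src, Start, Idx - Start));
end.
```

For the sample line this should report "x" at byte offset 1 and "foo" at 
byte offset 15: the four bytes of the emoji are counted in the offset but 
are never decoded. A real tokenizer would also have to recognise string 
literals, comments and so on; if non-ASCII characters are allowed inside 
tokens, the same loop works as long as the token boundaries are ASCII.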

Nikolay