[fpc-pascal] Parse unicode scalar
Nikolay Nikolov
nickysn at gmail.com
Tue Jul 4 06:28:47 CEST 2023
On 7/4/23 07:17, Hairy Pixels via fpc-pascal wrote:
>
>> On Jul 4, 2023, at 9:58 AM, Nikolay Nikolov via fpc-pascal <fpc-pascal at lists.freepascal.org> wrote:
>>
>> You need to understand all these terms and know exactly what you need to do. E.g. are you dealing with keyboard input, are you dealing with the low level parts of text display, are you searching for something in the text, are you just passing strings around and letting the GUI deal with it? These are all different use cases, and they require a careful understanding of which Unicode unit you need to iterate over.
> Thanks for trying to help but this is more complicated than I thought and I don't have the patience for a deep dive right now :)
>
> Unicode is complicated under the hood, but we should have some libraries to help, right? I mean the user thinks of these things as "characters", be it "A" or the unicode symbol 👍, so we should be able to operate on that basis as well. Something like an iterator that returns the character (wide char) and byte offset or writing would be a nice place to start.
>
> I have a parser/tokenizer I want to update so I'm trying to find tokens by advancing one character at a time. That's why I have a requirement to know which character is next in the file and probably the byte offset also so it can be referenced later.
For what grammar? What characters are allowed in a token? For example,
Free Pascal also has a parser/tokenizer, but since Pascal keywords are
ASCII only, it doesn't need to understand Unicode characters, so it
works at the byte (Pascal's char type) level (for UTF-8 files, this
means UTF-8 code units). That's possible because UTF-8 has two nice
properties:
1) ASCII characters are encoded as themselves, as single bytes in the
range #0..#127
2) non-ASCII characters are always encoded as a sequence of bytes that
are all in the range #128..#255 (they have their highest bit set), so
they will never be misinterpreted as ASCII.
So the tokenizer just treats UTF-8 like any other 8-bit code page.
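
To illustrate the two properties above, here is a minimal sketch of a
byte-level scanner. It is not taken from the FPC compiler sources;
NextIdent and the sample string are hypothetical names made up for this
example, and it assumes tokens are plain ASCII identifiers. The
non-ASCII characters in the input are written as byte escapes so the
example does not depend on the source file encoding.

```pascal
program ByteTokenizer;
{$mode objfpc}{$H+}

{ Hypothetical helper: starting at the 1-based byte index StartPos, find
  the next ASCII identifier in S. Returns False when there is none.
  Bytes in #128..#255 never match the ASCII sets below, so every byte of
  a multi-byte UTF-8 sequence is skipped without being decoded. }
function NextIdent(const S: AnsiString; StartPos: Integer;
  out TokStart, TokLen: Integer): Boolean;
var
  i: Integer;
begin
  i := StartPos;
  { skip bytes that cannot start an identifier }
  while (i <= Length(S)) and not (S[i] in ['A'..'Z', 'a'..'z', '_']) do
    Inc(i);
  TokStart := i;
  { consume the identifier itself }
  while (i <= Length(S)) and
        (S[i] in ['A'..'Z', 'a'..'z', '0'..'9', '_']) do
    Inc(i);
  TokLen := i - TokStart;
  Result := TokLen > 0;
end;

var
  Src: AnsiString;
  P, Start, Len: Integer;
begin
  { U+00F1 (2 bytes), ' begin ', U+20AC (3 bytes), ' end' as UTF-8 }
  Src := #$C3#$B1' begin '#$E2#$82#$AC' end';
  P := 1;
  while NextIdent(Src, P, Start, Len) do
  begin
    WriteLn(Copy(Src, Start, Len), ' at byte offset ', Start);
    P := Start + Len;
  end;
end.
```

With this input the scanner should report "begin" at byte offset 4 and
"end" at byte offset 14 (1-based), which are byte offsets, not
character counts. If your grammar allows non-ASCII characters inside
tokens, this approach is not enough and you would have to decode code
points instead.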
Nikolay