[fpc-pascal] Parse unicode scalar

Hairy Pixels genericptr at gmail.com
Tue Jul 4 06:50:58 CEST 2023



> On Jul 4, 2023, at 11:45 AM, Nikolay Nikolov via fpc-pascal <fpc-pascal at lists.freepascal.org> wrote:
> 
> But you just don't need to do this, in order to tokenize Pascal. The beginning and the end of the string literal is the apostrophe, which is ASCII. The bear is a sequence of UTF-8 code units (opaque to the compiler), that will not be mistaken for an apostrophe, or end of line, because they will have their high bit set. There's simply no need for a Pascal tokenizer to iterate over UTF-8 code points, instead of code units.

You know, you're right: with properly delimited patterns you can capture everything inside and it works. You won't know whether the string contained Unicode, though. Whether that matters depends on what's being parsed (I'm doing a TOML parser).
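
To make the point above concrete, here is a minimal sketch of scanning a quoted literal one byte (code unit) at a time. It is written in Python only to keep it short; the same loop ports directly to Pascal. The names scan_string_literal and has_non_ascii are hypothetical, and it assumes a single-quoted literal with no escape handling (so the Pascal '' doubling and TOML escapes are not covered).

```python
from typing import Tuple

APOSTROPHE = 0x27  # ASCII ', can never appear inside a multi-byte UTF-8 sequence


def scan_string_literal(src: bytes, start: int) -> Tuple[bytes, int]:
    """Return (raw body bytes, index just past the closing apostrophe).

    Works on code units only: every byte of a multi-byte UTF-8 sequence
    has its high bit set, so it cannot compare equal to ' or a line break.
    """
    if start >= len(src) or src[start] != APOSTROPHE:
        raise ValueError("expected opening apostrophe")
    i = start + 1
    while i < len(src):
        b = src[i]
        if b == APOSTROPHE:
            return src[start + 1:i], i + 1
        if b in (0x0A, 0x0D):  # LF or CR ends the line before the literal closed
            raise ValueError("unterminated string literal")
        i += 1
    raise ValueError("unterminated string literal")


def has_non_ascii(body: bytes) -> bool:
    """True if any byte has the high bit set, i.e. the body is not pure ASCII."""
    return any(b >= 0x80 for b in body)
```

If you do care whether the literal contained anything beyond ASCII, a high-bit check like has_non_ascii on the captured bytes answers that without decoding code points.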

Maybe I can skip that part and just focus on decoding the Unicode scalars.
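
For reference, decoding a single scalar from UTF-8 bytes could look roughly like the sketch below. This assumes "decoding" here means turning a UTF-8 byte sequence into a code point; decode_scalar is a hypothetical name, and Python is used only for brevity. It rejects overlong forms, surrogates and values above U+10FFFF, which is what makes the result a Unicode scalar value rather than just any code point.

```python
from typing import Tuple


def decode_scalar(buf: bytes, i: int) -> Tuple[int, int]:
    """Decode one UTF-8 sequence at buf[i]; return (scalar value, next index)."""
    b0 = buf[i]
    if b0 < 0x80:  # single-byte ASCII
        return b0, i + 1
    if 0xC2 <= b0 <= 0xDF:  # 2-byte lead (0xC0/0xC1 would be overlong)
        need, cp, lowest = 1, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 <= 0xEF:  # 3-byte lead
        need, cp, lowest = 2, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 <= 0xF4:  # 4-byte lead (above 0xF4 exceeds U+10FFFF)
        need, cp, lowest = 3, b0 & 0x07, 0x10000
    else:
        raise ValueError("invalid UTF-8 lead byte")
    if i + need >= len(buf):
        raise ValueError("truncated UTF-8 sequence")
    for k in range(1, need + 1):
        c = buf[i + k]
        if (c & 0xC0) != 0x80:  # continuation bytes look like 10xxxxxx
            raise ValueError("invalid UTF-8 continuation byte")
        cp = (cp << 6) | (c & 0x3F)
    # Reject overlong encodings, surrogates, and out-of-range values.
    if cp < lowest or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    return cp, i + need + 1
```

Calling it in a loop, feeding the returned index back in, walks a buffer scalar by scalar.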

Regards,
Ryan Joseph


