[fpc-pascal] Parse unicode scalar

Hairy Pixels genericptr at gmail.com
Tue Jul 4 06:40:45 CEST 2023



> On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal <fpc-pascal at lists.freepascal.org> wrote:
> 
> For what grammar? What characters are allowed in a token? For example, Free Pascal also has a parser/tokenizer, but since Pascal keywords are ASCII only, it doesn't need to understand Unicode characters, so it works on the byte (Pascal's char type) level (for UTF-8 files, this means UTF-8 Unicode code units). That's because UTF-8 has two nice properties:
> 
> 1)  ASCII characters are encoded as they are - by using bytes in the range #0..#127
> 
> 2) non-ASCII characters will always use a sequence of bytes, that are all in the range #128..#255 (they have their highest bit set), so they will never be misinterpreted as ASCII.
> 
> So, the tokenizer just works with UTF-8 like with any other 8-bit code page.

Yes, this works until you reach a non-ASCII character; from then on the character index no longer matches the byte index into the string one to one. For example, suppose this were Pascal:

i := '🐻';

You can advance by index like:

 Inc(currentIndex);
 c := text[currentIndex];

But once you hit the bear, which takes four bytes in UTF-8, the offset is wrong: +1 lands in the middle of its byte sequence, so you can't advance to the next character that way.
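
To make the problem concrete, here is a rough sketch of advancing by a whole UTF-8 sequence instead of by +1. It assumes the input is valid UTF-8 held in a byte string; Utf8SeqLen and src are names made up for this illustration (src plays the role of text above), and nothing here is taken from the FPC tokenizer itself.

```pascal
program Utf8Advance;

{$mode objfpc}{$H+}

// Hypothetical helper, not part of any FPC unit: returns how many bytes the
// UTF-8 sequence starting with lead byte b occupies. Invalid lead bytes and
// stray continuation bytes count as 1 so the scanner always moves forward.
function Utf8SeqLen(b: Byte): Integer;
begin
  if b < $80 then
    Result := 1          // 0xxxxxxx: ASCII
  else if (b and $E0) = $C0 then
    Result := 2          // 110xxxxx: two-byte sequence
  else if (b and $F0) = $E0 then
    Result := 3          // 1110xxxx: three-byte sequence
  else if (b and $F8) = $F0 then
    Result := 4          // 11110xxx: four-byte sequence
  else
    Result := 1;         // continuation or invalid byte (assumed not to occur)
end;

var
  src: RawByteString;
  currentIndex, len: Integer;
begin
  // The example statement, with the bear (U+1F43B) spelled out as its UTF-8 bytes.
  src := 'i := ''' + #$F0#$9F#$90#$BB + ''';';
  currentIndex := 1;
  while currentIndex <= Length(src) do
  begin
    len := Utf8SeqLen(Byte(src[currentIndex]));
    WriteLn('byte index ', currentIndex, ': ', len, ' byte(s)');
    Inc(currentIndex, len);  // step over the whole sequence, not just one byte
  end;
end.
```

With this input the bear starts at byte index 7 and the closing quote follows at byte index 11, so the byte index and the character count drift apart from that point on. The sketch only finds sequence boundaries; it does not decode the scalar value or validate the continuation bytes.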

Regards,
Ryan Joseph


