[fpc-pascal] Parse unicode scalar

Tue Jul 4 06:45:20 CEST 2023

On 7/4/23 07:40, Hairy Pixels via fpc-pascal wrote:
>
>> On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal <fpc-pascal at lists.freepascal.org> wrote:
>>
>> For what grammar? What characters are allowed in a token? For example, Free Pascal also has a parser/tokenizer, but since Pascal keywords are ASCII only, it doesn't need to understand Unicode characters, so it works on the byte (Pascal's char type) level (for UTF-8 files, this means UTF-8 Unicode code units). That's because UTF-8 has two nice properties:
>>
>> 1) ASCII characters are encoded as they are, using bytes in the range #0..#127
>>
>> 2) non-ASCII characters always use a sequence of bytes that are all in the range #128..#255 (they have their highest bit set), so they will never be misinterpreted as ASCII.
>>
>> So, the tokenizer just works with UTF-8 like with any other 8-bit code page.
> Yes, this works until you reach a non-ASCII character; from then on, character indexes no longer map 1 to 1 onto byte indexes in the string. For example, suppose this were Pascal:
>
> i := '🐻';
>
> You can advance by index like:
>
>   Inc(currentIndex);
>   c := text[currentIndex];
>
> but once you hit the bear, the offset is wrong, so you can't advance to the next character by doing +1.

But you simply don't need to do this in order to tokenize Pascal. The
beginning and the end of the string literal are both marked by an
apostrophe, which is ASCII. The bear is a sequence of UTF-8 code units
(opaque to the compiler) that will not be mistaken for an apostrophe or
an end of line, because every one of them has its high bit set. There's
simply no need for a Pascal tokenizer to iterate over Unicode code
points instead of UTF-8 code units.
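
To illustrate, here is a minimal sketch of a byte-level scan over a string literal. It is not taken from the FPC sources: FindLiteralEnd is a hypothetical helper, and the bear is written out as its four UTF-8 bytes so that the example does not depend on the encoding of the source file. The scanner only ever advances by one byte and still finds the closing apostrophe.

```pascal
program ScanLiteral;
{$mode objfpc}{$H+}

{ Hypothetical helper, for illustration only. Given the index of an opening
  apostrophe, returns the index of the closing apostrophe, or 0 if the
  literal is not terminated on the same line. It works purely on bytes
  (UTF-8 code units) and never decodes a code point. }
function FindLiteralEnd(const Src: RawByteString; Start: Integer): Integer;
var
  i: Integer;
begin
  Result := 0;
  i := Start + 1;
  while i <= Length(Src) do
  begin
    case Src[i] of
      '''':
        if (i < Length(Src)) and (Src[i + 1] = '''') then
          Inc(i)            { doubled apostrophe inside the literal: skip it }
        else
        begin
          Result := i;      { closing apostrophe found }
          Exit;
        end;
      #10, #13:
        Exit;               { end of line: unterminated literal }
    end;
    Inc(i);                 { always advance by exactly one byte }
  end;
end;

const
  Bear = #$F0#$9F#$90#$BB;  { U+1F43B as four UTF-8 code units, all >= #128 }
var
  Src: RawByteString;
  OpenPos, ClosePos: Integer;
begin
  Src := 'i := ''' + Bear + ''';';
  OpenPos := Pos('''', Src);
  ClosePos := FindLiteralEnd(Src, OpenPos);
  WriteLn('opening apostrophe at byte ', OpenPos);
  WriteLn('closing apostrophe at byte ', ClosePos);
  WriteLn('literal payload is ', ClosePos - OpenPos - 1, ' bytes');
end.
```

The payload comes out as four bytes for a single visible character, and the scanner never needs to know that. None of the bear's bytes can match the apostrophe, #10 or #13 branches, since all of those are below #128.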

Nikolay