[fpc-pascal] Parse unicode scalar

Nikolay Nikolov nickysn at gmail.com
Tue Jul 4 06:47:51 CEST 2023


On 7/4/23 07:45, Nikolay Nikolov wrote:
>
> On 7/4/23 07:40, Hairy Pixels via fpc-pascal wrote:
>>
>>> On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal 
>>> <fpc-pascal at lists.freepascal.org> wrote:
>>>
>>> For what grammar? What characters are allowed in a token? For 
>>> example, Free Pascal also has a parser/tokenizer, but since Pascal 
>>> keywords are ASCII only, it doesn't need to understand Unicode 
>>> characters, so it works on the byte (Pascal's char type) level (for 
>>> UTF-8 files, this means UTF-8 Unicode code units). That's because 
>>> UTF-8 has two nice properties:
>>>
>>> 1) ASCII characters are encoded as they are - by using bytes in the 
>>> range #0..#127
>>>
>>> 2) non-ASCII characters will always use a sequence of bytes, that 
>>> are all in the range #128..#255 (they have their highest bit set), 
>>> so they will never be misinterpreted as ASCII.
>>>
>>> So, the tokenizer just works with UTF-8 like with any other 8-bit 
>>> code page.
>> Yes, this works until you reach a non-ASCII character, and then 
>> the character index no longer matches the byte index 1 to 1. For 
>> example, suppose this were Pascal:
>>
>> i := '🐻';
>>
>> You can advance by index like:
>>
>>   Inc(currentIndex);
>>   c := text[currentIndex];
>>
>> but once you hit the bear the offset is now wrong so you can't 
>> advance to the next character by doing +1.
>
> But you just don't need to do this, in order to tokenize Pascal. The 
> beginning and the end of the string literal is the apostrophe, which 
> is ASCII. The bear is a sequence of UTF-8 code units (opaque to the 
> compiler), that will not be mistaken for an apostrophe, or end of 
> line, because they will have their high bit set. There's simply no 
> need for a Pascal tokenizer to iterate over UTF-8 code points, instead 
> of code units.

Sorry, the last sentence should read: "There's simply no need for a 
Pascal tokenizer to iterate over Unicode code points, instead of UTF-8 
code units." Hope that makes it more clear and accurate.
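
To illustrate, here is a minimal sketch of scanning a string literal 
one byte at a time. It is not the actual Free Pascal scanner code; the 
program and the helper FindLiteralEnd are hypothetical names made up for 
this example. The bear (U+1F43B) is written out as its four UTF-8 bytes, 
so the result does not depend on the encoding of the source file.

```pascal
program ByteLevelScan;

{$mode objfpc}{$H+}

// Hypothetical helper, for illustration only.
// Returns the index of the apostrophe that closes the string literal
// opening at Src[Start], or 0 if the literal is not closed on this line.
// It compares bytes (UTF-8 code units) and never decodes code points.
function FindLiteralEnd(const Src: AnsiString; Start: SizeInt): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  i := Start + 1;
  while i <= Length(Src) do
  begin
    case Src[i] of
      '''':
        // A doubled apostrophe is an escaped quote inside the literal.
        if (i < Length(Src)) and (Src[i + 1] = '''') then
          Inc(i)
        else
          Exit(i);
      #10, #13:
        Exit(0); // end of line before the closing apostrophe
    end;
    // Every other byte, including #128..#255, is simply skipped.
    Inc(i);
  end;
end;

const
  // The source line  i := '<bear>';  with the bear as raw UTF-8 bytes.
  Line = 'i := ''' + #$F0#$9F#$90#$BB + ''';';
var
  Src: AnsiString;
  OpenPos, ClosePos: SizeInt;
begin
  Src := Line;
  OpenPos := Pos('''', Src);
  ClosePos := FindLiteralEnd(Src, OpenPos);
  WriteLn('literal spans bytes ', OpenPos, '..', ClosePos);
end.
```

The opening apostrophe is byte 6 and the closing one is byte 11; the 
four bytes in between are all in the range #128..#255, so none of them 
can compare equal to an apostrophe or an end of line. Advancing by +1 is 
therefore safe here, even though it lands in the middle of the bear.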

Nikolay


