[fpc-pascal] Parse unicode scalar
Nikolay Nikolov
nickysn at gmail.com
Tue Jul 4 06:47:51 CEST 2023
On 7/4/23 07:45, Nikolay Nikolov wrote:
>
> On 7/4/23 07:40, Hairy Pixels via fpc-pascal wrote:
>>
>>> On Jul 4, 2023, at 11:28 AM, Nikolay Nikolov via fpc-pascal
>>> <fpc-pascal at lists.freepascal.org> wrote:
>>>
>>> For what grammar? What characters are allowed in a token? For
>>> example, Free Pascal also has a parser/tokenizer, but since Pascal
>>> keywords are ASCII only, it doesn't need to understand Unicode
>>> characters, so it works on the byte (Pascal's char type) level (for
>>> UTF-8 files, this means UTF-8 Unicode code units). That's because
>>> UTF-8 has two nice properties:
>>>
>>> 1) ASCII characters are encoded as themselves, using bytes in the
>>> range #0..#127
>>>
>>> 2) non-ASCII characters are always encoded as a sequence of bytes
>>> that are all in the range #128..#255 (their highest bit is set),
>>> so they will never be misinterpreted as ASCII.
>>>
>>> So, the tokenizer just works with UTF-8 like with any other 8-bit
>>> code page.
>> Yes, this works until you reach a non-ASCII character; after that
>> the character index no longer matches the byte index one to one. For
>> example, suppose this were Pascal:
>>
>> i := '🐻';
>>
>> You can advance by index like:
>>
>> Inc(currentIndex);
>> c := text[currentIndex];
>>
>> but once you hit the bear the offset is now wrong so you can't
>> advance to the next character by doing +1.
>
> But you simply don't need to do this in order to tokenize Pascal. The
> beginning and the end of the string literal are apostrophes, which
> are ASCII. The bear is a sequence of UTF-8 code units (opaque to the
> compiler) that will not be mistaken for an apostrophe or an end of
> line, because they all have their high bit set. There's simply no
> need for a Pascal tokenizer to iterate over UTF-8 code points, instead
> of code units.
Sorry, the last sentence should read: "There's simply no need for a
Pascal tokenizer to iterate over Unicode code points, instead of UTF-8
code units." I hope that makes it clearer and more accurate.
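
To make the point concrete, here is a minimal sketch of a byte-level scan
over the bear example. It is not taken from the Free Pascal compiler
sources; FindLiteralEnd is a hypothetical helper written only for this
illustration, and the bear (U+1F43B) is spelled out as its four UTF-8 code
units so that the example does not depend on the encoding of the source
file. It assumes the usual Pascal rule that a doubled apostrophe inside a
literal stands for one apostrophe.

```pascal
program ByteLevelScan;
{$mode objfpc}{$H+}

{ Hypothetical helper: returns the index of the apostrophe that closes the
  string literal opening at Src[Start], or 0 if the literal is not closed
  before the end of the line or of the input. It advances one byte (one
  UTF-8 code unit) at a time and never decodes a code point. }
function FindLiteralEnd(const Src: string; Start: Integer): Integer;
var
  i: Integer;
begin
  Result := 0;
  i := Start + 1;
  while i <= Length(Src) do
  begin
    case Src[i] of
      '''':
        if (i < Length(Src)) and (Src[i + 1] = '''') then
          Inc(i)   { doubled apostrophe: skip the second one as well }
        else
        begin
          Result := i;
          Exit;
        end;
      #10, #13:
        Exit;      { end of line reached before the closing apostrophe }
    end;
    { Bytes in #128..#255 match no branch above and are simply skipped. }
    Inc(i);
  end;
end;

var
  Src: string;
  OpenPos, ClosePos: Integer;
begin
  { i := '<bear>'; with the bear written as four explicit bytes }
  Src := 'i := ''' + #$F0#$9F#$90#$BB + ''';';
  OpenPos := Pos('''', Src);
  ClosePos := FindLiteralEnd(Src, OpenPos);
  WriteLn('opening apostrophe at byte ', OpenPos);
  WriteLn('closing apostrophe at byte ', ClosePos);
  WriteLn('literal payload is ', ClosePos - OpenPos - 1, ' bytes');
end.
```

The loop only ever does Inc(i), yet it still stops on the correct closing
apostrophe: the four bytes of the bear are all above #127, so none of them
can compare equal to an apostrophe or a line break. The payload comes out
as four bytes, which is a count of code units, not of characters; a
tokenizer that only needs the literal's boundaries never has to know the
difference.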
Nikolay