[fpc-pascal] Parse unicode scalar

Mon Jul 3 12:18:56 CEST 2023

> On Jul 3, 2023, at 4:29 PM, Mattias Gaertner via fpc-pascal <fpc-pascal at lists.freepascal.org> wrote:
> 
>> What I'm really trying to do is improve a parser so it can read UTF-8
>> files and decode unicode literals in the grammar.
> 
> First of all: Is it valid UTF-8 or do you have to check for broken or
> malicious sequences?

If they give the parser broken files that's their problem they need to fix? the user has control over the file so it's there responsibility I think.

> 
> 
>> Right now I've just read the file into an AnsiString and indexing
>> assuming a fixed character size, which breaks of course if non-1 byte
>> characters exist
> 
> Sounds like UTF8CodepointToUnicode in unit LazUTF8 could be useful:
> 
> function UTF8CodepointToUnicode(p: PChar; out CodepointLen: integer): Cardinal;

Not sure how this works. You need to advance by character so there return value should be the byte location of the next character or something like that.

> 
> 
>> I also need to know if I come across something like \u1F496 I need
>> to convert that to a unicode character.
> 
> I guess you know how to convert a hex to a dword.

Is there anything better than StrToInt? I wouldn't be able to do it myself though without that function.

> Then
> 
> function UnicodeToUTF8(CodePoint: cardinal): string; // UTF32 to UTF8
> function UnicodeToUTF8(CodePoint: cardinal; Buf: PChar): integer; // UTF32 to UTF8
> 

Ok I think this is basically what the other programmer submitted and what ChatGPT tried to do.

Regards,
Ryan Joseph