[fpc-pascal] Parse unicode scalar
Hairy Pixels
genericptr at gmail.com
Mon Jul 3 12:18:56 CEST 2023
> On Jul 3, 2023, at 4:29 PM, Mattias Gaertner via fpc-pascal <fpc-pascal at lists.freepascal.org> wrote:
>
>> What I'm really trying to do is improve a parser so it can read UTF-8
>> files and decode unicode literals in the grammar.
>
> First of all: Is it valid UTF-8 or do you have to check for broken or
> malicious sequences?
If they give the parser broken files, that's their problem to fix. The user has control over the file, so it's their responsibility, I think.
>
>
>> Right now I've just read the file into an AnsiString and am indexing
>> it assuming a fixed character size, which breaks of course if non-1 byte
>> characters exist
>
> Sounds like UTF8CodepointToUnicode in unit LazUTF8 could be useful:
>
> function UTF8CodepointToUnicode(p: PChar; out CodepointLen: integer): Cardinal;
Not sure how this works. You need to advance by character, so the return value should be the byte location of the next character or something like that.
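
Going by the signature alone, my reading is that the return value is the code point and the out parameter CodepointLen is the number of bytes it occupied, so advancing would just mean adding that length to the pointer. A minimal sketch under that assumption (CountCodepoints is a made-up helper, the input is assumed to be valid UTF-8, and LazUTF8 comes from Lazarus' LazUtils package rather than the FPC RTL):

```pascal
program WalkUtf8;
{$mode objfpc}{$H+}
uses
  LazUTF8; // part of Lazarus' LazUtils package, not the FPC RTL

// Hypothetical helper: counts code points by stepping through the bytes.
// Assumes S holds valid UTF-8; no checking for broken sequences.
function CountCodepoints(const S: AnsiString): Integer;
var
  P, PEnd: PChar;
  Len: Integer;
begin
  Result := 0;
  if S = '' then Exit;
  P := PChar(S);
  PEnd := P + Length(S);
  while P < PEnd do
  begin
    // Return value is the code point (unused here); Len is its size in bytes.
    UTF8CodepointToUnicode(P, Len);
    if Len <= 0 then Break; // defensive: never loop without advancing
    Inc(P, Len);            // step to the first byte of the next character
    Inc(Result);
  end;
end;

begin
  // 'a', U+00E9 and U+1F496 written as raw UTF-8 bytes
  WriteLn(CountCodepoints('a'#$C3#$A9#$F0#$9F#$92#$96));
end.
```

A parser would presumably keep the returned code point instead of discarding it and compare that against the grammar.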
>
>
>> I also need to know if I come across something like \u1F496 I need
>> to convert that to a unicode character.
>
> I guess you know how to convert a hex to a dword.
Is there anything better than StrToInt? I wouldn't be able to do it myself without that function, though.
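
For reference, StrToInt in SysUtils accepts a '$' prefix for hexadecimal, so the digits after \u can be handed to it directly. A rough sketch of that idea (ParseHexCodepoint is a made-up name, and the range check is my own addition; it does not reject surrogates):

```pascal
program HexEscape;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Hypothetical helper: turns the hex digits after "\u" into a code point.
function ParseHexCodepoint(const Hex: string): Cardinal;
var
  V: LongInt;
begin
  V := StrToInt('$' + Hex); // raises EConvertError if Hex is not valid hex
  if (V < 0) or (V > $10FFFF) then
    raise EConvertError.Create('Not a Unicode code point: ' + Hex);
  Result := Cardinal(V);
end;

begin
  WriteLn(ParseHexCodepoint('1F496')); // prints the value in decimal
end.
```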
> Then
>
> function UnicodeToUTF8(CodePoint: cardinal): string; // UTF32 to UTF8
> function UnicodeToUTF8(CodePoint: cardinal; Buf: PChar): integer; // UTF32 to UTF8
>
OK, I think this is basically what the other programmer submitted and what ChatGPT tried to do.
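
As a quick check of the string-returning overload quoted above, something like the following could be used to look at the bytes it produces. This is only a sketch: Utf8BytesHex is a made-up helper for inspection, and it assumes LazUTF8 from LazUtils is on the unit path.

```pascal
program EncodeUtf8;
{$mode objfpc}{$H+}
uses
  SysUtils, LazUTF8; // LazUTF8 is part of Lazarus' LazUtils package

// Hypothetical helper: encodes CP as UTF-8 and returns the bytes as hex pairs.
function Utf8BytesHex(CP: Cardinal): string;
var
  S: string;
  I: Integer;
begin
  S := UnicodeToUTF8(CP); // UTF-32 code point to a UTF-8 encoded string
  Result := '';
  for I := 1 to Length(S) do
  begin
    if I > 1 then Result := Result + ' ';
    Result := Result + IntToHex(Ord(S[I]), 2);
  end;
end;

begin
  WriteLn(Utf8BytesHex($1F496));
end.
```

The string returned by UnicodeToUTF8 could then be appended to the decoded literal in place of the escape sequence.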
Regards,
Ryan Joseph