[fpc-pascal] Parse unicode scalar

Mon Jul 3 11:29:53 CEST 2023

On Mon, 3 Jul 2023 15:27:10 +0700
Hairy Pixels via fpc-pascal <fpc-pascal at lists.freepascal.org> wrote:

>[...]
> I was just curious how ChatGPT's implementation compared to other
> programmers'.

Apparently the quality is often terrible. But it can be useful.

> What I'm really trying to do is improve a parser so it can read UTF-8
> files and decode unicode literals in the grammar.

First of all: Is it valid UTF-8 or do you have to check for broken or
malicious sequences?

> Right now I've just read the file into an AnsiString and am indexing it
> assuming a fixed character size, which of course breaks if multi-byte
> characters exist

Sounds like UTF8CodepointToUnicode in unit LazUTF8 could be useful:

function UTF8CodepointToUnicode(p: PChar; out CodepointLen: integer): Cardinal;
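
A minimal sketch of how a parser could walk a UTF-8 AnsiString one code
point at a time instead of one byte at a time. It assumes the LazUtils
package is on the unit path and that the input is already valid UTF-8;
the program name and the sample string are only for illustration.

```pascal
program IterateCodepoints;
{$mode objfpc}{$H+}
uses
  LazUTF8; // part of the Lazarus LazUtils package

var
  S: string;
  P, PEnd: PChar;
  Len: Integer;
  CP: Cardinal;
begin
  // 'a', U+00E4 and U+1F496 written as raw UTF-8 bytes
  S := 'a'#$C3#$A4#$F0#$9F#$92#$96;
  P := PChar(S);
  PEnd := P + Length(S);
  while P < PEnd do
  begin
    // Decode the code point at P; Len receives its size in bytes
    CP := UTF8CodepointToUnicode(P, Len);
    if Len <= 0 then
      Break; // defensive: never loop without advancing
    WriteLn('U+', HexStr(Int64(CP), 6), ' uses ', Len, ' byte(s)');
    Inc(P, Len);
  end;
end.
```

The point is that the parser advances by Len bytes per step, not by 1.
If the input may be broken or malicious, it needs to be validated
separately, because this loop does not reject bad sequences itself.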

> I also need to know when I come across something like \u1F496, because
> I need to convert that to a unicode character.

I guess you know how to convert a hex string to a dword. Then use one of:

function UnicodeToUTF8(CodePoint: cardinal): string; // UTF32 to UTF8
function UnicodeToUTF8(CodePoint: cardinal; Buf: PChar): integer; // UTF32 to UTF8
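
A rough sketch of the two steps together: read the hex digits after \u
into a Cardinal, then hand it to UnicodeToUTF8. ParseHexScalar is a
hypothetical helper, not part of LazUTF8. Taking up to six hex digits
after \u is my assumption about the grammar, so adjust it to whatever
your escape syntax really allows.

```pascal
program DecodeEscape;
{$mode objfpc}{$H+}
uses
  SysUtils, LazUTF8; // LazUTF8 is part of the Lazarus LazUtils package

// Hypothetical helper: reads up to 6 hex digits starting at S[Idx],
// advances Idx past them and returns the value as a code point.
function ParseHexScalar(const S: string; var Idx: Integer): Cardinal;
var
  Digits: Integer;
begin
  Result := 0;
  Digits := 0;
  while (Idx <= Length(S)) and (Digits < 6)
    and (S[Idx] in ['0'..'9', 'A'..'F', 'a'..'f']) do
  begin
    Result := Result * 16 + Cardinal(StrToInt('$' + S[Idx]));
    Inc(Idx);
    Inc(Digits);
  end;
  // Reject empty escapes, values above U+10FFFF and surrogates
  if (Digits = 0) or (Result > $10FFFF)
    or ((Result >= $D800) and (Result <= $DFFF)) then
    raise Exception.Create('Invalid unicode scalar value');
end;

var
  Src, Utf8: string;
  I, J: Integer;
begin
  Src := '\u1F496';
  I := 3; // position just after the leading '\u'
  Utf8 := UnicodeToUTF8(ParseHexScalar(Src, I));
  // Print the encoded bytes in hex
  for J := 1 to Length(Utf8) do
    Write(HexStr(Ord(Utf8[J]), 2), ' ');
  WriteLn;
end.
```

U+1F496 is encoded in UTF-8 as the four bytes F0 9F 92 96, so that is
what the resulting string should contain. The PChar overload of
UnicodeToUTF8 is the one to look at if you want to write into an
existing buffer instead of allocating a string per escape.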

Mattias