[fpc-pascal] Parse unicode scalar
Mattias Gaertner
nc-gaertnma at netcologne.de
Mon Jul 3 11:29:53 CEST 2023
On Mon, 3 Jul 2023 15:27:10 +0700
Hairy Pixels via fpc-pascal <fpc-pascal at lists.freepascal.org> wrote:
>[...]
> I was just curious how ChatGPT's implementation compared to other
> programmers'.
Apparently the quality is often terrible, but it can still be useful.
> What I'm really trying to do is improve a parser so it can read UTF-8
> files and decode unicode literals in the grammar.
First of all: is the input guaranteed to be valid UTF-8, or do you have
to check for broken or malicious sequences?
> Right now I've just read the file into an AnsiString and am indexing
> it assuming a fixed character size, which of course breaks if
> multi-byte characters exist
Sounds like UTF8CodepointToUnicode in unit LazUTF8 could be useful:
function UTF8CodepointToUnicode(p: PChar; out CodepointLen: integer): Cardinal;
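
A minimal sketch of how that could replace fixed-size indexing, assuming
the input is valid UTF-8 and the LazUtils package is available. The
program name, the variable names and the sample bytes are only for
illustration:

```pascal
program IterateCodepoints;
{$mode objfpc}{$H+}

uses
  LazUTF8;  // part of the Lazarus LazUtils package

var
  S: string;
  P, PEnd: PChar;
  CodepointLen: Integer;
  CP: Cardinal;
begin
  // Illustrative input: 'a', U+00E4 and U+1F496 written as raw UTF-8 bytes.
  S := 'a'#$C3#$A4#$F0#$9F#$92#$96;
  P := PChar(S);
  PEnd := P + Length(S);
  while P < PEnd do
  begin
    // Decode the codepoint starting at P; CodepointLen receives its byte length.
    CP := UTF8CodepointToUnicode(P, CodepointLen);
    if CodepointLen <= 0 then
      Break;  // defensive guard so the loop always advances
    WriteLn('U+', HexStr(CP, 6), ' uses ', CodepointLen, ' byte(s)');
    Inc(P, CodepointLen);  // step by the decoded length, not by 1
  end;
end.
```

This does not validate the input. If the file may contain broken
sequences, check it before decoding.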
> Also, if I come across something like \u1F496, I need to convert
> that to a Unicode character.
I guess you know how to convert a hex string to a DWord. Then use one of:
function UnicodeToUTF8(CodePoint: cardinal): string; // UTF32 to UTF8
function UnicodeToUTF8(CodePoint: cardinal; Buf: PChar): integer; // UTF32 to UTF8
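
A rough sketch of the whole step, assuming the parser has already cut
out the hex digits after "\u". HexEscapeToUTF8 is a hypothetical helper
name, and the range check is my assumption about what you want to
reject:

```pascal
program EscapeToUTF8;
{$mode objfpc}{$H+}

uses
  SysUtils, LazUTF8;  // LazUTF8 is part of the Lazarus LazUtils package

// Hypothetical helper: HexDigits holds only the digits after "\u", e.g. '1F496'.
function HexEscapeToUTF8(const HexDigits: string): string;
var
  CP: LongInt;
begin
  // The '$' prefix makes StrToInt parse hex; it raises EConvertError on bad input.
  CP := StrToInt('$' + HexDigits);
  // Assumption: reject anything that is not a Unicode scalar value
  // (out of range, or in the surrogate range D800..DFFF).
  if (CP < 0) or (CP > $10FFFF) or ((CP >= $D800) and (CP <= $DFFF)) then
    raise Exception.Create('Not a Unicode scalar value: ' + HexDigits);
  Result := UnicodeToUTF8(Cardinal(CP));
end;

var
  S: string;
begin
  S := HexEscapeToUTF8('1F496');
  // Length counts bytes here, not characters.
  WriteLn('UTF-8 length in bytes: ', Length(S));
end.
```

The overload that takes a Buf: PChar is the one to use if you want to
write straight into an existing buffer instead of allocating a string
per escape.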
Mattias