[fpc-pascal] Parse unicode scalar
nc-gaertnma at netcologne.de
Mon Jul 3 20:15:25 CEST 2023
On Mon, 3 Jul 2023 17:18:56 +0700
Hairy Pixels via fpc-pascal <fpc-pascal at lists.freepascal.org> wrote:
> > First of all: Is it valid UTF-8 or do you have to check for broken
> > or malicious sequences?
> If they give the parser broken files, that's their problem and they
> need to fix it. The user has control over the file, so it's their
> responsibility, I think.
I recommend checking for malicious sequences anyway. ;)
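
As a rough sketch of what such a check involves: the usual things to
reject are truncated sequences, stray continuation bytes, overlong
encodings, surrogates and values above $10FFFF. DecodeStrict and
IsValidUtf8 below are hypothetical names made up for this example, not
LazUTF8 functions.

```pascal
program ValidateUtf8Sketch;
{$mode objfpc}{$H+}

// Hypothetical helper: decodes one codepoint at p, reading at most
// BytesLeft bytes. Returns False for truncated, overlong, surrogate or
// out-of-range sequences. Len is only meaningful when Result is True.
function DecodeStrict(p: PChar; BytesLeft: PtrInt; out CodePoint: Cardinal;
  out Len: Integer): Boolean;
var
  b: Byte;
  i: Integer;
  MinValue: Cardinal;
begin
  Result := False;
  CodePoint := 0;
  Len := 0;
  if (p = nil) or (BytesLeft <= 0) then Exit;
  b := Ord(p^);
  if b < $80 then begin
    Len := 1; CodePoint := b; MinValue := 0;
  end else if (b and $E0) = $C0 then begin
    Len := 2; CodePoint := b and $1F; MinValue := $80;
  end else if (b and $F0) = $E0 then begin
    Len := 3; CodePoint := b and $0F; MinValue := $800;
  end else if (b and $F8) = $F0 then begin
    Len := 4; CodePoint := b and $07; MinValue := $10000;
  end else
    Exit; // stray continuation byte or invalid lead byte
  if Len > BytesLeft then Exit; // sequence runs past the end of the buffer
  for i := 1 to Len - 1 do begin
    b := Ord(p[i]);
    if (b and $C0) <> $80 then Exit; // not a continuation byte
    CodePoint := (CodePoint shl 6) or (b and $3F);
  end;
  if CodePoint < MinValue then Exit; // overlong encoding
  if (CodePoint >= $D800) and (CodePoint <= $DFFF) then Exit; // surrogate
  if CodePoint > $10FFFF then Exit; // beyond the Unicode range
  Result := True;
end;

// Hypothetical helper: True if every byte of s belongs to a valid sequence.
function IsValidUtf8(const s: RawByteString): Boolean;
var
  p: PChar;
  Remaining: PtrInt;
  cp: Cardinal;
  n: Integer;
begin
  p := PChar(s);
  Remaining := Length(s);
  while Remaining > 0 do begin
    if not DecodeStrict(p, Remaining, cp, n) then Exit(False);
    Inc(p, n);
    Dec(Remaining, n);
  end;
  Result := True;
end;

begin
  WriteLn(IsValidUtf8('abc'));             // plain ASCII
  WriteLn(IsValidUtf8(#$F0#$9F#$92#$96));  // U+1F496 as four bytes
  WriteLn(IsValidUtf8(#$C0#$80));          // overlong encoding of NUL
  WriteLn(IsValidUtf8(#$ED#$A0#$80));      // encoded surrogate U+D800
  WriteLn(IsValidUtf8(#$E2#$82));          // truncated three-byte sequence
end.
```

The first two calls should be accepted and the last three rejected.
Whether a parser needs to be this strict depends on where its input
comes from.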
> >> Right now I've just read the file into an AnsiString and am
> >> indexing it assuming a fixed character size, which of course
> >> breaks if multi-byte characters exist
> > Sounds like UTF8CodepointToUnicode in unit LazUTF8 could be useful:
> > function UTF8CodepointToUnicode(p: PChar; out CodepointLen: integer): Cardinal;
> Not sure how this works. You need to advance by character, so the
> return value should be the byte location of the next character or
> something like that.
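function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;
// returns the number of codepoints
var CodePointLen: integer; CodePoint: cardinal;
begin
  Result:=0;
  while (ByteCount>0) do begin
    CodePoint:=UTF8CodepointToUnicode(p,CodePointLen);
    // ...do something with the CodePoint...
    inc(p,CodePointLen); dec(ByteCount,CodePointLen); inc(Result);
  end;
end;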
> >> I also need to handle escapes: if I come across something like
> >> \u1F496, I need to convert that to a Unicode character.
> > I guess you know how to convert a hex to a dword.
> Is there anything better than StrToInt?
> I wouldn't be able to do it
> myself though without that function.
Hex to dword. That's easy enough for ChatGPT.
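
For what it's worth, StrToInt from SysUtils accepts a '$' prefix for
hex, so StrToInt('$1F496') already does the job. If you would rather
not raise an exception on bad input, a hand-rolled version is only a
few lines. ParseHex below is a hypothetical name for this sketch, and
the limit of 8 digits is an assumption so the value fits in a
Cardinal. How many digits may follow \u depends on your grammar.

```pascal
program HexToDWordSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: converts 1 to 8 hex digits to a Cardinal.
// Returns False if s is empty, too long, or contains a non-hex character.
function ParseHex(const s: string; out Value: Cardinal): Boolean;
var
  i: Integer;
  Digit: Cardinal;
begin
  Value := 0;
  Result := (Length(s) >= 1) and (Length(s) <= 8);
  if not Result then Exit;
  for i := 1 to Length(s) do begin
    case s[i] of
      '0'..'9': Digit := Ord(s[i]) - Ord('0');
      'a'..'f': Digit := Ord(s[i]) - Ord('a') + 10;
      'A'..'F': Digit := Ord(s[i]) - Ord('A') + 10;
    else
      Exit(False);
    end;
    // each hex digit contributes four bits
    Value := (Value shl 4) or Digit;
  end;
end;

var
  v: Cardinal;
begin
  if ParseHex('1F496', v) then
    WriteLn(v)               // $1F496 = 128150
  else
    WriteLn('invalid');
  WriteLn(ParseHex('1F49G', v)); // 'G' is not a hex digit
end.
```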
> > function UnicodeToUTF8(CodePoint: cardinal): string; // UTF32 to UTF8
> > function UnicodeToUTF8(CodePoint: cardinal; Buf: PChar): integer; // UTF32 to UTF8
> Ok I think this is basically what the other programmer submitted and
> what ChatGPT tried to do.
Yes, no need to reinvent the wheel.
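
If you are curious what such a function does under the hood, the
encoding direction can be sketched as follows. EncodeUtf8 is a
hypothetical stand-in written for illustration, not the LazUTF8
implementation. Returning an empty string for surrogates and
out-of-range values is an assumption of this sketch.

```pascal
program EncodeUtf8Sketch;
{$mode objfpc}{$H+}

// Hypothetical stand-in for illustration, not LazUTF8's implementation.
// Returns '' for surrogates and for values above $10FFFF.
function EncodeUtf8(CodePoint: Cardinal): RawByteString;
begin
  Result := '';
  if CodePoint < $80 then begin
    // 1 byte: 0xxxxxxx
    SetLength(Result, 1);
    Result[1] := AnsiChar(Byte(CodePoint));
  end else if CodePoint < $800 then begin
    // 2 bytes: 110xxxxx 10xxxxxx
    SetLength(Result, 2);
    Result[1] := AnsiChar(Byte($C0 or (CodePoint shr 6)));
    Result[2] := AnsiChar(Byte($80 or (CodePoint and $3F)));
  end else if CodePoint < $10000 then begin
    if (CodePoint >= $D800) and (CodePoint <= $DFFF) then Exit;
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    SetLength(Result, 3);
    Result[1] := AnsiChar(Byte($E0 or (CodePoint shr 12)));
    Result[2] := AnsiChar(Byte($80 or ((CodePoint shr 6) and $3F)));
    Result[3] := AnsiChar(Byte($80 or (CodePoint and $3F)));
  end else if CodePoint <= $10FFFF then begin
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    SetLength(Result, 4);
    Result[1] := AnsiChar(Byte($F0 or (CodePoint shr 18)));
    Result[2] := AnsiChar(Byte($80 or ((CodePoint shr 12) and $3F)));
    Result[3] := AnsiChar(Byte($80 or ((CodePoint shr 6) and $3F)));
    Result[4] := AnsiChar(Byte($80 or (CodePoint and $3F)));
  end;
end;

var
  s: RawByteString;
  i: Integer;
begin
  s := EncodeUtf8($1F496);
  // print the encoded bytes in hex: F0 9F 92 96
  for i := 1 to Length(s) do
    Write(HexStr(Ord(s[i]), 2), ' ');
  WriteLn;
end.
```

In real code I would still call UnicodeToUTF8 rather than maintain a
private copy.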