[fpc-pascal] Parse unicode scalar

Mon Jul 3 20:15:25 CEST 2023

On Mon, 3 Jul 2023 17:18:56 +0700
Hairy Pixels via fpc-pascal <fpc-pascal at lists.freepascal.org> wrote:

>[...]
> > First of all: Is it valid UTF-8 or do you have to check for broken
> > or malicious sequences?  
> 
> If they give the parser broken files, that's their problem to fix.
> The user has control over the file, so it's their responsibility,
> I think.

User's responsibility?
 - I recommend checking for broken or malicious sequences anyway. ;)

> >> Right now I've just read the file into an AnsiString and am
> >> indexing it assuming a fixed character size, which of course
> >> breaks if multi-byte characters exist  
> > 
> > Sounds like UTF8CodepointToUnicode in unit LazUTF8 could be useful:
> > 
> > function UTF8CodepointToUnicode(p: PChar; out CodepointLen: integer): Cardinal;
> 
> Not sure how this works. You need to advance by character, so the
> return value should be the byte location of the next character or
> something like that.

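function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;
// returns the number of codepoints
var
  CodePointLen: longint;
  CodePoint: longword;
begin
  Result:=0;
  while (ByteCount>0) do begin
    // CodePointLen receives the byte length of the codepoint at p
    CodePoint:=UTF8CodepointToUnicode(p,CodePointLen);
    if CodePointLen<=0 then break; // zero length (e.g. p=nil): avoid an endless loop
    inc(Result);
    // ...do something with the CodePoint...
    inc(p,CodePointLen);           // advance to the next codepoint
    dec(ByteCount,CodePointLen);
  end;
end;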
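
As a usage illustration, here is a minimal sketch of the same loop run
over a string. It assumes the LazUTF8 unit (LazUtils package) is
available and that the string holds UTF-8; the program name and the
sample bytes are made up for the example.

```pascal
program DecodeDemo;
{$mode objfpc}{$H+}
uses
  LazUTF8; // from the LazUtils package (assumed to be on the unit path)

var
  s: string;
  p: PChar;
  Remaining: PtrInt;
  CodePointLen: Integer;
  CodePoint: Cardinal;
begin
  // Sample data: 'a', U+00E9 and U+1F496 written as raw UTF-8 bytes
  s := 'a'#$C3#$A9#$F0#$9F#$92#$96;
  p := PChar(s);
  Remaining := Length(s);
  while Remaining > 0 do
  begin
    CodePoint := UTF8CodepointToUnicode(p, CodePointLen);
    if CodePointLen <= 0 then
      Break; // nothing consumed, stop instead of looping forever
    // Print the codepoint value and how many bytes it occupied
    WriteLn(CodePoint, ' ', CodePointLen);
    Inc(p, CodePointLen);         // the length is what advances the pointer
    Dec(Remaining, CodePointLen);
  end;
end.
```

The function returns the codepoint, and the out parameter gives the
number of bytes to skip to reach the next one.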

> >> Also, if I come across something like \u1F496, I need to
> >> convert that to a Unicode character.  
> > 
> > I guess you know how to convert a hex to a dword.  
> 
> Is there anything better than StrToInt?

Good start.

> I wouldn't be able to do it
> myself though without that function.

Hex to dword. That's easy enough for ChatGPT.
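
For illustration, a minimal sketch of such a conversion without any
library call. HexToDWord is a hypothetical helper name, not an existing
RTL or LazUTF8 function. StrToInt from SysUtils also handles hex if the
digits are prefixed with '$'.

```pascal
program HexDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: converts up to 8 hex digits (no prefix) to a Cardinal.
// Returns False for an empty string, a non-hex character, or more than 8 digits.
function HexToDWord(const s: string; out Value: Cardinal): Boolean;
var
  i: Integer;
  Digit: Cardinal;
begin
  Value := 0;
  Result := False;
  if (s = '') or (Length(s) > 8) then
    Exit;
  for i := 1 to Length(s) do
  begin
    case s[i] of
      '0'..'9': Digit := Ord(s[i]) - Ord('0');
      'a'..'f': Digit := Ord(s[i]) - Ord('a') + 10;
      'A'..'F': Digit := Ord(s[i]) - Ord('A') + 10;
    else
      Exit; // not a hex digit
    end;
    Value := (Value shl 4) or Digit; // shift in the next 4 bits
  end;
  Result := True;
end;

var
  cp: Cardinal;
begin
  if HexToDWord('1F496', cp) then
    WriteLn(cp)  // prints 128150
  else
    WriteLn('not a hex number');
end.
```

The caller still has to check that the result is a valid scalar value
(at most $10FFFF and not in the surrogate range $D800..$DFFF).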

> > function UnicodeToUTF8(CodePoint: cardinal): string; // UTF32 to UTF8
> > function UnicodeToUTF8(CodePoint: cardinal; Buf: PChar): integer; // UTF32 to UTF8
> 
> Ok I think this is basically what the other programmer submitted and
> what ChatGPT tried to do.

Yes, no need to reinvent the wheel.
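
A minimal sketch of the string variant in use, assuming the LazUTF8
unit (LazUtils package) is available; the program name is made up for
the example.

```pascal
program EncodeDemo;
{$mode objfpc}{$H+}
uses
  LazUTF8; // from the LazUtils package (assumed to be on the unit path)

var
  s: string;
begin
  // $1F496 is the value from the \u1F496 example
  s := UnicodeToUTF8($1F496);
  // U+1F496 needs four UTF-8 bytes: F0 9F 92 96
  WriteLn(Length(s));
end.
```

The result can be appended to the parser's output string like any
other string.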

Mattias