[fpc-pascal] Parse unicode scalar
    Mattias Gaertner 
    nc-gaertnma at netcologne.de
       
    Mon Jul  3 20:15:25 CEST 2023
    
    
  
On Mon, 3 Jul 2023 17:18:56 +0700
Hairy Pixels via fpc-pascal <fpc-pascal at lists.freepascal.org> wrote:
>[...]
> > First of all: Is it valid UTF-8 or do you have to check for broken
> > or malicious sequences?  
> 
> If they give the parser broken files, that's their problem to fix.
> The user has control over the file, so it's their responsibility, I
> think.
User's responsibility?
 - I recommend checking for malicious sequences. ;)
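A minimal sketch of such a check, assuming the whole file is already in an AnsiString. IsValidUTF8 is a hypothetical helper written for this illustration, not a LazUTF8 function. It rejects stray or missing continuation bytes, overlong encodings, UTF-16 surrogates and values above U+10FFFF:

```pascal
program CheckUTF8;
{$mode objfpc}{$H+}

// Hypothetical helper, not part of any FPC or Lazarus unit.
// Returns True if s is well-formed UTF-8.
function IsValidUTF8(const s: AnsiString): Boolean;
var
  i, n, k, Len: SizeInt;
  b: Byte;
  cp, MinCp: Cardinal;
begin
  Result := False;
  Len := Length(s);
  i := 1;
  while i <= Len do
  begin
    b := Ord(s[i]);
    // The lead byte determines the number of continuation bytes (n)
    // and the smallest codepoint that may use this length (MinCp).
    if b < $80 then
    begin n := 0; cp := b; MinCp := 0; end
    else if (b and $E0) = $C0 then
    begin n := 1; cp := b and $1F; MinCp := $80; end
    else if (b and $F0) = $E0 then
    begin n := 2; cp := b and $0F; MinCp := $800; end
    else if (b and $F8) = $F0 then
    begin n := 3; cp := b and $07; MinCp := $10000; end
    else
      Exit; // stray continuation byte or invalid lead byte
    if i + n > Len then Exit; // truncated sequence
    for k := 1 to n do
    begin
      b := Ord(s[i + k]);
      if (b and $C0) <> $80 then Exit; // not a continuation byte
      cp := (cp shl 6) or (b and $3F);
    end;
    if cp < MinCp then Exit;                      // overlong encoding
    if (cp >= $D800) and (cp <= $DFFF) then Exit; // UTF-16 surrogate
    if cp > $10FFFF then Exit;                    // beyond the Unicode range
    Inc(i, n + 1);
  end;
  Result := True;
end;

begin
  WriteLn(IsValidUTF8('abc'));            // plain ASCII, valid
  WriteLn(IsValidUTF8(#$F0#$9F#$92#$96)); // U+1F496, valid
  WriteLn(IsValidUTF8(#$C0#$80));         // overlong encoding of U+0000, rejected
  WriteLn(IsValidUTF8(#$E2#$82));         // truncated 3-byte sequence, rejected
end.
```

Running this once over the input before parsing means the decoding loop afterwards does not have to handle broken sequences itself.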
> >> Right now I've just read the file into an AnsiString and am
> >> indexing it assuming a fixed character size, which of course
> >> breaks if multi-byte characters exist  
> > 
> > Sounds like UTF8CodepointToUnicode in unit LazUTF8 could be useful:
> > 
> > function UTF8CodepointToUnicode(p: PChar; out CodepointLen: integer): Cardinal;  
> 
> Not sure how this works. You need to advance by character, so the
> return value should be the byte location of the next character or
> something like that.
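CodepointLen gives the number of bytes the codepoint occupies, so you
advance the pointer by that yourself:
function ReadUTF8(p: PChar; ByteCount: PtrInt): PtrInt;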
// returns the number of codepoints
var
  CodePointLen: longint;
  CodePoint: longword;
begin
  Result:=0;
  while (ByteCount>0) do begin
    inc(Result);
    CodePoint:=UTF8CodepointToUnicode(p,CodePointLen);
    // ...do something with the CodePoint...
    inc(p,CodePointLen);
    dec(ByteCount,CodePointLen);
  end;
end;
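As a usage sketch, the same loop applied to a concrete string. This assumes the LazUtils package (which provides unit LazUTF8) is on the unit path. The program name and the sample bytes are made up for illustration:

```pascal
program CountCodepoints;
{$mode objfpc}{$H+}
uses LazUTF8; // from the Lazarus LazUtils package

var
  s: AnsiString;
  p: PChar;
  Remaining: PtrInt;
  CodePointLen: longint;
  CodePoint: longword;
begin
  // 'a', U+00E4 and U+1F496 written as raw UTF-8 bytes
  s := 'a' + #$C3#$A4 + #$F0#$9F#$92#$96;
  p := PChar(s);
  Remaining := Length(s);
  while Remaining > 0 do
  begin
    CodePoint := UTF8CodepointToUnicode(p, CodePointLen);
    WriteLn('U+', HexStr(CodePoint, 6), ' uses ', CodePointLen, ' byte(s)');
    Inc(p, CodePointLen);         // step to the next codepoint
    Dec(Remaining, CodePointLen); // bytes still to read
  end;
end.
```

The three sample codepoints occupy 1, 2 and 4 bytes, so indexing the AnsiString by a fixed character size would land in the middle of a sequence.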
> >> Also, if I come across something like \u1F496, I need to convert
> >> that to a Unicode character.  
> > 
> > I guess you know how to convert a hex string to a dword.  
> 
> Is there anything better than StrToInt?
Good start.
> I wouldn't be able to do it
> myself though without that function.
Hex to dword. That's easy enough for ChatGPT.
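A minimal sketch of the hex conversion, assuming the parser has already isolated the digits that follow `\u`. HexToCodePoint is a hypothetical helper for illustration. StrToInt from SysUtils gives the same value when the digits are prefixed with '$':

```pascal
program HexEscape;
{$mode objfpc}{$H+}
uses SysUtils;

// Hypothetical helper: parses hex digits into a codepoint.
// Returns False on an empty string, a non-hex character,
// or a value above U+10FFFF.
function HexToCodePoint(const Hex: string; out CodePoint: Cardinal): Boolean;
var
  i: Integer;
  Digit: Cardinal;
begin
  Result := False;
  CodePoint := 0;
  if Hex = '' then Exit;
  for i := 1 to Length(Hex) do
  begin
    case Hex[i] of
      '0'..'9': Digit := Ord(Hex[i]) - Ord('0');
      'a'..'f': Digit := Ord(Hex[i]) - Ord('a') + 10;
      'A'..'F': Digit := Ord(Hex[i]) - Ord('A') + 10;
    else
      Exit; // not a hex digit
    end;
    CodePoint := (CodePoint shl 4) or Digit;
    if CodePoint > $10FFFF then Exit; // out of range, also prevents overflow
  end;
  Result := True;
end;

var
  cp: Cardinal;
begin
  if HexToCodePoint('1F496', cp) then
    // compare against StrToInt with a '$' prefix
    WriteLn(cp = Cardinal(StrToInt('$1F496')));
end.
```

Unlike StrToInt, this variant reports bad input through its return value instead of raising an exception, which may be more convenient inside a parser.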
> > function UnicodeToUTF8(CodePoint: cardinal): string; // UTF32 to UTF8
> > function UnicodeToUTF8(CodePoint: cardinal; Buf: PChar): integer; // UTF32 to UTF8
> 
> Ok I think this is basically what the other programmer submitted and
> what ChatGPT tried to do.
Yes, no need to reinvent the wheel.
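As a short sketch of the string overload quoted above, assuming LazUtils is on the unit path, this encodes the codepoint from the \u1F496 example and dumps the resulting bytes. The program name is made up for illustration:

```pascal
program EncodeCodepoint;
{$mode objfpc}{$H+}
uses LazUTF8; // from the Lazarus LazUtils package

var
  s: string;
  i: Integer;
begin
  s := UnicodeToUTF8($1F496); // codepoint taken from the \u1F496 example
  // print each byte of the UTF-8 result in hex
  for i := 1 to Length(s) do
    Write(HexStr(Ord(s[i]), 2), ' ');
  WriteLn;
end.
```

The UTF-8 encoding of U+1F496 is the four bytes F0 9F 92 96, so that is what the dump should show.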
Mattias
    
    