[fpc-pascal] Parse unicode scalar
Hairy Pixels
genericptr at gmail.com
Mon Jul 3 06:58:33 CEST 2023
> On Jul 3, 2023, at 11:43 AM, Mattias Gaertner via fpc-pascal <fpc-pascal at lists.freepascal.org> wrote:
>
> There is a header byte.
>
> It depends, if you want to check for invalid UTF-8 sequences.
>
> From LazUTF8:
>
> function UTF8CodepointSizeFast(p: PChar): integer;
> begin
>   case p^ of
>     #0..#191   : Result := 1;
>     #192..#223 : Result := 2;
>     #224..#239 : Result := 3;
>     #240..#247 : Result := 4;
>     else Result := 1; // An optimization + prevents compiler warning about uninitialized Result.
>   end;
> end;
This is a header for the file? Does that mean the file itself must have uniform character sizes? I thought the idea was to read the file one byte at a time, but I don't understand how you would know whether a 1-byte character (like ASCII) is part of a 4-byte character or not.
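
For reference, here is a minimal sketch of how the quoted function could be used to step through a buffer one character at a time. It assumes the "header byte" means the first byte of each character rather than a header of the file, and that the input is valid UTF-8. In UTF-8, bytes 0..127 are always complete ASCII characters and continuation bytes are always in the range 128..191, so an ASCII byte cannot appear inside a 4-byte character. CountCodepoints and the Sample array are hypothetical names made up for this example; only UTF8CodepointSizeFast comes from the quoted LazUTF8 snippet.

```pascal
program Utf8Walk;
{$mode objfpc}{$H+}

// Copied from the quoted LazUTF8 snippet: the size of a character,
// judged only from its first (lead) byte.
function UTF8CodepointSizeFast(p: PChar): integer;
begin
  case p^ of
    #0..#191   : Result := 1;
    #192..#223 : Result := 2;
    #224..#239 : Result := 3;
    #240..#247 : Result := 4;
    else Result := 1;
  end;
end;

// Hypothetical helper (not part of LazUTF8): counts characters by
// jumping from lead byte to lead byte. Assumes valid UTF-8 input;
// no check is made that the continuation bytes are really there.
function CountCodepoints(p: PChar; len: SizeInt): integer;
var
  pEnd: PChar;
begin
  Result := 0;
  pEnd := p + len;
  while p < pEnd do
  begin
    Inc(p, UTF8CodepointSizeFast(p)); // skip the whole character
    Inc(Result);
  end;
end;

const
  // 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes)
  Sample: array[0..9] of Byte =
    ($61, $C3, $A9, $E2, $82, $AC, $F0, $9F, $98, $80);

begin
  // prints "10 bytes, 4 codepoints"
  WriteLn(Length(Sample), ' bytes, ',
          CountCodepoints(PChar(@Sample[0]), Length(Sample)), ' codepoints');
end.
```

The loop never inspects a continuation byte as if it started a character, because each step skips the full length announced by the lead byte. Rejecting invalid sequences would need extra checks that this sketch leaves out.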
Regards,
Ryan Joseph