[fpc-pascal] Parse unicode scalar
Mattias Gaertner
nc-gaertnma at netcologne.de
Mon Jul 3 06:43:29 CEST 2023
On Mon, 3 Jul 2023 08:29:11 +0700
Hairy Pixels via fpc-pascal <fpc-pascal at lists.freepascal.org> wrote:
> > On Jul 2, 2023, at 11:16 PM, Jer Haan <jdehaan2014 at gmail.com> wrote:
> >
> > This table is copied from Wikipedia. <uencoding.pas> Hope it’s useful
> > for you. If you improve the code pls let me know.
>
> This is perfect, thanks! Much more complicated than I thought.
>
> I'm curious now: if you were going the other direction and parsing a
> string of different Unicode characters with different code point
> sequence lengths, how would you know which length it was? For example,
> I started off knowing which Unicode scalar to use by looking at a table,
> but what if I had to find the character in a stream of text?
>
> I think UTF-8 can have 1-4 byte characters, so you could encounter
> 1-byte characters interleaved with 4-byte characters, and there's
> no header or terminator for each character. How is this solved?
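There is a header byte: the first byte of each codepoint tells you how
many bytes the sequence has.
How you read it depends on whether you want to check for invalid UTF-8
sequences.
From LazUTF8, the fast variant that does no validation: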
function UTF8CodepointSizeFast(p: PChar): integer;
begin
  case p^ of
    #0..#191   : Result := 1;
    #192..#223 : Result := 2;
    #224..#239 : Result := 3;
    #240..#247 : Result := 4;
    else Result := 1; // An optimization + prevents compiler warning about uninitialized Result.
  end;
end;
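
To show how the header byte answers the question about interleaved lengths, here is a minimal sketch of walking a string one codepoint at a time. It is not LazUTF8 code: SizeFromHeader repeats the ranges above so the program is self-contained, and SizeChecked is a hypothetical helper that additionally checks that every following byte is a continuation byte (bit pattern 10xxxxxx). It does not reject overlong encodings or surrogates, so treat it as an illustration rather than a full validator.

```pascal
program Utf8Walk;
{$mode objfpc}{$H+}

// Header-byte test using the same ranges as UTF8CodepointSizeFast above.
function SizeFromHeader(c: Char): Integer;
begin
  case c of
    #192..#223: Result := 2;
    #224..#239: Result := 3;
    #240..#247: Result := 4;
  else
    Result := 1;
  end;
end;

// Hypothetical checked variant (an assumption, not part of LazUTF8):
// returns 0 if the sequence starting at index i is malformed.
function SizeChecked(const s: AnsiString; i: Integer): Integer;
var
  k: Integer;
begin
  // A continuation byte or #248..#255 cannot start a codepoint.
  if s[i] in [#128..#191, #248..#255] then Exit(0);
  Result := SizeFromHeader(s[i]);
  // The sequence must not run past the end of the string.
  if i + Result - 1 > Length(s) then Exit(0);
  // Every byte after the header must look like 10xxxxxx.
  for k := i + 1 to i + Result - 1 do
    if (Ord(s[k]) and $C0) <> $80 then Exit(0);
end;

var
  s: AnsiString;
  i, n, count: Integer;
begin
  // 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes)
  s := 'a'#$C3#$A9#$E2#$82#$AC#$F0#$9F#$98#$80;
  i := 1;
  count := 0;
  while i <= Length(s) do
  begin
    n := SizeChecked(s, i);
    if n = 0 then
    begin
      WriteLn('malformed sequence at byte ', i);
      Break;
    end;
    WriteLn('byte ', i, ': ', n, '-byte codepoint');
    Inc(i, n);   // jump straight to the next header byte
    Inc(count);
  end;
  WriteLn(count, ' codepoints in ', Length(s), ' bytes');
end.
```

The sample string holds four codepoints in ten bytes. No terminator is needed between them because each header byte says how far to jump to reach the next one.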
Mattias