[fpc-pascal] Parse unicode scalar
Mattias Gaertner
nc-gaertnma at netcologne.de
Mon Jul 3 09:59:24 CEST 2023
On Mon, 3 Jul 2023 14:12:03 +0700
Hairy Pixels via fpc-pascal <fpc-pascal at lists.freepascal.org> wrote:
> > On Jul 3, 2023, at 2:04 PM, Tomas Hajny via fpc-pascal
> > <fpc-pascal at lists.freepascal.org> wrote:
> >
> > No - in this case, the "header" is the highest bit of that byte
> > being 0.
>
> Oh it's the header BIT. Admittedly I don't understand how this
> function returns the highest bit using that case, which I think he
> was suggesting.
The first byte of a UTF-8 codepoint is in 0..127 or 192..247.
The second, third, and fourth bytes are in 128..191, so you can easily
detect where a codepoint starts.
And from the first byte you can derive the length of the codepoint.
If you just want to skip over n codepoints, then the function below
gives you the size of each step:
> function UTF8CodepointSizeFast(p: PChar): integer;
> begin
>   case p^ of
>     #0..#191   : Result := 1;
>     #192..#223 : Result := 2;
>     #224..#239 : Result := 3;
>     #240..#247 : Result := 4;
>   else
>     // An optimization + prevents compiler warning about uninitialized Result.
>     Result := 1;
>   end;
> end;
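
As a rough sketch of how the function above could be used to skip n
codepoints: the helpers IsCodepointStart and SkipCodepoints are
hypothetical names made up for this example, not part of any library,
and the loop assumes valid, null-terminated UTF-8 (a truncated sequence
at the end of the buffer would make it step past the terminator).

```pascal
program SkipCodepointsDemo;
{$mode objfpc}{$H+}

// Same logic as the function quoted above.
function UTF8CodepointSizeFast(p: PChar): integer;
begin
  case p^ of
    #0..#191   : Result := 1;
    #192..#223 : Result := 2;
    #224..#239 : Result := 3;
    #240..#247 : Result := 4;
  else
    Result := 1;
  end;
end;

// Hypothetical helper: True if p points at the first byte of a codepoint,
// i.e. the byte is not a continuation byte (128..191).
function IsCodepointStart(p: PChar): boolean;
begin
  Result := not (p^ in [#128..#191]);
end;

// Hypothetical helper: advance p by up to n codepoints, stopping early at
// the terminating #0. Assumes valid, null-terminated UTF-8.
function SkipCodepoints(p: PChar; n: integer): PChar;
begin
  while (n > 0) and (p^ <> #0) do
  begin
    inc(p, UTF8CodepointSizeFast(p));
    dec(n);
  end;
  Result := p;
end;

const
  // 'a', U+00E4 (2 bytes), U+20AC (3 bytes), 'z', terminator
  Buf: array[0..7] of char =
    ('a', #$C3, #$A4, #$E2, #$82, #$AC, 'z', #0);
var
  start, p: PChar;
begin
  start := @Buf[0];
  p := SkipCodepoints(start, 3);
  writeln(p - start);                    // byte offset after 3 codepoints: 6
  writeln(p^);                           // z
  writeln(IsCodepointStart(start + 2));  // FALSE, #$A4 is a continuation byte
end.
```

The byte offset is 1 + 2 + 3 = 6 because the three skipped codepoints
occupy one, two and three bytes respectively.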
Mattias