[fpc-pascal] Parse unicode scalar

Mon Jul 3 09:59:24 CEST 2023

On Mon, 3 Jul 2023 14:12:03 +0700
Hairy Pixels via fpc-pascal <fpc-pascal at lists.freepascal.org> wrote:

> > On Jul 3, 2023, at 2:04 PM, Tomas Hajny via fpc-pascal
> > <fpc-pascal at lists.freepascal.org> wrote:
> > 
> > No - in this case, the "header" is the highest bit of that byte
> > being 0.  
> 
> Oh it's the header BIT. Admittedly I don't understand how this
> function returns the highest bit using that case, which I think he
> was suggesting.

The first byte of a UTF-8 codepoint is in 0..127 or 192..247.
The second, third and fourth bytes are in 128..191, so you can easily
detect where a codepoint starts.
From the first byte you can derive the length of the codepoint.
If you just want to skip over n codepoints, then the function below does
the job:

> function UTF8CodepointSizeFast(p: PChar): integer;
> begin
>  case p^ of
>    #0..#191   : Result := 1;
>    #192..#223 : Result := 2;
>    #224..#239 : Result := 3;
>    #240..#247 : Result := 4;
>    else Result := 1; // An optimization + prevents compiler warning about uninitialized Result.
>  end;
> end;
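
To show how the function above could be used for skipping, here is a minimal sketch. The helper name UTF8SkipCodepoints and the demo program are hypothetical and not taken from any library. It assumes that p points at the start of a codepoint and that the string is valid, null-terminated UTF-8. With a truncated sequence at the end, the pointer could step past the terminator.

```pascal
program SkipDemo;
{$mode objfpc}{$H+}

// Same table as the function above, repeated so the sketch compiles on its own.
function UTF8CodepointSizeFast(p: PChar): integer;
begin
  case p^ of
    #0..#191   : Result := 1;
    #192..#223 : Result := 2;
    #224..#239 : Result := 3;
    #240..#247 : Result := 4;
    else Result := 1;
  end;
end;

// Hypothetical helper: advance p over up to n codepoints,
// stopping early at the terminating #0.
// Assumes valid UTF-8 and that p points at the start of a codepoint.
function UTF8SkipCodepoints(p: PChar; n: integer): PChar;
begin
  while (n > 0) and (p^ <> #0) do
  begin
    Inc(p, UTF8CodepointSizeFast(p));
    Dec(n);
  end;
  Result := p;
end;

var
  s: string;
  p: PChar;
begin
  // 'a' (1 byte), U+00E4 (2 bytes), U+20AC (3 bytes), 'z' (1 byte), as raw UTF-8 bytes
  s := 'a'#$C3#$A4#$E2#$82#$AC'z';
  p := UTF8SkipCodepoints(PChar(s), 3);
  WriteLn(p - PChar(s));  // byte offset after 3 codepoints: 1 + 2 + 3 = 6
end.
```

The loop never inspects the continuation bytes. It trusts the first byte of each codepoint, which is what makes it fast and also why it is only safe on valid input.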

Mattias