[fpc-pascal] Parse unicode scalar

Mon Jul 3 06:43:29 CEST 2023

On Mon, 3 Jul 2023 08:29:11 +0700
Hairy Pixels via fpc-pascal <fpc-pascal at lists.freepascal.org> wrote:

> > On Jul 2, 2023, at 11:16 PM, Jer Haan <jdehaan2014 at gmail.com> wrote:
> > 
> > This table is copied from Wikipedia. <uencoding.pas> Hope it’s useful
> > for you. If you improve the code pls let me know. 
> 
> This is perfect, thanks! Much more complicated than I thought.
> 
> I'm curious now: if you were going the other direction and parsing a
> string of different unicode characters with different code point
> sequence lengths, how would you know which length it was? For example,
> I started off knowing which unicode scalar to use by looking at a table,
> but what if I had to find the character in a stream of text?
> 
> I think UTF8 can have 1-4 byte characters, so you could encounter a
> 1-byte character followed by 4-byte characters interleaved, and there's
> no header or terminator for each character. How is this solved?

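There is a header byte: the first byte of each sequence tells you how
many bytes the character occupies.

How strictly you parse it depends on whether you want to check for
invalid UTF-8 sequences.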
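
If you do want to reject malformed input, a stricter check could look
roughly like the sketch below. UTF8CodepointSizeStrict and its BytesLeft
parameter are hypothetical names made up for this example and are not part
of LazUTF8. It only checks the header byte and the 10xxxxxx pattern of the
following bytes; overlong 3- and 4-byte forms and surrogates are not
detected, so treat it as a starting point rather than a full validator.

```pascal
program StrictSizeDemo;
{$mode objfpc}{$H+}

// Hypothetical helper, not part of LazUTF8: returns the length in bytes of
// the sequence starting at p, or 0 if it is malformed or cut off.
// BytesLeft is the number of bytes available from p onward.
function UTF8CodepointSizeStrict(p: PChar; BytesLeft: integer): integer;
var
  i: integer;
begin
  Result := 0;
  if BytesLeft < 1 then
    exit;
  case p^ of
    #0..#127   : Result := 1; // ASCII
    #194..#223 : Result := 2; // #192 and #193 could only start overlong forms
    #224..#239 : Result := 3;
    #240..#244 : Result := 4; // #245..#247 would exceed U+10FFFF
    else exit;                // continuation byte or invalid header byte
  end;
  if Result > BytesLeft then
  begin
    Result := 0;              // sequence is cut off
    exit;
  end;
  // Every following byte must have the form 10xxxxxx.
  for i := 1 to Result - 1 do
    if (ord(p[i]) and $C0) <> $80 then
    begin
      Result := 0;
      exit;
    end;
end;

var
  s: string;
begin
  s := #$E2#$82#$AC;  // U+20AC, a complete 3-byte sequence
  WriteLn(UTF8CodepointSizeStrict(PChar(s), Length(s))); // 3
  s := #$E2#$82;      // the same sequence with its last byte missing
  WriteLn(UTF8CodepointSizeStrict(PChar(s), Length(s))); // 0
  s := #$82#$AC;      // starts with a continuation byte
  WriteLn(UTF8CodepointSizeStrict(PChar(s), Length(s))); // 0
end.
```

Returning 0 for bad input is just one possible convention; the caller can
then skip one byte and try again, or stop with an error.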

From LazUTF8:

function UTF8CodepointSizeFast(p: PChar): integer;
begin
  case p^ of
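    #0..#191   : Result := 1; // ASCII, or a stray continuation byte (10xxxxxx)
    #192..#223 : Result := 2; // header 110xxxxx
    #224..#239 : Result := 3; // header 1110xxxx
    #240..#247 : Result := 4; // header 11110xxx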
    else Result := 1; // An optimization + prevents compiler warning about uninitialized Result.
  end;
end;
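
To walk a stream of text you hop from header byte to header byte. Here is
a minimal sketch of that idea, assuming the input is valid UTF-8.
CountCodepoints is a hypothetical helper written for this example, and
UTF8CodepointSizeFast is repeated from the excerpt above only so that the
program compiles on its own.

```pascal
program WalkUTF8;
{$mode objfpc}{$H+}

// Same logic as the LazUTF8 excerpt, repeated to keep the sketch self-contained.
function UTF8CodepointSizeFast(p: PChar): integer;
begin
  case p^ of
    #0..#191   : Result := 1;
    #192..#223 : Result := 2;
    #224..#239 : Result := 3;
    #240..#247 : Result := 4;
    else Result := 1;
  end;
end;

// Hypothetical helper: counts codepoints by reading each header byte and
// skipping ahead by the length it announces.
function CountCodepoints(const s: string): integer;
var
  p, pEnd: PChar;
begin
  Result := 0;
  if s = '' then
    exit;
  p := PChar(s);
  pEnd := p + Length(s);
  while p < pEnd do
  begin
    inc(p, UTF8CodepointSizeFast(p));
    inc(Result);
  end;
end;

begin
  // 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes)
  WriteLn(CountCodepoints('a'#$C3#$A9#$E2#$82#$AC#$F0#$9F#$98#$80)); // 4
end.
```

With truncated or otherwise invalid input the last hop can step past the
end of the string; the loop still stops, but the count is then only an
approximation.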

Mattias