[fpc-devel] Re: enumerators

Thu Nov 18 15:07:03 CET 2010

In our previous episode, Michael Schnell said:
> > Either you have UTF-8 with surrogates, or you have ASCII (since UTF-8
> > without surrogates means that only char 0..127 are valid, which is ASCII)
> In another post surrogate pairs have been denoted as a specialty of a 16 
> Bit coding (UCS-2), and I did not understand why this was introduced in 
> a discussion about UTF-8. I just accepted that this somehow would leak 
> into UTF-8 as a special (alternate) way to code certain Unicode characters.

Surrogates are what you need for characters that can't be encoded in one
encoding space (one code unit: a 16-bit value in UTF-16, a byte in UTF-8).

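63488 code points (the 16-bit range minus the 2048 values reserved for
surrogates) can be encoded in a single 16-bit value in UTF-16, and 128
(0..127) in a single 8-bit value in UTF-8. The rest must be encoded in
multiple encoding spaces, and these are called surrogates.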
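
To make the counting concrete, here is a minimal sketch in Python (not
FPC code; the helper name code_units is made up for this illustration)
that reports how many 8-bit and 16-bit values one character needs:

```python
def code_units(ch):
    """Return (UTF-8 bytes, UTF-16 16-bit values) needed for one character."""
    utf8_units = len(ch.encode("utf-8"))
    # utf-16-le is used so that no byte order mark is prepended.
    utf16_units = len(ch.encode("utf-16-le")) // 2
    return utf8_units, utf16_units


# One sample each: ASCII, Latin-1 range, rest of the 16-bit range, beyond it.
for ch in ("A", "\u00e9", "\u20ac", "\U0001d11e"):
    print("U+%04X" % ord(ch), code_units(ch))
```

Only the last sample, which lies outside the 16-bit range, needs two
16-bit values in UTF-16, while everything above 127 already needs more
than one byte in UTF-8.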

> I did not think about calling the up to four bytes of a normal UTF-8 
> "character" "surrogates" (to me these are "codes" or something like this).

Anything larger than 1 byte is a surrogate in UTF-8. And since UTF-8 uses
the top bit of each byte to signal a "larger" character, a single byte is
left with only the low ASCII values 0..127.
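
The role of that top bit can be sketched as follows, again in Python and
again only as an illustration (classify and describe are hypothetical
helper names, not part of any library):

```python
def classify(byte):
    """Classify one UTF-8 byte by its top bits."""
    if (byte & 0x80) == 0:
        return "single"        # 0xxxxxxx: plain ASCII, values 0..127
    if (byte & 0xC0) == 0x80:
        return "continuation"  # 10xxxxxx: follows a lead byte
    return "lead"              # 11xxxxxx: starts a multi-byte sequence


def describe(text):
    """Return (bit pattern, kind) for every byte of the UTF-8 encoding."""
    return [(format(b, "08b"), classify(b)) for b in text.encode("utf-8")]


for bits, kind in describe("A\u20ac"):
    print(bits, kind)
```

Every byte that is part of a multi-byte sequence has its top bit set, so
a byte below 128 can always be read on its own as ASCII.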