[fpc-pascal] UnicodeString and surrogate pairs

Graeme Geldenhuys mailinglists at geldenhuys.co.uk
Sat Apr 30 12:12:35 CEST 2016

Hello Michael,

On 2016-04-29 at 11:23 you wrote:
> > No, because UTF-8 doesn't use surrogate pairs.  
> Really ?


> those to be combined to a different printable thingy (e.g. "A" plus
> "add two dots above" to create a "Ä").

No, that is something totally different and not what I was talking
about. You are referring to combining diacritics: two or more code-points
(think "characters") combined to form what looks like a single new
character on screen or in print.

> Both of which usually is much shorter (measured in bytes) than the 
> uncompressed UTF32 information.

You are not using the correct terminology, but I think you are referring
to the composed and decomposed forms of a character.

For example:

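   e (U+0065) + ̈  (U+0308) = ë             (2 code-points used)
   e (U+0065) + ̈  (U+0308) --> ë (U+00EB)  (1 code-point used)

The first example above results in the decomposed version of ë: the
base letter and the combining diaeresis stay as two code-points. The
second example above results in the composed version of ë: the pair is
replaced by the single precomposed code-point U+00EB.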
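
To make the difference concrete, here is a minimal sketch. It is written
in Python purely for illustration (it is not FPC code) and assumes only
the standard unicodedata module; the helper name code_points is my own.

```python
import unicodedata

decomposed = "\u0065\u0308"  # e + COMBINING DIAERESIS (2 code-points)
composed = "\u00EB"          # LATIN SMALL LETTER E WITH DIAERESIS (1 code-point)


def code_points(s):
    """Return the code-points of s as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in s]


print(code_points(decomposed))  # ['U+0065', 'U+0308']
print(code_points(composed))    # ['U+00EB']

# Both render as the same character, but a plain comparison sees
# different code-point sequences.
print(decomposed == composed)   # False

# Normalising both sides to the same form makes them comparable.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```

NFC is the composed normalisation form and NFD the decomposed one, so
comparing text reliably means normalising both sides to the same form
first.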

The decomposed versions are the preferred and recommended form according
to the Unicode Consortium. The Consortium only included the composed
versions for backward compatibility with the character sets that already
existed when the Unicode standard was established. No new composed
code-points will be added to the Unicode standard.

Anyway, I was referring to surrogate pairs (which apply to UTF-16 only),
not composed/decomposed forms.
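
For contrast, here is a rough sketch of what a surrogate pair is. Again
it is Python for illustration only, not FPC code, and the function name
to_surrogates is hypothetical. A code-point above U+FFFF does not fit in
one 16-bit unit, so UTF-16 splits it into a high and a low surrogate.
UTF-8 encodes the same code-point directly as four bytes, with no
surrogates involved.

```python
def to_surrogates(cp):
    """Split a code-point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code-point; no surrogate pair exists")
    offset = cp - 0x10000              # 20 bits remain after removing the BMP range
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


ch = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP
high, low = to_surrogates(ord(ch))

print(f"{high:04X} {low:04X}")        # D834 DD1E
print(ch.encode("utf-16-be").hex())   # d834dd1e  (two 16-bit code units)
print(ch.encode("utf-8").hex())       # f09d849e  (four bytes, no surrogates)
```

This is also why a UTF-16 string type counts such a character as two
elements even though it is a single code-point.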

