[fpc-pascal] UnicodeString and surrogate pairs

Michael Schnell mschnell at lumino.de
Fri Apr 29 11:23:57 CEST 2016


On 04/29/2016 11:09 AM, Graeme Geldenhuys wrote:
>
> No, because UTF-8 doesn't use surrogate pairs.
Really ?

I understand that a "surrogate pair" combines one printable character 
(i.e. one of the nearly 2^32 UTF thingies) with another one to form a 
different printable thingy (e.g. "A" plus "add two dots above" to 
create an "Ä").
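
As a point of reference, the Unicode standard normally calls the "A" plus 
"two dots above" case a combining sequence, and uses "surrogate pair" for 
the two 16-bit units that UTF-16 needs for a code point above U+FFFF. The 
sketch below is only an illustration in Python (chosen for brevity, not 
taken from this thread); the helper name utf16_units is hypothetical.

```python
import unicodedata


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# "A" followed by U+0308 COMBINING DIAERESIS: two code points, one visible glyph.
decomposed = "A\u0308"

# NFC normalization merges the sequence into the precomposed U+00C4 ("Ä").
composed = unicodedata.normalize("NFC", decomposed)

# U+1D11E MUSICAL SYMBOL G CLEF lies above U+FFFF, so UTF-16 needs two units.
g_clef = "\U0001D11E"

print([hex(u) for u in utf16_units(decomposed)])  # two units, neither a surrogate
print([hex(u) for u in utf16_units(composed)])    # one unit
print([hex(u) for u in utf16_units(g_clef)])      # high and low surrogate
print(len(g_clef.encode("utf-8")))                # 4 bytes, no surrogates involved
```

Under these assumptions, the combining sequence never touches the range 
0xD800..0xDFFF, while the G clef is stored in UTF-16 as one unit from 
0xD800..0xDBFF followed by one from 0xDC00..0xDFFF.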

Now a series of 32-bit UTF thingies can be compressed either to a 
series of UTF8 encoded bytes or to a series of UTF16 encoded words. Both 
of these are usually much shorter (measured in bytes) than the 
uncompressed UTF32 information.
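
A rough way to check the size claim is to encode the same text in all 
three forms and count bytes. This is a minimal sketch in Python, not 
something from the thread; the function name encoded_sizes and the sample 
strings are made up for illustration.

```python
def encoded_sizes(text):
    """Return byte counts for text in UTF-8, UTF-16 and UTF-32 (no byte order mark)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# Hypothetical sample strings covering three ranges of code points.
samples = {
    "ASCII only": "Pascal",
    "Latin with umlaut": "\u00c4rger",   # precomposed U+00C4 plus "rger"
    "above U+FFFF": "\U0001D11E",        # MUSICAL SYMBOL G CLEF
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
```

For the two Latin samples UTF-8 comes out smallest and UTF-32 largest, but 
for the single code point above U+FFFF all three forms take 4 bytes, so 
"usually shorter" depends on the text.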

So the UTF8 vs UTF16 issue is a lower layer of encoding.

-Michael


