[fpc-devel] ansistrings and widestrings

Thu Jan 6 01:00:43 CET 2005

> PPS. AFAIK UTF-8 is not used internally in any OS - it's only
> used for storing
> UNICODE text in more compact form - web site authors really like it.

I believe a lot of Linux distros are switching to it for the console, at
least for less common languages. I don't know how GUI stuff on Linux handles
text. The Windows routines for going from UTF-16 to local codesets and back
can also go from UTF-16 to UTF-7 and UTF-8 and back, but I don't think
Windows itself actually makes any real use of those encodings.

UTF-8 is smaller than UTF-16 in some cases, larger in others, and about the
same in still others; it largely depends on which code points dominate the
text. An appropriate national encoding will usually beat both of them if it
can represent the needed code points.

mainly $000000-$00007F: UTF-8 1 byte,  UTF-16 2 bytes, UTF-32 4 bytes.
mainly $000080-$0007FF: UTF-8 2 bytes, UTF-16 2 bytes, UTF-32 4 bytes.
mainly $000800-$00FFFF: UTF-8 3 bytes, UTF-16 2 bytes, UTF-32 4 bytes.
mainly $010000-$10FFFF: UTF-8 4 bytes, UTF-16 4 bytes, UTF-32 4 bytes.

The net result is that UTF-8 tends to win for largely Latin-script languages,
UTF-16 tends to win for largely ideographic languages, and they are about on
a par for everything else. UTF-32 nearly always loses to both (though it does
have a large spare codespace which can be used for special meanings internal
to the app).
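
To make the size comparison concrete, here is a rough Python sketch that
measures a few sample strings in all three encodings. The sample strings and
the helper name encoded_sizes are my own picks for illustration, and the
"-le" codec variants are used only so that no byte-order mark gets counted.

```python
# Sample strings chosen for illustration; each one is dominated by a
# different code point range.
SAMPLES = {
    "latin": "The quick brown fox",              # $000000-$00007F
    "greek": "αβγδε",                            # $000080-$0007FF
    "cjk": "漢字仮名",                            # $000800-$00FFFF
    "supplementary": "\U00010348\U0001D11E",     # $010000-$10FFFF
}


def encoded_sizes(text):
    """Return the (UTF-8, UTF-16, UTF-32) byte counts for text."""
    return tuple(
        len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")
    )


for name, text in SAMPLES.items():
    print(name, encoded_sizes(text))
```

Real text is rarely this pure (markup, digits and spaces are all ASCII), so
measured results will usually be less clear-cut than these samples.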

The main advantages of UTF-8 over UTF-16 are:
1: it is a superset of 7 bit ASCII.
2: it's not peppered with 0 bytes.
3: any character can ONLY be represented by one byte pattern, and that byte
pattern can ONLY represent that character (it can't be a part of another
character).
4: it's easy to resync a badly cut/joined stream (if you cut a UTF-16 stream
in the middle of a character, one of the pieces will be total garbage).
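
Point 4 can be sketched in a few lines of Python. The helper name
skip_to_boundary and the sample text are made up for illustration; the only
fact relied on is that UTF-8 continuation bytes always have the bit pattern
10xxxxxx, so a reader can skip them to find the next character start.

```python
def skip_to_boundary(data):
    """Drop leading continuation bytes (10xxxxxx) so data starts on a character."""
    i = 0
    while i < len(data) and (data[i] & 0xC0) == 0x80:
        i += 1
    return data[i:]


text = "naïve café"

# UTF-8: cut in the middle of the two-byte sequence for "ï".
tail8 = text.encode("utf-8")[3:]
recovered = skip_to_boundary(tail8).decode("utf-8")
print(recovered)  # the text after the damaged character is intact

# UTF-16: cut one byte in, so every following code unit is misaligned.
tail16 = text.encode("utf-16-le")[1:]
garbled = tail16.decode("utf-16-le", errors="replace")
print(repr(garbled))  # nothing recognisable survives
```

Only the one character that was actually cut is lost in the UTF-8 case. The
UTF-16 piece has no marker to realign on, so it stays wrong until something
outside the stream tells the reader where the code units start.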

The net result is that most code designed to deal with "ASCII with
extensions" can be fed UTF-8 and it will usually work fine or only require
minimal changes.
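
As a small illustration, here is a byte-oriented field splitter of the kind
written with only ASCII in mind. The function name split_fields and the
sample record are hypothetical. It works unchanged on UTF-8 input because the
byte for ";" never occurs inside a multi-byte sequence.

```python
def split_fields(line):
    """Split a byte string on ';' without knowing anything about its encoding."""
    return line.split(b";")


record = "Zoë;Łódź;東京"

fields = split_fields(record.encode("utf-8"))
decoded = [f.decode("utf-8") for f in fields]
print(decoded)

# C-style code that treats a 0 byte as end-of-string is also safe with UTF-8,
# but not with UTF-16, where every ASCII character carries a 0 byte.
print(b"\x00" in record.encode("utf-8"))      # False
print(b"\x00" in record.encode("utf-16-le"))  # True
```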

I still believe that the best way to handle ansistring<-->widestring
conversion is to use a fallback conversion (either 7 bit ASCII or
ISO-8859-1) by default and then provide units that override the conversion
with versions based on the local charset of the environment or a charset
specified by the application coder. Unfortunately, as I have said, whilst
there is an interface in place for overriding the conversion, it is currently
only usable where the local charset is single byte rather than mixed width.
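
The shape of what I am proposing, modelled very roughly in Python rather than
Pascal. Every name here (converters, install_charset, wide_to_ansi,
ansi_to_wide) is hypothetical and none of it is the actual RTL interface; it
only shows a default fallback pair that a separate unit can replace.

```python
import codecs

# Fallback used until something overrides it (assumed: ISO-8859-1).
FALLBACK = "iso-8859-1"

# The active conversion pair; an "override unit" replaces these entries.
converters = {
    "wide_to_ansi": lambda text: text.encode(FALLBACK, errors="replace"),
    "ansi_to_wide": lambda data: data.decode(FALLBACK),
}


def install_charset(name):
    """Replace the active conversion pair with one based on charset name."""
    codecs.lookup(name)  # raises LookupError for an unknown charset
    converters["wide_to_ansi"] = lambda text: text.encode(name, errors="replace")
    converters["ansi_to_wide"] = lambda data: data.decode(name, errors="replace")


def wide_to_ansi(text):
    return converters["wide_to_ansi"](text)


def ansi_to_wide(data):
    return converters["ansi_to_wide"](data)


print(wide_to_ansi("é"))   # one byte under the ISO-8859-1 fallback
install_charset("utf-8")   # what an override unit would do at startup
print(wide_to_ansi("é"))   # two bytes once a mixed width charset is installed
```

Because the hooks here take and return whole strings, a mixed width charset
is no harder to plug in than a single byte one.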