[fpc-pascal] Yet another thread on Unicode Strings

Wed Oct 4 17:06:01 CEST 2017

In our previous episode, Tony Whyman said:
> Unicode Character String handling is a question that keeps coming up on 
> the Free Pascal Mailing lists and, empirically, it is hard to avoid the 
> conclusion that there is something wrong with the way these character 
> string types are handled. Otherwise, why does this issue keep arising?

Because people have old code that is ascii, or handles unicode in a
different, ad-hoc manner. Moreover FPC/Lazarus is also still usable in an
ascii-only mode for old projects.

> The programmer is too often forced to be aware of how strings 
> are encoded and must make a choice as to which is the preferred 
> character encoding for their program. There then follows confusion over 
> how to make that choice.

To avoid confusion, make sure it is unicode. It doesn't matter that
much if it is utf16 or not.

> Is Delphi compatibility the goal? What 
> Languages must I support? If I want platform independence which is the 
> best encoding? Which encoding gives the best performance for my 
> algorithm? And so on.

> Another problem is that there is no character type for a Unicode 
> Character. The built-in type "WideChar" is only two bytes and cannot 
> hold a UTF-16 code point comprising two surrogate pairs. There is no 
> char type for a UTF-8 character and, while UCS4Char exists, the Lazarus 
> UTF-8 utilities use "cardinal" as the type for a code point (not exactly 
> strong typing).

Most code will simply use "string" to hold a character. Only special-purpose
code and code that really must be performant will do other things.
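
As an aside, the two-byte point in the quote is easy to check. The sketch
below is Python rather than FPC code, chosen only because it is quick to run;
the helper name utf16_units is made up for this illustration. It shows that a
code point outside the BMP takes two 16-bit code units (one surrogate pair) in
UTF-16, so it cannot fit in a single two-byte char, whereas a string holds it
without trouble.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

# U+00E9 (e with acute) lies inside the BMP: one 16-bit unit is enough.
bmp_units = utf16_units("\u00e9")

# U+1F600 lies outside the BMP: it needs a high and a low surrogate.
astral_units = utf16_units("\U0001F600")

print(len(bmp_units), [hex(u) for u in astral_units])
```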

> In order to stop all this confusion I believe that there has to be a 
> return to Pascal's original fundamental concept. That is the value of a 
> character type represents a character, while the encoding of the 
> character is platform dependent and a choice the compiler makes and not 
> the programmer. Likewise a character string is an array of characters 
> that can be indexed by character (not byte) number, from which 
> substrings can be selected and compared with other strings according to 
> the locale and the unicode standard collating sequence. Let the 
> programmer worry about the algorithm and the compiler worry about the 
> best implementation.
>
> I want to propose a new character type called "UniChar" - short for 
> Unicode Character, along with a new string type "UniString" and a new 
> collection "TUniStrings". I have presented my thoughts here in a 
> detailed paper
>
This doesn't work, and it seems you haven't read the backlog of unicode
related messages going all the way back to early 2009. What you suggest was
one of the null hypotheses back then, and we are now 8 years on.

Search for the unicode meanings of (1) glyph, (2) character, (3) codepoint,
and (4) denormalized strings.
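
To make the codepoint versus character distinction and the normalization
issue concrete, here is a minimal sketch. It is Python, not FPC code, and uses
only the standard unicodedata module; the variable names are just for
illustration. The same visible character is stored once as a single
precomposed code point and once as a base letter plus a combining accent, and
the two strings only compare equal after normalization.

```python
import unicodedata

composed = "\u00e9"     # e with acute as one precomposed code point
decomposed = "e\u0301"  # plain e followed by COMBINING ACUTE ACCENT

# Both render as the same character, but a plain comparison sees
# different code point sequences.
equal_raw = composed == decomposed

# After normalizing to the same form (NFC here) they compare equal.
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Going the other way, NFD splits the precomposed form into two code points.
nfd_length = len(unicodedata.normalize("NFD", composed))

print(len(composed), len(decomposed), equal_raw, equal_nfc, nfd_length)
```

So "index by character" already depends on which of these forms the string
happens to be in.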

Once you digest all that, you will find that you need to define the unichar
type very large, blowing up strings enormously, and that you then still have
to convert back to either utf16 or utf8 to communicate with nearly anything
(APIs, libraries etc).
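
A rough illustration of the size problem, again as a Python sketch rather than
FPC code, with the helper encoded_sizes invented for the example. It compares
the byte size of plain text in utf8, utf16 and a fixed four-bytes-per-codepoint
encoding, and then shows that even four bytes per element is not one element
per visible character: the family emoji sequence used here is drawn as a
single glyph on many systems but consists of five code points.

```python
def encoded_sizes(text):
    """Byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}

# Plain ASCII text quadruples in size with a fixed 4-byte element.
ascii_sizes = encoded_sizes("hello")

# MAN + ZWJ + WOMAN + ZWJ + GIRL: one emoji sequence, five code points.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
family_codepoints = len(family)
family_sizes = encoded_sizes(family)

print(ascii_sizes, family_codepoints, family_sizes)
```

A char type that could always hold one visible character would therefore have
to be larger than any single code point.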

Moreover it will just require yet another conversion and cause more confusion
with more competing systems. So the number of problems will only rise. And the
incompatibility with Delphi is still there, so it will create trouble ad
infinitum.

This argument is best summed up by this cartoon: https://xkcd.com/927/

In short, there is no substitute for actively learning what unicode is
about and living with it.

Some of the problems were summed up in the discussion back then:
http://www.stack.nl/~marcov/unicode.pdf

Note that in hindsight I don't think Florian's proposal was that bad, and
Florian was somewhat vindicated by Delphi's choice of a multi-encoding
ansistring type.

My new opinion is that, whatever the choice is, choosing differently from
Delphi (despite all its flaws, perceived OR real, it doesn't matter) was
wrong.