[fpc-pascal] Yet another thread on Unicode Strings

Wed Oct 4 14:52:08 CEST 2017

On Wed, 4 Oct 2017 13:10:02 +0100
Tony Whyman <tony.whyman at mccallumwhyman.com> wrote:

> Unicode Character String handling is a question that keeps coming up on 
> the Free Pascal Mailing lists and, empirically, it is hard to avoid the 
> conclusion that there is something wrong with the way these character 
> string types are handled. Otherwise, why does this issue keep arising?

Mixing string types, mixing encodings, mixing legacy code, confusing
UCS-2 with UTF-16, ...

>[...]
> Another problem is that there is no character type for a Unicode 
> Character.

I'm curious: What languages have such a type?

> The built-in type “WideChar” is only two bytes and cannot 
> hold a UTF-16 code point comprising two surrogate pairs. There is no 
> char type for a UTF-8 character and, while UCS4Char exists, the Lazarus 
> UTF-8 utilities use “cardinal” as the type for a code point (not exactly 
> strong typing).

Should be remedied.
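
To illustrate the quoted point about WideChar: a code point outside the
Basic Multilingual Plane takes two 16-bit code units (one surrogate
pair) in UTF-16, so it cannot fit in a single two-byte character type.
A minimal sketch in Python (chosen only because it is quick to run; it
is not FPC code, and the helper name utf16_units is made up for this
example):

```python
import struct

def utf16_units(ch):
    """Return the UTF-16 code units of a string as a list of integers."""
    data = ch.encode("utf-16-le")              # little-endian, no BOM
    count = len(data) // 2                     # two bytes per code unit
    return list(struct.unpack("<%dH" % count, data))

bmp = utf16_units("\u00e9")          # U+00E9, inside the BMP
astral = utf16_units("\U0001F600")   # U+1F600, outside the BMP

print([hex(u) for u in bmp])         # one code unit
print([hex(u) for u in astral])      # two code units: a surrogate pair
```

The second call yields a high surrogate followed by a low surrogate,
which is why a 16-bit character type alone cannot represent every
code point.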

>[...]
> Let the programmer worry about the algorithm and the compiler worry about the
> best implementation.

A UTF-32 string type is seldom the best choice for memory
and/or speed.
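
As a rough illustration of the memory side only (it says nothing about
speed), the sketch below compares the encoded size of a few sample
strings. It is written in Python purely for convenience; the sample
strings and the helper name encoded_sizes are assumptions made for this
example and have nothing to do with any FPC RTL.

```python
def encoded_sizes(text):
    """Byte length of text in UTF-8, UTF-16 and UTF-32 (all without BOM)."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return tuple(len(text.encode(enc)) for enc in encodings)

samples = {
    "ascii": "Free Pascal",            # 11 ASCII characters
    "latin": "Gr\u00f6\u00dfe",        # 5 characters, 2 of them non-ASCII
    "cjk": "\u6587\u5b57\u5217",       # 3 CJK characters
}

for name, text in samples.items():
    utf8, utf16, utf32 = encoded_sizes(text)
    print(name, "utf-8:", utf8, "utf-16:", utf16, "utf-32:", utf32)
```

For these samples UTF-32 is the largest of the three encodings in every
case: four bytes per code point regardless of content.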

>[...]
> I want to propose a new character type called “UniChar” - short for 
> Unicode Character, along with a new string type “UniString” and a new 
> collection “TUniStrings”. I have presented my thoughts here in a 
> detailed paper
> 
> see https://mwasoftware.co.uk/docs/unistringproposal.pdf
> 
> This is intended to be a fully worked proposal and I have circulated it 
> to provoke discussion and in the hope that it may be useful.

Adding another string type without disabling some old string types will
increase the confusion. Please provide a proposal for disabling old
string types.

Also keep in mind that there is still no UTF-16 RTL, even though
many people need one for Delphi compatibility. Starting yet another
RTL, this time for UTF-32, would need some seriously dedicated programmers.

Mattias