[fpc-pascal] Yet another thread on Unicode Strings
Tony Whyman
tony.whyman at mccallumwhyman.com
Wed Oct 4 14:10:02 CEST 2017
Unicode Character String handling is a question that keeps coming up on
the Free Pascal Mailing lists and, empirically, it is hard to avoid the
conclusion that there is something wrong with the way these character
string types are handled. Otherwise, why does this issue keep arising?
Supporters of the current implementation point to the rich set of
functions available to handle both UTF-8 and UTF-16 in addition to
legacy ANSI code pages. That is true, but it may also be the problem.
The programmer is too often forced to be aware of how strings
are encoded and must make a choice as to which is the preferred
character encoding for their program. There then follows confusion over
how to make that choice. Is Delphi compatibility the goal? What
languages must I support? If I want platform independence, which is the
best encoding? Which encoding gives the best performance for my
algorithm? And so on.
Another problem is that there is no character type for a Unicode
Character. The built-in type “WideChar” is only two bytes and cannot
hold a code point outside the Basic Multilingual Plane, which UTF-16
encodes as a surrogate pair (two 16-bit code units). There is no
char type for a UTF-8 character and, while UCS4Char exists, the Lazarus
UTF-8 utilities use “cardinal” as the type for a code point (not exactly
strong typing).
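
To make the size problem concrete, here is a minimal sketch that counts
how many code units one character needs in each encoding. It is written
in Python purely as a neutral illustration, not as Free Pascal code, and
the variable names are my own. The character chosen (U+1F600) is just
one example of a code point outside the Basic Multilingual Plane.

```python
# Illustration only: one code point, counted three different ways.
ch = "\U0001F600"  # a single code point outside the Basic Multilingual Plane

utf16_units = len(ch.encode("utf-16-le")) // 2  # number of 16-bit code units
utf8_bytes = len(ch.encode("utf-8"))            # number of 8-bit code units

print(len(ch), utf16_units, utf8_bytes)  # prints: 1 2 4
```

A two-byte character type can hold only one of the two UTF-16 code
units, so it cannot represent this character on its own.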
In order to stop all this confusion I believe that there has to be a
return to Pascal's original fundamental concept. That is, the value of a
character type represents a character, while the encoding of the
character is platform dependent and a choice the compiler makes, not
the programmer. Likewise, a character string is an array of characters
that can be indexed by character (not byte) number, from which
substrings can be selected and compared with other strings according to
the locale and the Unicode standard collating sequence. Let the
programmer worry about the algorithm and the compiler worry about the
best implementation.
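
The difference between indexing by character and indexing by byte can
be sketched as follows. Again this is Python used only for illustration,
not the proposed UniString, and it treats one code point as one
character, which is a simplification (combining marks are ignored).

```python
# Illustration only: the same text indexed by code point and by byte.
s = "na\u00efve"           # "naive" with i-diaeresis: 5 code points
raw = s.encode("utf-8")    # 6 bytes, because U+00EF takes two bytes in UTF-8

by_char = s[2]             # the whole character U+00EF
by_byte = raw[2:3]         # b'\xc3', only the first byte of that character

print(len(s), len(raw), ascii(by_char), by_byte)
```

Indexing the byte sequence lands in the middle of a character, which is
exactly the detail a programmer should not have to think about.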
I want to propose a new character type called “UniChar” (short for
Unicode Character), along with a new string type “UniString” and a new
collection “TUniStrings”. I have presented my thoughts in a detailed
paper; see https://mwasoftware.co.uk/docs/unistringproposal.pdf
This is intended to be a fully worked proposal and I have circulated it
to provoke discussion and in the hope that it may be useful.
The intent is to create a character and string handling design that is
natural to use, with the programmer rarely, if ever, having to think about
the character or string encoding. They are dealing with Unicode
Characters and strings of Unicode Characters and that is all. When
necessary, transliteration happens naturally and as a consequence of
string concatenation, input/output, or in the rare cases when
performance demands a specific character encoding.
There is also a strong desire to avoid creating more choice and hence
more confusion. The intent is to “embrace and replace”. Both AnsiString
and UnicodeString should be seen as subsets or special cases of the
proposed UniString, while concrete types such as AnsiChar, WideChar
and WideString would, other than for legacy reasons, exist primarily to
define external interfaces.
Tony Whyman
MWA Software