[fpc-pascal] String theory
Tony Whyman
tony.whyman at mccallumwhyman.com
Tue May 10 12:43:48 CEST 2016
My first thought on the "String Type" and "End of World" threads was
that this was another "how many angels on a pinhead" type of discussion.
However, having worked through it, I believe that there is an issue here
and Pascal could be improved by including (for string types) the code
page as part of the string data itself rather than having to infer it.
As a programmer, I want the freedom to choose the appropriate
character encoding for my application - or even to mix encodings in the
same application.
- I would always choose UTF-8 for database columns as that is the best
compromise between international support and compact encoding (and hope
that my RDBMS was not so dumb as to allocate four times the max
character width for every UTF-8 string).
- If I were doing a lot of CPU-intensive string processing with
international support, then UTF-16 is what I would want to use for
internal representation - as long as the cost of UTF-8 to UTF-16
conversion was justified when reading/writing to disk.
- On the other hand, if I am working on an in-house application that I
know is always going to be working in English (or Western European
languages) then use of a national character set (or more likely
ISO 8859-1) seems the obvious choice.
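
To put rough numbers on the size trade-off between these three choices,
here is a minimal sketch that encodes the same text in each encoding and
counts the bytes. Python is used only for illustration; encoded_sizes()
and the sample strings are assumptions, not anything from FPC.

```python
# Compare the storage size of the same text in three encodings.
# encoded_sizes() and the sample strings are illustrative assumptions.

def encoded_sizes(text):
    """Return the byte count of text in each encoding (None if unrepresentable)."""
    sizes = {}
    for name in ("utf-8", "utf-16-le", "latin-1"):
        try:
            sizes[name] = len(text.encode(name))
        except UnicodeEncodeError:
            # The text contains characters this code page cannot represent.
            sizes[name] = None
    return sizes

english = "Hello"
german = "Gr\u00f6\u00dfe"                      # o-umlaut and sharp s
greek = "\u0395\u03bb\u03bb\u03ac\u03b4\u03b1"  # six Greek letters

for label, sample in (("English", english), ("German", german), ("Greek", greek)):
    print(label, encoded_sizes(sample))
```

For the English sample UTF-8 and ISO 8859-1 need 5 bytes and UTF-16 needs
10; for the German sample UTF-8 grows to 7 bytes while the other two are
unchanged; the Greek sample cannot be stored in ISO 8859-1 at all.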
Pascal does seem to support what I want. It has the unicodestring type
for UTF-16 and the string type (with code page) for UTF-8 and national
character sets. However, the problem is that Pascal (or FPC) permits an
ambiguity between the use of UTF-8 and national character sets.
If your program is in English and your data is in English then UTF-8 and
Ansistrings (or even different 8-bit code pages) look the same, and it is
very easy to get sloppy, use the basic string type all over the place,
and get very confused as to what your string code page really is. The
whole thing then just falls apart when you try to internationalise it.
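
This failure mode is easy to reproduce. The sketch below reads the same
bytes once as UTF-8 and once as ISO 8859-1. Python is used only for
illustration; read_as() and the sample words are hypothetical.

```python
# Why confusing UTF-8 with an 8-bit code page goes unnoticed with English data.
# read_as() and the sample words are hypothetical, for illustration only.

def read_as(raw, assumed_encoding):
    """Interpret raw bytes under an assumed code page."""
    return raw.decode(assumed_encoding)

# Pure ASCII: both interpretations agree, so the mistake is invisible.
ascii_bytes = "colour".encode("utf-8")
same = read_as(ascii_bytes, "utf-8") == read_as(ascii_bytes, "latin-1")

# Non-ASCII: the two UTF-8 bytes of e-acute are misread as two characters.
utf8_bytes = "caf\u00e9".encode("utf-8")
garbled = read_as(utf8_bytes, "latin-1")

print(same)          # True
print(len(garbled))  # 5, one more than the 4 characters that were written
```

With ASCII-only data nothing reveals which code page was assumed; the
first accented character turns into two wrong ones.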
I would argue that this problem would be avoided if the code page was
part of the string data (just as the byte count is already). Strings
defined without an explicit code page could then have a string with any
code page assigned to them, while strings with an explicit code page as
part of their type could only be assigned a string of that code page
(perhaps with automatic conversion on assignment from another code
page). Also, byte length and character length could then be returned by
standard routines.
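
A rough model of this idea, assuming a hypothetical TaggedString that
stores the code page next to the bytes. It is a Python sketch for
illustration only, not FPC's string layout or any existing API; every
name in it is invented.

```python
# Model of a string that carries its code page as part of its data.
# TaggedString, assign_to() and the field names are all hypothetical.

class TaggedString:
    def __init__(self, data, code_page):
        self.data = data            # raw bytes
        self.code_page = code_page  # e.g. "utf-8", "latin-1", "utf-16-le"

    @classmethod
    def from_text(cls, text, code_page):
        return cls(text.encode(code_page), code_page)

    def byte_length(self):
        return len(self.data)

    def char_length(self):
        # Counts Unicode code points, found by decoding with the stored code page.
        return len(self.data.decode(self.code_page))

    def assign_to(self, target_code_page=None):
        """Model assignment to a variable.

        None stands for a variable with no explicit code page: it accepts
        the string unchanged. Otherwise the bytes are converted on assignment.
        """
        if target_code_page is None or target_code_page == self.code_page:
            return TaggedString(self.data, self.code_page)
        text = self.data.decode(self.code_page)
        return TaggedString(text.encode(target_code_page), target_code_page)

s = TaggedString.from_text("na\u00efve", "utf-8")
anything = s.assign_to()        # untyped target keeps UTF-8
latin = s.assign_to("latin-1")  # typed target converts

print(s.byte_length(), s.char_length())          # 6 5
print(latin.byte_length(), latin.char_length())  # 5 5
```

Because the code page travels with the data, byte length and character
length can both be answered without guessing, and conversion happens only
where the target type asks for a specific code page.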
This is in contrast to the current situation where strings without an
explicit code page setting are simply assumed to use the
DefaultSystemCodePage with limited run time checking (often none).
Indeed, if the code page was part of the string data, then the "string"
type should be able to unify both wide strings and ansistrings.