[fpc-pascal] String theory

Tue May 10 12:43:48 CEST 2016

My first thought over the "String Type" or "End of World" threads was 
that this was another "how many angels on a pinhead" type of discussion. 
However, having worked through it, I believe that there is an issue here 
and that Pascal could be improved by including (for string types) the 
code page as part of the string data itself rather than having to infer it.

As a programmer, I want the freedom to choose the appropriate character 
encoding for my application - or even to mix encodings in the same 
application.

- I would always choose UTF-8 for database columns as that is the best 
compromise between international support and compact encoding (and hope 
that my RDBMS was not so dumb as to allocate four times the max 
character width for every UTF-8 string).

- If I was doing a lot of CPU-intensive string processing with 
international support then UTF-16 is what I would want to use for 
internal representation - as long as the cost of UTF-8 to UTF-16 
conversion was justified when reading/writing to disk.

- On the other hand, if I am working on an in-house application that I 
know is always going to be working in English (or Western European 
languages) then use of a national character set (or more likely 
ISO 8859-1) seems the obvious choice.

Pascal does seem to support what I want. It has the unicodestring type 
for UTF-16 and the string type (with code page) for UTF-8 and national 
character sets. However, the problem is that Pascal (or FPC) permits an 
ambiguity between the use of UTF-8 and national character sets.
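
As a rough illustration of the types mentioned above, here is a minimal 
sketch assuming FPC 3.0 or later. The alias TWestEuroString is a 
hypothetical name introduced only for this example; UnicodeString, 
UTF8String and StringCodePage come from the System unit.

```pascal
program StringTypes;
{$mode objfpc}{$H+}

type
  // Hypothetical alias for illustration: an AnsiString whose declared
  // code page is Windows-1252.
  TWestEuroString = type AnsiString(1252);

var
  Wide: UnicodeString;    // UTF-16 code units
  Utf8: UTF8String;       // declared as AnsiString(CP_UTF8) in the RTL
  West: TWestEuroString;  // fixed 8-bit national code page
  Plain: String;          // with {$H+}: AnsiString, no explicit code page
begin
  // Pure ASCII text, so every encoding holds the same three characters.
  Wide := 'abc';
  Utf8 := 'abc';
  West := 'abc';
  Plain := 'abc';

  // Length counts code units: 16-bit units for UnicodeString,
  // bytes for the 8-bit string types.
  WriteLn(Length(Wide), ' ', Length(Utf8), ' ', Length(West), ' ',
    Length(Plain));

  // Ask the RTL which code page it has recorded for each value.
  WriteLn('Utf8: ', StringCodePage(Utf8));
  WriteLn('West: ', StringCodePage(West));
end.
```

With ASCII-only data all four variables report the same length, which 
is exactly why the difference between them is so easy to overlook.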

If your program is in English and your data is in English then UTF-8 and 
Ansistrings (or even different 8-bit code pages) look the same, and it 
is very easy to get sloppy, use the basic string type all over the place, 
and get very confused as to what your string code page really is. The 
whole thing then just falls apart when you try to internationalise it.

I would argue that this problem would be avoided if the code page was 
part of the string data (just as the byte count is already). Strings 
defined without an explicit code page could then have a string with any 
code page assigned to them, while strings with an explicit code page as 
part of their type could only be assigned a string of that code page 
(perhaps with automatic conversion on assignment from another code 
page). Also, byte length and character length could then be returned by 
standard routines.
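
To make the byte length versus character length distinction concrete, 
here is a minimal sketch for UTF-8 data. Utf8CodePointCount is a 
hypothetical helper written for this example, not an RTL routine. It 
assumes valid UTF-8 and counts code points, which is not always the same 
as the number of characters a user perceives (combining marks, for 
instance).

```pascal
program Utf8Lengths;
{$mode objfpc}{$H+}

// Hypothetical helper: counts code points in UTF-8 data by skipping
// continuation bytes (those of the form 10xxxxxx). Assumes valid UTF-8.
function Utf8CodePointCount(const S: RawByteString): SizeInt;
var
  I: SizeInt;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  S: RawByteString;
begin
  // Build "cafe" with an acute accent on the final letter, byte by byte,
  // so the example does not depend on the encoding of this source file.
  // U+00E9 is encoded as the two bytes C3 A9 in UTF-8.
  SetLength(S, 5);
  S[1] := 'c';
  S[2] := 'a';
  S[3] := 'f';
  S[4] := #$C3;
  S[5] := #$A9;

  WriteLn('Bytes:       ', Length(S));
  WriteLn('Code points: ', Utf8CodePointCount(S));
end.
```

For this input the program should report 5 bytes and 4 code points. A 
routine like this only gives a sensible answer if the caller already 
knows the data is UTF-8, which is the information I would like the 
string to carry with it.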

This is in contrast to the current situation where strings without an 
explicit code page setting are simply assumed to use the 
DefaultSystemCodePage with limited run time checking (often none).
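
For reference, a minimal sketch of how the current behaviour can be 
inspected, assuming FPC 3.0 or later where StringCodePage, SetCodePage 
and DefaultSystemCodePage are available in the System unit. The values 
printed depend on the platform and its settings, so no expected output 
is given.

```pascal
program DefaultCodePageAssumption;
{$mode objfpc}{$H+}

var
  Plain: String;           // no explicit code page in the declaration
  Labeled: RawByteString;  // accepts any code page without conversion
begin
  Plain := 'abc';

  // The code page the RTL has recorded for this particular value...
  WriteLn('Recorded code page:    ', StringCodePage(Plain));
  // ...and the code page that such strings are assumed to use.
  WriteLn('DefaultSystemCodePage: ', DefaultSystemCodePage);

  // Relabel a copy of the same bytes as UTF-8. Passing False asks
  // SetCodePage not to convert the data, so nothing verifies that
  // the bytes really are valid UTF-8.
  Labeled := Plain;
  SetCodePage(Labeled, CP_UTF8, False);
  WriteLn('After SetCodePage:     ', StringCodePage(Labeled));
end.
```

The unchecked relabelling in the last step is the kind of thing that 
works for English text and then goes wrong once non-ASCII data arrives.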

Indeed, if the code page was part of the string data, then the "string" 
type should be able to unify both wide strings and ansistrings.