[fpc-pascal] FPC 3 regression: cannot use TStringList for UTF-8 data any more?

Wed Apr 20 11:26:19 CEST 2016

Michael Schnell wrote on Tue, 19 Apr 2016:

> On 04/19/2016 08:22 AM, Jonas Maebe wrote:
>> When any {$codepage xxx} directive is specified, string constants  
>> in the source are represented in a way that makes lossless  
>> conversion to any other code page possible. This conversion to the  
>> target code page is performed at compile time where possible (when  
>> the target code page cannot change at run time), and otherwise at  
>> run time.
>>
> Of course I do understand that.
>
> But anyway, AFAIK, UTF8 already is a way of lossless coding, so I  
> don't see a forcing necessity to convert that to UTF16 at compile  
> time. And as far as I understand, if the user does not take some  
> means, the executable will work with 8 bit coding and very likely  
> with UTF8, so holding the constants as UTF16 increases as well  
> memory as CPU resource usage.

The reasons are:
a) the FPC compiler binary itself prior to 3.x did not contain any  
UTF-8 encoding support. All it could do was convert the source file  
code page to UTF-16.
b) FPC's widestring manager does not contain any interface to directly  
convert from one 8 bit encoding to another, only from 16 bit to 8 bit  
and vice versa (which also made it useless to convert to UTF-8 at  
compile time, since there was no way to convert it to another code  
page at run time except by making a round trip via UTF-16 anyway). The  
reason is that these helpers were already necessary to convert between  
widestring/unicodestring and other types when assigning such variables  
to each other.
c) changing b) would require a lot of testing because not all code  
page conversion libraries/OS interfaces support converting from any  
arbitrary character set to any other arbitrary character set. While it  
is likely that they all support converting from arbitrary code pages  
to UTF-8 and back (like they do for UTF-16), this would still have to  
be tested, and additionally such an interface would undoubtedly also  
start to get used for other code pages by people unaware of this  
limitation. Adding an interface limited to converting from/to UTF-8  
would be another option to address that though.
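
To illustrate point b), here is a minimal sketch of the round trip via
UTF-16 written out by hand. The program name, the TLatin1String type and
the choice of code page 28591 (ISO-8859-1) are only illustrative
assumptions; the sketch also assumes the source file is saved as UTF-8
and that a widestring manager is installed (cwstring on Unix).

```pascal
program roundtrip;  // hypothetical name, for illustration only

{$mode objfpc}{$H+}
{$codepage utf8}    // assumption: this source file is saved as UTF-8

uses
  {$ifdef unix}cwstring,{$endif}  // installs a widestring manager on Unix
  SysUtils;

type
  // Illustrative 8 bit string type tagged with code page 28591 (ISO-8859-1).
  TLatin1String = type AnsiString(28591);

var
  u: UTF8String;
  w: UnicodeString;
  l: TLatin1String;
begin
  u := 'é';               // constant converted to UTF-8 at compile time
  w := UnicodeString(u);  // 8 bit -> 16 bit
  l := TLatin1String(w);  // 16 bit -> 8 bit, in the target code page
  WriteLn('UTF-8 bytes: ', Length(u));
  WriteLn('UTF-16 code units: ', Length(w));
  WriteLn('Latin-1 bytes: ', Length(l));
end.
```

Assigning u to l directly should give the same result; the intermediate
UnicodeString only makes the two conversion steps visible.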

In the end it would be a lot of work, result in a lot of extra code  
that may not work everywhere (or in specialised routines), and it  
would be for a use case you can already address yourself if you  
absolutely want to be completely UTF-8-centric: you can declare your  
string variables as UTF8String, since then the conversion to UTF-8 for  
constant strings will happen at compile time. If your  
DefaultSystemCodePage is CP_UTF8, no extra conversion will happen when  
assigning/converting these variables to regular ansistrings.  
Furthermore, converting from UTF-8 to other code pages is probably  
slower than from UTF-16, since UTF-8 is a more complex encoding for  
most characters.
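
A minimal sketch of that UTF8String approach follows. The program and
variable names are made up for illustration, and it assumes the source
file really is saved as UTF-8. SetMultiByteConversionCodePage is used
here as one way to make DefaultSystemCodePage equal to CP_UTF8; your
program may already arrange that by other means.

```pascal
program utf8const;  // hypothetical name, for illustration only

{$mode objfpc}{$H+}
{$codepage utf8}    // assumption: this source file is saved as UTF-8

var
  u: UTF8String;
  a: AnsiString;
begin
  // Make DefaultSystemCodePage CP_UTF8, as described above.
  SetMultiByteConversionCodePage(CP_UTF8);

  // The target code page of a UTF8String is fixed, so the constant can
  // be converted to UTF-8 at compile time.
  u := 'héllo';

  // With DefaultSystemCodePage = CP_UTF8, this assignment should not
  // need any code page conversion.
  a := u;

  // StringCodePage reports the code page tag stored with the string data.
  WriteLn('code page of u: ', StringCodePage(u));
  WriteLn('code page of a: ', StringCodePage(a));
  WriteLn('length of u in bytes: ', Length(u));
end.
```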

The fact that other things are less convenient if you use UTF8String  
is the price you will have to pay for such code specialisation (which  
probably won't make any noticeable difference in 99.999% of the cases  
anyway).

Jonas