[fpc-pascal] FPC 3 regression: cannot use TStringList for UTF-8 data any more?
Jonas Maebe
jonas.maebe at elis.ugent.be
Wed Apr 20 11:26:19 CEST 2016
Michael Schnell wrote on Tue, 19 Apr 2016:
> On 04/19/2016 08:22 AM, Jonas Maebe wrote:
>> When any {$codepage xxx} directive is specified, string constants
>> in the source are represented in a way that makes lossless
>> conversion to any other code page possible. This conversion to the
>> target code page is performed at compile time where possible (when
>> the target code page cannot change at run time), and otherwise at
>> run time.
>>
> Of course I do understand that.
>
> But anyway, AFAIK, UTF8 already is a way of lossless coding, so I
> don't see a forcing necessity to convert that to UTF16 at compile
> time. And as far as I understand, if the user does not take some
> means, the executable will work with 8 bit coding and very likely
> with UTF8, so holding the constants as UTF16 increases as well
> memory as CPU resource usage.
The reasons are
a) the FPC compiler binary itself prior to 3.x did not contain any
UTF-8 encoding support. All it could do was convert the source file
code page to UTF-16.
b) FPC's widestring manager does not contain any interface to directly
convert from one 8 bit encoding to another, only from 16 bit to 8 bit
and vice versa (which also made it useless to convert to UTF-8 at
compile time, since there was no way to convert it to another code
page at run time except by making a round trip via UTF-16 anyway). The
reason is that these helpers were already necessary to convert between
widestring/unicodestring and other types when assigning such variables
to each other.
c) changing b) would require a lot of testing because not all code
page conversion libraries/OS interfaces support converting from any
arbitrary character set to any other arbitrary character set. While it
is likely that they all support converting from arbitrary code pages
to UTF-8 and back (like they do for UTF-16), this would still have to
be tested, and additionally such an interface would undoubtedly also
start to get used for other code pages by people unaware of this
limitation. Adding an interface limited to converting from/to UTF-8
would be another option to address that though.
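To illustrate the round trip mentioned in b), here is a minimal sketch of converting from one 8 bit encoding to another by going through UTF-16. The program name, the Latin1String type and the variable names are made up for this example, and the use of cwstring on Unix is an assumption about which widestring manager gets installed.

```pascal
program roundtrip;
{$mode objfpc}{$H+}
{$codepage utf8}   // assumes this source file is saved as UTF-8

{$ifdef unix}
uses
  cwstring;        // assumption: install a widestring manager that knows
                   // more code pages than the built-in default one
{$endif}

type
  // Hypothetical helper type: an ansistring fixed to ISO-8859-1.
  Latin1String = type AnsiString(28591);

var
  src: UTF8String;
  mid: UnicodeString;
  dst: Latin1String;
begin
  src := 'café';
  mid := UnicodeString(src);   // step 1: 8 bit (UTF-8) -> UTF-16
  dst := Latin1String(mid);    // step 2: UTF-16 -> 8 bit (ISO-8859-1)
  writeln(Length(src), ' ', Length(mid), ' ', Length(dst));
end.
```

Writing "dst := src" directly would look like a single step in the source, but as described above the conversion would still pass through UTF-16 internally.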
In the end it would be a lot of work, result in a lot of extra code
that may not work everywhere (or in specialised routines), and it
would be for a use case you can already address yourself if you
absolutely want to be completely UTF-8-centric: you can declare your
string variables as UTF8String, since then the conversion to UTF-8 for
constant strings will happen at compile time. If your
DefaultSystemCodePage is CP_UTF8, no extra conversion will happen when
assigning/converting these variables to regular ansistrings.
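A minimal sketch of that approach, assuming the source file is saved as UTF-8 and that the program itself switches the default code page with SetMultiByteConversionCodePage (the program and variable names are only for illustration):

```pascal
program utf8const;
{$mode objfpc}{$H+}
{$codepage utf8}   // assumes this source file is saved as UTF-8

var
  u: UTF8String;
  a: AnsiString;
begin
  // The target type has a fixed code page, so the constant can be
  // converted to UTF-8 at compile time.
  u := 'Grüße';

  // Assumption for this sketch: make UTF-8 the default ansistring
  // code page, so that DefaultSystemCodePage = CP_UTF8.
  SetMultiByteConversionCodePage(CP_UTF8);

  // Both sides are UTF-8 now, so no extra conversion should happen here.
  a := u;

  writeln('code page of u: ', StringCodePage(u));
  writeln('byte length of a: ', Length(a));
end.
```

If the default code page is something other than UTF-8, the assignment to the plain ansistring would trigger a run time conversion again.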
Furthermore, converting from UTF-8 to other code pages is probably
slower than from UTF-16, since UTF-8 is a more complex encoding for
most characters.
The fact that other things are less convenient if you use UTF8String
is the price you will have to pay for such code specialisation (which
probably won't make any noticeable difference in 99.999% of the cases
anyway).
Jonas