[fpc-devel] String and UnicodeString and UTF8String

Mon Jan 10 15:31:49 CET 2011

On 10 Jan 2011, at 13:57, Marco van de Voort wrote:

> In our previous episode, Jonas Maebe said:
>>>>
>>>> If/when this is done, it will only be with a compiler switch or
>>>> directive.
>>>
>>> That won't be enough, since that would not change the relevant units
>>> and classes to such type. (e.g. tstringlist would remain defined
>>> ansistring)
>>
>> If it's a D2009-style ansistring, does that matter?
>
> A lot of conversion, since it will use ansistring(0) so reading/writing
> ansistring(cp_utf8) will force conversions. (0 means system encoding,
> $FFFF means never convert)

Why should a tstringlist force ansistring(0)? Or does Delphi force it  
to be that way?

Conversion may indeed be required for output (input would only pass on  
the encoding of the input if based on ansistring($ffff)), but I think  
doing that only when necessary at the lowest level should be no  
problem. Many existing frameworks work that way.

> Besides that the usual three problems:
>
> - I don't know how VAR behaves in this case. (passing a
>   ansistring(cp_utf8) to a "var ansistring(0)" parameter),

var-parameters may indeed pose a problem in case some parameters of
OS-neutral routines are required to have a particular encoding specified.

> - maybe overloading (only cornercases?) etc.

Possibly, although I guess there are probably rules for that (whether
they are documented is another matter though, probably...)

> - inheritance. FPC defines base classes as ansistring(0) parameters, and
>   Lazarus wants to inherit and override them with a different type.
>   This will clash.

Why ansistring(0) for base classes? OS-level interfaces: yes, but why  
base classes?

> I've thought long and hard about this. Since the discussion what the
> dominant type should be won't stop anytime soon, and we probably will
> have to support both UTF8 (*nix) and UTF16 (Windows and *nix/QT) as
> basetypes in the long run, plus a time ANSI as legacy, the RTL has to
> be prepared for it anyway, we might as well allow this on all platforms
> from the start.
> (actually releasing them is a different question and depends on
> manpower)

I agree that the RTL should work regardless of the used string  
encoding, but I don't see why a particular encoding should be enforced  
throughout the entire RTL rather than just using ansistring($ffff)  
almost everywhere.

I also agree that we should strive to minimize the number of
conversions in the RTL for some encodings (in particular indeed ansi,
utf-8 and utf-16), but again this should not require a specially
compiled RTL. E.g., insert(ansistring($ffff)),
delete(ansistring($ffff)), etc. can dispatch to special-purpose
versions for certain specific encodings of the input (e.g., for the
three you mentioned), and perform a round trip via some generic format
(utf-16, utf-32, or something that depends on the platform) only if the
encoding is not directly supported or if different encodings are mixed.

This has the advantage that you always have all optimal  
implementations available, regardless of the platform or default  
string encoding. It does not require extra work because we have to  
write all those versions also if we want the RTL to be compilable for  
different default string encodings. And three checks in a case  
statement are not going to define the performance in a context of  
atomic reference counting, dynamic memory management and the  
occasional code page conversion (and since this may reduce the number  
of code page conversions when working with "non-native" strings, it  
can also be a performance win).
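
To make the dispatch idea concrete, here is a minimal sketch. It assumes
a compiler with code-page-aware ansistrings in which RawByteString plays
the role of ansistring($ffff) and StringCodePage() returns the dynamic
encoding of a string. CodePointCount, CountUtf8, CountViaUtf16 and
Utf8FromBytes are hypothetical names made up for this illustration, not
existing RTL routines, and only the utf-8 case gets a special-purpose
version here.

```pascal
program EncodingDispatchSketch;
{$mode objfpc}{$H+}

{ All routine names below are hypothetical; this is not RTL code. }

{ Special-purpose version for utf-8: count every byte that is not a
  continuation byte (10xxxxxx). Assumes well-formed utf-8 input. }
function CountUtf8(const s: RawByteString): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  for i := 1 to Length(s) do
    if (Ord(s[i]) and $C0) <> $80 then
      Inc(Result);
end;

{ Generic fallback: round trip via utf-16 and count every code unit
  except low surrogates (second half of a surrogate pair). }
function CountViaUtf16(const s: RawByteString): SizeInt;
var
  u: UnicodeString;
  i: SizeInt;
begin
  u := UnicodeString(s);
  Result := 0;
  for i := 1 to Length(u) do
    if (Ord(u[i]) and $FC00) <> $DC00 then
      Inc(Result);
end;

{ Entry point: accepts any encoding and dispatches on the dynamic code
  page. Further special cases would be added as extra case labels. }
function CodePointCount(const s: RawByteString): SizeInt;
begin
  case StringCodePage(s) of
    CP_UTF8: Result := CountUtf8(s);
  else
    Result := CountViaUtf16(s);
  end;
end;

{ Test helper: build a string from raw bytes and label it as utf-8
  without converting the contents. }
function Utf8FromBytes(const b: array of Byte): RawByteString;
var
  i: SizeInt;
begin
  Result := '';
  SetLength(Result, Length(b));
  for i := 0 to High(b) do
    Result[i + 1] := AnsiChar(b[i]);
  SetCodePage(Result, CP_UTF8, False);
end;

begin
  { 'cafe' with an acute accent on the e, encoded as utf-8 (5 bytes) }
  WriteLn(CodePointCount(Utf8FromBytes([$63, $61, $66, $C3, $A9])));
end.
```

The caller only ever sees CodePointCount; which implementation runs is
decided by a single case statement per call, and the generic round trip
is only paid for encodings without a dedicated version.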

Outside the RTL, the encoding mainly matters if you perform manual
low-level processing of a string (for i:=1 to length(s) do
something_with(s[i])). But in that case your code will either work
with only one encoding, and you have to enforce it via the parameter
type anyway, or it has to work with multiple encodings, and then you
can use a technique similar to what I described above for the RTL.

> That doesn't mean that a per unit switch is useless, but I think a
> target switch to fixate the bulk of the cases will save both us and the
> users a lot of grief.

It's not really clear to me which problem this would solve, but I may  
be missing something.

Jonas