[fpc-devel] String and UnicodeString and UTF8String
Jonas Maebe
jonas.maebe at elis.ugent.be
Mon Jan 10 15:31:49 CET 2011
On 10 Jan 2011, at 13:57, Marco van de Voort wrote:
> In our previous episode, Jonas Maebe said:
>>>>
>>>> If/when this is done, it will only be with a compiler switch or
>>>> directive.
>>>
>>> That won't be enough, since that would not change the relevant
>>> units and classes to such type. (e.g. tstringlist would remain
>>> defined ansistring)
>>
>> If it's a D2009-style ansistring, does that matter?
>
> A lot of conversion, since it will use ansistring(0) so
> reading/writing ansistring(cp_utf8) will force conversions.
> (0 means system encoding, $FFFF means never convert)
Why should a tstringlist force ansistring(0)? Or does Delphi force it
to be that way?
Conversion may indeed be required for output (input would only pass on
the encoding of the input if based on ansistring($ffff)), but I think
doing that only when necessary at the lowest level should be no
problem. Many existing frameworks work that way.
> Besides that the usual three problems:
>
> - I don't know how VAR behaves in this case. (passing a
> ansistring(cp_utf8) to a "var ansistring(0)" parameter),
var-parameters may indeed pose a problem in case some parameters of
OS-neutral routines are required to have a particular encoding specified.
> - maybe overloading (only cornercases?) etc.
Possibly, although I guess there are probably rules for that (whether
they are documented is another matter though, probably...)
> - inheritance. FPC defines base classes as ansistring(0) parameters,
> and Lazarus wants to inherit and override them with a different type.
> This will clash.
Why ansistring(0) for base classes? OS-level interfaces: yes, but why
base classes?
> I've thought long and hard about this. Since the discussion what the
> dominant type should be won't stop anytime soon, and we probably will
> have to support both UTF8 (*nix) and UTF16 (Windows and *nix/QT) as
> basetypes in the long run, plus a time ANSI as legacy, the RTL has to
> be prepared for it anyway, we might as well allow this on all
> platforms from the start. (actually releasing them is a different
> question and depends on manpower)
I agree that the RTL should work regardless of the used string
encoding, but I don't see why a particular encoding should be enforced
throughout the entire RTL rather than just using ansistring($ffff)
almost everywhere.
I also agree that we should strive to minimize the number of
conversions in the RTL for some encodings (in particular indeed ansi,
utf-8 and utf-16), but again this should not require a specially
compiled RTL. E.g., insert(ansistring($ffff)),
delete(ansistring($ffff)), etc. can call special-purpose versions
for certain specific encodings of the input (e.g., for the three you
mentioned), and perform a round trip via some generic format (utf-16,
utf-32, or something that depends on the platform) only if the
encoding is not directly supported or if different encodings are
mixed.
This has the advantage that you always have all optimal
implementations available, regardless of the platform or default
string encoding. It does not require extra work, because we have to
write all those versions anyway if we want the RTL to be compilable for
different default string encodings. And three checks in a case
statement are not going to define the performance in a context of
atomic reference counting, dynamic memory management and the
occasional code page conversion (and since this may reduce the number
of code page conversions when working with "non-native" strings, it
can also be a performance win).
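
The dispatch idea described above can be sketched roughly as follows.
This is only an illustration, not an actual RTL routine. It assumes a
compiler with D2009-style code-page-aware ansistrings, where
RawByteString is ansistring($ffff) and StringCodePage/SetCodePage are
available (as in later FPC releases). CountCodePoints,
CountCodePointsUtf8 and CountCodePointsUtf16 are hypothetical names
invented for this example. Only UTF-8 gets a special-purpose version
here; everything else takes the generic round trip via UTF-16, whose
quality depends on the installed widestring manager (on Unix-like
targets that typically means including the cwstring unit).

```pascal
program EncodingDispatch;
{$mode objfpc}{$H+}

{ Hypothetical special-purpose version: counts code points in UTF-8 data
  by skipping continuation bytes (bit pattern 10xxxxxx). }
function CountCodePointsUtf8(const s: RawByteString): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  for i := 1 to Length(s) do
    if (Ord(s[i]) and $C0) <> $80 then
      Inc(Result);
end;

{ Hypothetical generic version: works on UTF-16 and skips the second
  half (low surrogate) of each surrogate pair. }
function CountCodePointsUtf16(const u: UnicodeString): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  for i := 1 to Length(u) do
    if (Ord(u[i]) < $DC00) or (Ord(u[i]) > $DFFF) then
      Inc(Result);
end;

{ Entry point taking ansistring($ffff): dispatch on the code page the
  string carries at run time. }
function CountCodePoints(const s: RawByteString): SizeInt;
begin
  case StringCodePage(s) of
    CP_UTF8:
      Result := CountCodePointsUtf8(s);
  else
    { Not directly supported in this sketch: round trip via UTF-16. }
    Result := CountCodePointsUtf16(UnicodeString(s));
  end;
end;

var
  raw: RawByteString;
begin
  { Build the bytes C3 A9 (UTF-8 for e-acute) followed by 'a', then tag
    them as UTF-8 without converting the data. }
  SetLength(raw, 3);
  raw[1] := #$C3;
  raw[2] := #$A9;
  raw[3] := 'a';
  SetCodePage(raw, CP_UTF8, False);
  WriteLn(CountCodePoints(raw));
end.
```

Adding further special-purpose versions (e.g., for the system ansi code
page) would only mean more labels in the case statement; callers keep
passing whatever encoding they already have.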
Outside the RTL, the encoding mainly matters if you perform manual
low-level processing of a string (for i:=1 to length(s) do
something_with(s[i])). But in that case your code will either work
with only one encoding, and you have to enforce it via the parameter
type anyway, or it has to work with multiple encodings, and then you
can use a technique similar to what I described above for the RTL.
> That doesn't mean that a per unit switch is useless, but I think a
> target switch to fixate the bulk of the cases will save both us and
> the users a lot of grief.
It's not really clear to me which problem this would solve, but I may
be missing something.
Jonas