[fpc-devel] String and UnicodeString and UTF8String

Mon Jan 10 16:27:19 CET 2011

In our previous episode, Jonas Maebe said:

> >> If it's a D2009-style ansistring, does that matter?
> >
> > A lot of conversion, since it will use ansistring(0) so reading/writing
> > ansistring(cp_utf8) will force conversions. (0 means system encoding,
> > $FFFF means never convert)
> 
> Why should a tstringlist force ansistring(0)?

I mean that if you locally (for your units) set string=utf8string,
TStringList would still be ansistring(0) or whatever the default becomes
(and it could even be UTF16). Since most Lazarus components build on
TStrings, TStringList's ancestor, the same would go for them.

> Or does Delphi force it  to be that way?

In D2009+ it is unicodestring, period. Everything is unicodestring (UTF16),
ansistring (+ variants) are for legacy only, and people try to forget
shortstring as quickly as possible.

Backwards compatibility with pre-D2009 is essentially abandoned. I think they
didn't even try, for exactly the reasons I mean to address here.

I think in the planned Embarcadero cross-compile products, string will also
be utf-16 on OS X and Linux, if only because (1) it is easier, and Windows
remains dominant by far (including codebases that assume UTF16), and (2)
they plan to target Qt.

Keep in mind that soon it will no longer be possible to upgrade from an
ansistring-based version to a current one (and something like D5..D7 is
already no longer upgradable).  Embarcadero changed the upgrade rules.

From Delphi-related forums and mailing lists, I get the impression that most
full-time Delphi programmers have migrated to unicode, and the occasional and
legacy users have not. The gap between these two groups is widening, but
unlike Embarcadero, we will be dealing with significant portions of both
groups for a while (as new/existing users).
--------

So the question is how we are going to deal with this information: how to
avoid forcing a big bang like Embarcadero did, prepare to support both (or
more? see below) schemes for a while, _AND_ deal with the fact that UTF16 is
mostly alien on non-Windows.

For me, a mandatory UTF16 Unix is not an option, and neither is a mandatory
UTF8 Windows (D2009+ incompatible).

Since no single choice with one default type per target (or even one to rule
them all) will satisfy everybody, I was thinking about setting up multiple
targets.

Of course it is uncharted territory, and while I lean towards that solution,
it could be that there are hidden caveats.

> Conversion may indeed be required for output (input would only pass on  
> the encoding of the input if based on ansistring($ffff))

ansistring(0), system encoding would be more logical than $ffff. $ffff is
used more internally in string conversion routines and for strings that are
not strings.

But what does that mean on Windows, where the console encoding is OEMSTRING
and not ansistring(0) ?  

> but I think doing that only when necessary at the lowest level should be
> no problem.  Many existing frameworks work that way.

It touches every place where you touch the OS. But indeed one could try to
split this by making the classes utf8 or unicodestring depending on the OS.

And we have to deal with Windows, where the default is UTF16.

> > Besides that the usual three problems:
> >
> > - I don't know how VAR behaves in this case. (passing a
> >   ansistring(cp_utf8) to a "var ansistring(0)" parameter),
> 
> var-parameters may indeed pose a problem in case some parameters of
> OS-neutral routines are required to have a particular encoding specified.

> > - maybe overloading (only cornercases?) etc.
> 
> Possibly, although I guess there are probably rules for that (whether
> they are documented is another case though, probably...)

> > - inheritance. FPC defines base classes as ansistring(0) parameters, and
> >   Lazarus wants to inherit and override them with a different type. This
> >   will clash.
> 
> Why ansistring(0) for base classes? OS-level interfaces: yes, but why  
> base classes?

This is the core problem. Which solution will do for everybody
(legacy, Lazarus, Delphi/unicode), i.e. for ansistring(0),
ansistring(cp_utf8) and unicodestring alike?

And what do we do if e.g. Lazarus changes its mind and goes from utf8 to
utf16 on Windows (e.g. because Delphi/unicode users become the dominant
influx)?

And do we really want Lazarus' direction to settle this for everybody?

Or what if Embarcadero brings in a new Kylix-style product with a utf16 base
type?

I'm very reluctant to make a choice here, and say "insert conversions" if
something changes. I would build in some flexibility and potential
differentiation from the start. 

At least in principle. As said, we can see which combinations are popular
by release time.

> > I've thought long and hard about this. Since the discussion what the
> > dominant type should be won't stop anytime soon, and we probably will
> > have to support both UTF8 (*nix) and UTF16 (Windows and *nix/QT) as
> > basetypes in the long run, plus a time ANSI as legacy, the RTL has to be
> > prepared for it anyway, we might as well allow this on all platforms
> > from the start. (actually releasing them is a different question and
> > depends on manpower)
> 
> I agree that the RTL should work regardless of the used string encoding,
> but I don't see why a particular encoding should be enforced throughout
> the entire RTL rather than just using ansistring($ffff) almost everywhere.

That only solves the 1-byte case. And while it solves some of the
overloading problems deep in the RTL and frameworks, it might not be
applicable on a large scale, since afaik you need to test for codepages
manually inside the routine. IOW this can't be done in every routine with a
string parameter in the entire classtree.
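
To make "test for codepages manually" concrete, here is a minimal sketch.
It assumes D2009-style codepage-aware ansistrings (RawByteString,
StringCodePage, CP_UTF8) are available in the compiler. CountChars and
TLatin1String are made-up names, used only for illustration.

```pascal
program rawbytedispatch;
{$mode objfpc}{$H+}

type
  // Hypothetical single-byte codepage type, only for this illustration.
  TLatin1String = type AnsiString(28591);

// Hypothetical helper: counts characters in any string with 1-byte units.
// The parameter is RawByteString, so no conversion happens on the call and
// the routine itself has to inspect the dynamic codepage.
function CountChars(const s: RawByteString): SizeInt;
var
  i: SizeInt;
begin
  case StringCodePage(s) of
    CP_UTF8:
      begin
        // Count every byte that is not a UTF-8 continuation byte (10xxxxxx).
        Result := 0;
        for i := 1 to Length(s) do
          if (Ord(s[i]) and $C0) <> $80 then
            Inc(Result);
      end;
  else
    // Assumption: anything else is a single-byte codepage.
    Result := Length(s);
  end;
end;

var
  u: UTF8String;
  l: TLatin1String;
begin
  // Build both strings byte by byte so no source-file encoding is involved.
  SetLength(u, 2);
  u[1] := #$C3;
  u[2] := #$A9;   // U+00E9 encoded as UTF-8
  SetLength(l, 1);
  l[1] := #$E9;   // U+00E9 encoded as ISO-8859-1
  WriteLn(CountChars(u));
  WriteLn(CountChars(l));
end.
```

Both calls are meant to report a single character, but only because the
routine carries an explicit branch per codepage. Every routine taking a
rawbytestring would need a similar branch.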

Btw, while looking up rawbytestring I saw this in the Delphi help:

"Declaring variables or fields of type RawByteString should rarely, if ever,
be done, because this practice can lead to undefined behavior and potential
data loss."

> I also agree that we should strive to minimize the number of conversions
> in the RTL for some encodings (in particular indeed ansi, utf-8 and
> utf-16), but again this should not require a specially compiled RTL. E.g.,
> insert(ansistring($ffff)), delete(ansistring($ffff)), etc. can call to
> special-purpose versions for certain specific encodings of the input
> (e.g., for the three you mentioned), and only if the encoding is not
> directly supported or if different encodings are mixed then perform a
> round trip via some generic format (utf-16, utf-32, or something that
> depends on the platform).

How will you deal with e.g. Windows? Legacy is string=ansistring(0), D2009
is string=unicodestring (utf16).

These are not the same types, and inheritance and the other problems will
kill you if you attempt to combine them. We need two separate targets for
Windows anyway. Maybe three (if Lazarus persists with UTF8 on Windows).

The same goes if you have some large D2009 server app and want to migrate to
FPC/Unix.  While I don't like UTF16 on Unix, having it as an option might
make such a migration workable.

> This has the advantage that you always have all optimal implementations
> available, regardless of the platform or default string encoding.

But only for a handful of selected routines. The solution stops above the
procedural level of the RTL, and there is no solution for code that assumes
a string type other than the one the classes library was compiled with.

And there are three such cases:

- normal FPC and Delphi 2007- code: ansistring(0)
- Lazarus: ansistring=utf8
- Delphi 2009+: UTF16

shortstring is not really a primary stringtype anymore in the library sense,
so that matters less. The few units (dos, maybe fv/ide) are isolated.

Depending on Lazarus' choices, the number of cases might go down.

> Outside the RTL, the encoding mainly matters if you perform manual low-
> level processing of a string (for i:=1 to length(s) do
> something_with(s[i])). 

The RTL is not the only interface with the OS. There is also, e.g., a widget
set, which may be ansi, UTF8 or UTF16.

> But in that case your code will either work with only one encoding
> and you have to enforce it via the parameter type anyway, or it has to
> work with multiple encodings and then you can use a technique similar to
> what I described above for the RTL.

No. The two are not mutually exclusive. I see yours as an implementation
detail of how to write a relatively agnostic RTL.

IOW it means that we can change the basetype more easily, with less
conditional code.

> > That doesn't mean that a per unit switch is useless, but I think a
> > target switch to fixate the bulk of the cases will save both us and the
> > users a lot of grief.
> 
> It's not really clear to me which problem this would solve, but I may  
> be missing something.

Mainly the question of what the classtree will be, i.e. the main operating
type used in applications.  You always need two RTLs for that, since it can
be 1 or 2 byte, and even if you fixed it on one-byte encodings,
rawbytestring would force you to write case statements in each and every
procedure.
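
A small sketch of the "1 or 2 byte" point: Length and s[i] operate on code
units, so the same character takes a different number of elements depending
on the base type. It only assumes the existing ansistring and unicodestring
types; the variable names are illustrative.

```pascal
program codeunits;
{$mode objfpc}{$H+}

var
  a: AnsiString;     // 1-byte code units; here it carries raw UTF-8 bytes
  w: UnicodeString;  // 2-byte (UTF-16) code units
begin
  // U+00E9 spelled out numerically so the source-file encoding plays no role.
  SetLength(a, 2);
  a[1] := #$C3;
  a[2] := #$A9;
  SetLength(w, 1);
  w[1] := WideChar($00E9);

  // Length counts elements (code units), not characters.
  WriteLn('ansistring:    ', Length(a), ' element(s) of ',
          SizeOf(AnsiChar), ' byte(s)');
  WriteLn('unicodestring: ', Length(w), ' element(s) of ',
          SizeOf(WideChar), ' byte(s)');
end.
```

The same character is two 1-byte elements in one case and one 2-byte element
in the other, so a classtree compiled against one base type cannot simply be
reused with the other.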

Keep in mind that the Lazarus classtree is rooted in the RTL.