[fpc-devel] String and UnicodeString and UTF8String

Mon Jan 10 18:26:55 CET 2011

On 10 Jan 2011, at 16:27, Marco van de Voort wrote:

> In our previous episode, Jonas Maebe said:
> 
>> Why should a tstringlist force ansistring(0)?
> 
> I mean that if you locally (for your units) set string=utf8string,
> TStringList still would be ansistring(0) or whatever the default becomes. 

I meant: why not use ansistring($ffff) instead? You could even add a property to tstringlist that causes it to force the encoding of added strings to a particular code page whenever a string is added.
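
To make that suggestion concrete, here is a rough sketch of such a container. It is not the actual TStringList: the class name, the ForcedCodePage property and the NoForcedCodePage constant are all made up for illustration. It also assumes a compiler that implements Delphi 2009-style code-page-aware ansistrings, including RawByteString, StringCodePage and SetCodePage.

```pascal
program ForcedEncodingList;
{$ifdef FPC}{$mode objfpc}{$H+}{$endif}

const
  // Hypothetical marker meaning "do not force any encoding" ($ffff).
  NoForcedCodePage = $FFFF;

type
  // Hypothetical container, only meant to illustrate the idea.
  TRawStringList = class
  private
    FItems: array of RawByteString;
    FForcedCodePage: Word;
  public
    constructor Create;
    procedure Add(const s: RawByteString);
    // When set to a real code page, added strings are converted to it.
    property ForcedCodePage: Word read FForcedCodePage write FForcedCodePage;
  end;

constructor TRawStringList.Create;
begin
  inherited Create;
  FForcedCodePage := NoForcedCodePage;
end;

procedure TRawStringList.Add(const s: RawByteString);
var
  tmp: RawByteString;
begin
  tmp := s; // no conversion: tmp keeps the encoding of s
  if (FForcedCodePage <> NoForcedCodePage) and
     (StringCodePage(tmp) <> FForcedCodePage) then
    SetCodePage(tmp, FForcedCodePage, True); // convert the contents as well
  SetLength(FItems, Length(FItems) + 1);
  FItems[High(FItems)] := tmp;
end;

var
  list: TRawStringList;
  u: UTF8String;
begin
  u := 'abc';
  list := TRawStringList.Create;
  try
    list.Add(u);                 // stored as-is, nothing is forced
    list.ForcedCodePage := 1252; // from now on, force Windows-1252
    list.Add(u);                 // converted when it is added
  finally
    list.Free;
  end;
end.
```

With the property left at its default, the list never converts anything, which is the point of using ansistring($ffff) for storage.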

>> Or does Delphi force it  to be that way?
> 
> In D2009+ it is unicodestring, period. Everything is unicodestring (UTF16),
> ansistring (+ variants) are for legacy only, and people try to forget
> shortstring as quickly as possible.

Then a unicodestring version is certainly required, and an ansistring($ffff) version would have to be called differently.

> I think in the planned Embarcadero cross-compile products, string will also
> be utf-16 on OS X and Linux.  If only because it is (1) easier, and windows
> remains dominant by far (including UTF16 assuming codebases) (2) they plan
> to target QT. 

I think it's a good decision to keep it the same everywhere, since string=unicodestring is not an opaque type in any way. As a result, choosing a different string type on other platforms would probably break lots of code again. And regardless of which toolkit you target on Mac OS X, conversions will probably happen anyway. The encoding used by Carbon and Cocoa is not specified anywhere afaik, and the CFString/NSString they are based on can use any encoding internally (I guess that's probably also UTF-16 for ease of processing).

> For me, having a mandatory UTF16 Unix is not an option, and a mandatory UTF8
> Windows neither.  (D2009+ incompatible)

I don't think UTF-16 everywhere would be a big problem.

>> Conversion may indeed be required for output (input would only pass on  
>> the encoding of the input if based on ansistring($ffff))
> 
> ansistring(0), system encoding would be more logical than $ffff. $ffff is
> used more internally in string conversion routines and for strings that are
> not strings.

The fact that the formal return type is $ffff does not mean, afaik, that you also have to return something whose internal encoding is set to "$ffff". It can still be an ansistring(0), ansistring(OEMSTRING) or whatever. It simply means that the encoding won't be forced to anything in particular when you assign a value to the function result. If you then assign this function result to another variable (which may have a forced encoding), then a conversion will happen if the forced encoding is different from the actual one. If you assign it to another ansistring($ffff), no encoding change will happen in any case, and the destination string will "inherit" the source's encoding.
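
A small sketch of the behaviour described above, again assuming Delphi 2009-style semantics for RawByteString (= ansistring($ffff)). The PassThrough function and the TLatin1String type are hypothetical names used only for this illustration, and the source file is assumed to be saved as UTF-8.

```pascal
program RawByteDemo;
{$ifdef FPC}{$mode objfpc}{$H+}{$codepage utf8}{$endif}

type
  // Hypothetical type with a forced code page (ISO 8859-1).
  TLatin1String = type AnsiString(28591);

// The formal result type is RawByteString, so assigning to Result
// does not force the value into any particular encoding.
function PassThrough(const s: RawByteString): RawByteString;
begin
  Result := s;
end;

var
  u: UTF8String;
  r: RawByteString;
  l: TLatin1String;
begin
  u := 'é';
  r := PassThrough(u); // destination is $ffff: no conversion, r inherits the encoding of u
  l := PassThrough(u); // destination has a forced encoding: converted if it differs
  WriteLn('r: ', StringCodePage(r)); // dynamic code page carried by r
  WriteLn('l: ', StringCodePage(l)); // dynamic code page carried by l
end.
```

On non-Windows FPC targets, the actual conversion also depends on a widestring manager (e.g. the cwstring unit) being installed.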

> But what does that mean on Windows, where the console encoding is OEMSTRING
> and not ansistring(0) ?  

As I said: ansistring($ffff). 

>> but I think doing that only when necessary at the lowest level should be
>> no problem.  Many existing frameworks work that way.
> 
> It touches all places where you touch the OS. But indeed one could try to
> split this by doing the classes utf8 or tunicodestring depending on OS.

I'm not sure why you say "indeed", because I did not propose to do that. I only proposed keeping as many RTL interfaces as possible in ansistring($ffff) to have something that's
a) generic, and
b) with the least chance of resulting in encoding conversion

However...

> And we have to deal with Windows, where the default is UTF16.

... since Delphi 2009 uses (unicode)string everywhere, we need at least also unicode versions.

>> Why ansistring(0) for base classes? OS-level interfaces: yes, but why  
>> base classes?
> 
> This is the core problem. What solution will do for everybody
> (legacy,Lazarus,Delphi/unicode?) or (ansistring(0), ansistring(cp_utf8) or
> TUnicodestring) ?
> 
> And what do we do if e.g. Lazarus changes opinion and goes from utf8 to
> utf16 on Windows? (e.g. the Delphi/unicode becomes the dominant influx).
> 
> And do we really want Lazarus' direction to fixate this for everybody?
> 
> Or what if they bring in a new Kylix principle with utf16 base type?

A unicodestring version for Delphi-compatibility, and if required an ansistring($ffff) version for all other purposes (afaik that would also work with legacy ansistring=ansistring(0), although it's not yet clear to me what happens if you pass an empty ansistring(0) to a rawbytestring var-parameter -- is it still nil like with current ansistrings, or can you somehow extract its declared encoding?)

>> I agree that the RTL should work regardless of the used string  
>> encoding, but I don't see why a particular encoding should be enforced  
>> throughout the entire RTL rather than just using ansistring($ffff)  
>> almost everywhere.
> 
> That only solves the 1-byte case.

It's true that you probably need a separate overloaded version for unicodestring (just like we currently also have separate overloads for ansistring and unicodestring).

> And while that solves some of the
> overloading problems deep in RTL and frameworks, it might not be applicable
> on largescale, since afaik you need to test in the routine for codepages
> manually ? IOW this can't be done in every routine with a string parameter
> in the entire classtree

Most routines don't process strings themselves: they store them, pass them on (to routines that may process them) or return them. In those cases, you don't have to look at the encoding.

> Btw, while looking up rawbytestring I saw this in the Delphi help:
> 
> "Declaring variables or fields of type RawByteString should rarely, if ever,
> be done, because this practice can lead to undefined behavior and potential
> data loss."

They are right if you mainly care about code maintainability. If, however, you insist on supporting multiple encodings efficiently and transparently, there is no other option. The danger they are talking about mainly occurs when mixing rawbytestring and string literals. And even that could actually be solved by the compiler (the compiler could insert a conversion of the string literal to the actual encoding of the rawbytestring at run time, just like we currently do for mixing widestring constants and ansistring), but CodeGear chose not to do that, presumably for efficiency reasons.

> How will you deal with e.g. Windows? Legacy string=ansistring(0), D2009 is
> string=utf16 TUnicodestring?
> 
> These are not the same types, and inheritance and the other problems will
> kill you if you attempt to combine it. We need two separate targets for Windows
> anyway. Maybe three (if Lazarus persists in UTF8 in windows)

I think at most two are required for any target: unicodestring (D2009 compatibility), and if really necessary because somehow the unicodestring version causes too much overhead, an ansistring($ffff) version as well. That's only for the classes though; I think most of the base RTL can simply be ansistring($ffff).

>> Outside the RTL, the encoding mainly matters if you perform manual low-
>> level processing of a string (for i:=1 to length(s) do
>> something_with(s[i])). 
> 
> The RTL is not the only interface with the OS. Like e.g. a widget set that
> may be ansi,UTF8 or UTF16.

Changing the string type in your entire application and RTL only because a widgetset uses it does not make sense to me. Generally, you want to process everything in whatever format is most convenient, and only convert it to the required type once you actually communicate with the component.

>> It's not really clear to me which problem this would solve, but I may  
>> be missing something.
> 
> Mainly the question what the classtree will be. The main operating type used
> in applications.  You always need two RTLs for that, since it can be 1 or 2
> byte, and even if you fixated it on one byte encodings, rawbytestring would
> force you to write case statements in each and every procedure.

That last part is not true. It's only required in those cases where strings are directly manipulated and where the overhead is very important. And again: if you want to support compiling the RTL for different one-byte code pages that operate directly on those strings without any conversions, you have to write all that different code anyway. It's mainly a matter of replacing ifdef's with case statements.
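
As an illustration of what such a case statement could look like in one of the few routines that does manipulate the string contents directly, here is a minimal sketch. The CharCount function is hypothetical, and it assumes that StringCodePage and the CP_UTF8 constant are available as in Delphi 2009. The else branch further assumes that every other code page is a single-byte one, which does not hold for multi-byte code pages such as Shift-JIS.

```pascal
program CodePageCase;
{$ifdef FPC}{$mode objfpc}{$H+}{$codepage utf8}{$endif}

// Hypothetical routine: counts characters (code points) instead of bytes.
function CharCount(const s: RawByteString): Integer;
var
  i: Integer;
begin
  case StringCodePage(s) of
    CP_UTF8:
      begin
        // Count every byte that is not a UTF-8 continuation byte (10xxxxxx).
        Result := 0;
        for i := 1 to Length(s) do
          if (Ord(s[i]) and $C0) <> $80 then
            Inc(Result);
      end;
  else
    // Assumption: single-byte code page, one byte per character.
    Result := Length(s);
  end;
end;

var
  u: UTF8String;
begin
  u := 'héllo'; // source file assumed to be saved as UTF-8
  WriteLn(CharCount(u));
end.
```

Routines that only store strings or pass them on need none of this; the case statement is confined to the code that actually walks over the bytes.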

Jonas