[fpc-devel] Unicode support (yet again)

Fri Sep 16 13:39:51 CEST 2011

In our previous episode, Jonas Maebe said:
> > disaster. I don't want to create and maintain UTF8 versions of  
> > nearly every
> > class, even when the class doesn't actually do anything UTF8 specific.
> 
> If we support an UTF-8 version of the RTL, then either the code must  
> work both for UTF-16 and UTF-8, or it has to be separately maintained  
> anyway. And if the code works for both types, generics should enable  
> having both without source code duplication.

The -UTF8 hack exists simply because of inheritance, and because NOW they
still need to insert manual conversions.

With separate UTF8 and UTF16 RTLs, the inheritance problems go away by
selecting the proper RTL, and the conversions are automated thanks to cpnewstr.

> Note that I'm not specifically arguing here for adding them or not,  
> but I don't think the maintenance will be much higher than what will  
> be imposed already by requiring that the RTL be compilable for both  
> UTF-8 and UTF-16.

I'm not sure that is really the case, because only the really
encoding-sensitive places are affected in that case. Moreover, most of the
remaining changes will be necessary anyway because of Delphi 2009
functionality, even if we only do one RTL.

Stuff that takes input, like TStringList.LoadFromFile, will receive encoding
options for loading anyway as per Delphi/Unicode compat. (This is because the
encoding of the file that you load must be runtime selectable, and is not
necessarily the same as your default encoding.)
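
A minimal sketch of what "runtime selectable" means here, assuming the
Delphi 2009-style TStrings.LoadFromFile/SaveToFile overloads that take a
TEncoding (an FPC release that ships these overloads is assumed; CountLines
and WriteSample are hypothetical helpers made up for the illustration):

```pascal
program LoadWithEncoding;
{$ifdef FPC}{$mode delphi}{$endif}

uses
  SysUtils, Classes;

// The file encoding is an ordinary run-time argument, independent of
// whatever the default string type of the RTL happens to be.
function CountLines(const FileName: string; Enc: TEncoding): Integer;
var
  Lines: TStringList;
begin
  Lines := TStringList.Create;
  try
    Lines.LoadFromFile(FileName, Enc);
    Result := Lines.Count;
  finally
    Lines.Free;
  end;
end;

// Hypothetical demo helper: writes a two-line file in the given encoding.
procedure WriteSample(const FileName: string; Enc: TEncoding);
var
  Lines: TStringList;
begin
  Lines := TStringList.Create;
  try
    Lines.Add('first');
    Lines.Add('second');
    Lines.SaveToFile(FileName, Enc);
  finally
    Lines.Free;
  end;
end;

begin
  WriteSample('demo.txt', TEncoding.UTF8);
  WriteLn(CountLines('demo.txt', TEncoding.UTF8));
end.
```

The caller code is the same whichever RTL flavour it is compiled against;
only the Enc argument says how the bytes on disk are to be read.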

So I don't expect recompiling the RTL for UTF8 and UTF16 to require that
many encoding-specific changes. Basically it just parameterises the class
trees with a string type, and maybe changes the names of some RTL string
routines, which must be made for utf8 and utf16 anyway. (So, more or less,
in the UTF8 rtl e.g. trimleft is utf8 and trimleftutf16 is utf16, while in
the UTF16 rtl it is the other way around, but one still has to create both
in any solution.)
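
As a rough sketch of "parameterising the class tree with a string type":
the define RTL_UTF16 and the names RtlString, RtlChar and TNameHolder are
all hypothetical, picked only for this illustration, and this is not a
description of how the RTL is actually built.

```pascal
program StringTypeSelect;
{$ifdef FPC}{$mode objfpc}{$H+}{$endif}

// Hypothetical build switch: remove the dot to build the "UTF16 flavour".
{.$define RTL_UTF16}

type
{$ifdef RTL_UTF16}
  RtlString = UnicodeString;
  RtlChar   = WideChar;
{$else}
  RtlString = UTF8String;
  RtlChar   = AnsiChar;
{$endif}

  // Written once against RtlString; no separate UTF8/UTF16 twin class.
  TNameHolder = class
  private
    FName: RtlString;
  public
    constructor Create(const AName: RtlString);
    function CodeUnitSize: Integer;
    property Name: RtlString read FName;
  end;

constructor TNameHolder.Create(const AName: RtlString);
begin
  inherited Create;
  FName := AName;
end;

function TNameHolder.CodeUnitSize: Integer;
begin
  // 1 for the UTF8 flavour, 2 for the UTF16 flavour
  Result := SizeOf(RtlChar);
end;

var
  Holder: TNameHolder;
begin
  Holder := TNameHolder.Create('abc');
  try
    WriteLn('code unit size: ', Holder.CodeUnitSize);
  finally
    Holder.Free;
  end;
end.
```

The class source is identical for both builds; only the alias at the top
changes, which is the point of picking the RTL instead of suffixing names.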

The whole point is to avoid messing too much with overloading umpteen
string types, and to minimize changing existing code (FPC, Delphi and
Delphi/Unicode alike) with suffixes (-UTF8 etc). Of course when one wants to
blend UTF16 Delphi code into a UTF8 rtl, fixes might be needed, but at least
if your code is _mostly_ UTF16, you have the option to go to the utf16 rtl.

And then specifically avoid modifications to virtual methods. Having
multiple versions of a virtual method (overloaded under the same name or
not) is very dangerous, since people might override only one of them, and
the wrong one at that (just see the seek32/64 case).
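
To make the hazard concrete, a small self-contained sketch. TBaseStream and
TMyStream are hypothetical classes, only loosely modelled on the 32/64-bit
Seek pair; this is not the actual TStream code.

```pascal
program SeekOverloadHazard;
{$ifdef FPC}{$mode objfpc}{$H+}{$endif}

type
  // Hypothetical base class with two virtual variants of the same method.
  TBaseStream = class
  public
    function Seek(Offset: LongInt): string; overload; virtual;
    function Seek(const Offset: Int64): string; overload; virtual;
  end;

  // The descendant author only knew about the 32-bit variant.
  TMyStream = class(TBaseStream)
  public
    function Seek(Offset: LongInt): string; override;
  end;

function TBaseStream.Seek(Offset: LongInt): string;
begin
  Result := 'base32';
end;

function TBaseStream.Seek(const Offset: Int64): string;
begin
  Result := 'base64';
end;

function TMyStream.Seek(Offset: LongInt): string;
begin
  Result := 'derived32';
end;

// Calls through the base type with a 32-bit offset.
function SeekWith32(S: TBaseStream): string;
var
  Offset: LongInt;
begin
  Offset := 10;
  Result := S.Seek(Offset);
end;

// Calls through the base type with a 64-bit offset.
function SeekWith64(S: TBaseStream): string;
var
  Offset: Int64;
begin
  Offset := 10;
  Result := S.Seek(Offset);
end;

var
  S: TBaseStream;
begin
  S := TMyStream.Create;
  try
    WriteLn(SeekWith32(S));  // derived32: the override is used
    WriteLn(SeekWith64(S));  // base64: the override is silently bypassed
  finally
    S.Free;
  end;
end.
```

Nothing warns the descendant author here: the code compiles, and which
implementation runs depends only on the argument type the caller passes.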

Moreover, nobody is wronged in the sense that "his" choice is second rate
and must go through conversion layers, and the general principles are clear
for everybody, without exhausting discussions at every single modification.

This comes at the expense of a few more release binaries to build, and of
dealing with bug reports against a different RTL than the one you would
typically use. I admit that, but I still think it is a net plus by a wide
margin, and the _extra_ work that needs to be done on the code itself is
consistently overestimated.

If we remain Delphi compatible and only allow UTF16, we will over time get
bug reports asking for UTF8 variants of everything. Better to require people
to submit a working solution for both from the beginning.

Every project (Lazarus, Delphi/Unicode compat, ansi legacy) picks an RTL fit
for its purpose, and just uses STRING for the bulk of the code.