[fpc-devel] Unicode support (yet again)

Marco van de Voort marcov at stack.nl
Fri Sep 16 12:38:41 CEST 2011


In our previous episode, Felipe Monteiro de Carvalho said:

Note that this is all my opinion, not necessarily core's.

> On Thu, Sep 15, 2011 at 9:14 PM, Marco van de Voort <marcov at stack.nl> wrote:
> > The assignfile() etc routines are actually not the problem. The classes in
> > the classes unit are.
> 
> Ok, I may have exaggerated about the problems, but I still don't
> understand 100% your position. Where exactly is the frontier of how
> much utf8 support in the RTL is acceptable?

IMHO none. We have never made a choice for UTF8, and decided to go the
Delphi compatible way, which is both more compatible and more powerful.

And I believe that route is also more advantageous long term for Lazarus.
 
> What do you think about adding TStringsUTF8/TStringListUTF8 to classes.pas?

I think this is a slippery slope. These kinds of hacks are slipped in one by
one, and each one is only a small concession, but in the end it is a
disaster. I don't want to create and maintain UTF8 versions of nearly every
class, even when the class doesn't actually do anything UTF8 specific.

So in my opinion: NO!

The constant pressure from the Lazarus team was the main rationale to come
up with two RTLs, since the original unicode discussion on core (early 2009,
just before 2009 came out) came up with a type that was mostly UTF16 on
Windows. 

People always whined about overloading as a solution, but that won't work
because of virtual methods with string parameters or return values.

Moreover even if Lazarus decided to migrate to that, it would be totally
broken for a long while till it caught up.  The current route is friendlier.

In the UTF8 RTL, all "string"s _ARE_ utf8, unless specified otherwise (by
naming them unicodestring or ansistring(..encoding) or shortstrings).

So the same virtual method with a STRING parameter will be TUnicodestring
in the UTF16 rtl and UTF8string in the utf8 RTL.

An additional advantage is that Delphi code that just passes strings along
won't need modification (by replacing everything with -UTF8 versions).
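
To make the aliasing idea concrete, here is a rough sketch. It is not how
the RTL is (or will be) laid out. The define UTF8_RTL and the names
TRtlString, TNamed and TLoud are made up for illustration; only UTF8String
and UnicodeString are existing types. The point is that a descendant writes
one override, and the same source compiles against either string flavour.

```pascal
program aliasdemo;
{$mode objfpc}{$H+}

type
{$ifdef UTF8_RTL}                 // hypothetical define, not a real FPC symbol
  TRtlString = UTF8String;        // stands in for "string" in a UTF8 RTL
{$else}
  TRtlString = UnicodeString;     // stands in for "string" in a UTF16 RTL
{$endif}

  TNamed = class
  private
    FName: TRtlString;
  public
    procedure SetName(const AName: TRtlString); virtual;
    function GetName: TRtlString; virtual;
  end;

  TLoud = class(TNamed)
  public
    // A single override, valid for whichever flavour is compiled.
    procedure SetName(const AName: TRtlString); override;
  end;

procedure TNamed.SetName(const AName: TRtlString);
begin
  FName := AName;
end;

function TNamed.GetName: TRtlString;
begin
  Result := FName;
end;

procedure TLoud.SetName(const AName: TRtlString);
begin
  // Code that only passes strings along never names the encoding.
  inherited SetName(AName + '!');
end;

var
  n: TNamed;
begin
  n := TLoud.Create;
  try
    n.SetName('abc');
    WriteLn(Length(n.GetName));
  finally
    n.Free;
  end;
end.
```

With overloads instead of an alias, TLoud would have to override every
string variant separately, and GetName could not be overloaded at all since
the variants would differ only in result type.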

Most simple RTL routines that accept a string, but are not string type
specific (think fileopen, createdir etc) accept rawbytestring, a type that
accepts all ansistring types and unicodestring. IOW you can also pass a
UTF8 string to it, even in the UTF16 rtl.
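
A minimal sketch of the rawbytestring idea, assuming a compiler version that
declares RawByteString. ByteLen is a made-up helper, not an RTL routine; it
stands in for routines like fileopen that do not care which ansistring
flavour they are handed.

```pascal
program rawbytedemo;
{$mode objfpc}{$H+}

// Made-up helper: it only looks at the bytes it was given and never
// interprets them in a particular encoding.
function ByteLen(const s: RawByteString): SizeInt;
begin
  Result := Length(s);
end;

var
  a: AnsiString;
  u8: UTF8String;
begin
  a := 'abc';
  u8 := 'abc';
  // Both calls go to the same routine; no UTF8-specific variant is needed.
  WriteLn(ByteLen(a));
  WriteLn(ByteLen(u8));
end.
```

The sketch leaves out the unicodestring case on purpose, since how and when
that conversion happens is exactly the part that is still open.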

I hope though that Lazarus in time will see the light and change the Windows
port to the UTF16 RTL, since when the manual conversions are removed, the
number of places where encoding matters decreases significantly. (and the
places where the automatic ones happen can vary without code changes)

But that discussion is for after the big transitions, when all the current
manual hacks are removed, and as said that might only be when 2.8 is
released.

Actually I see only two solutions, either this or follow full Delphi/unicode
compatibility. It's a pity for Lazarus, but full Delphi/unicode
compatibility is an absolute must for me.   If not, we set ourselves totally
apart, since the number of "old" Delphis in use will decrease every year.

> What about stuff like this in classes:
> 
>   TReader = class(TFiler)
>     function ReadString: string;
>     function ReadWideString: WideString;
>     function ReadUnicodeString: UnicodeString;
> 
> Can a utf8string method be added where there are already multiple
> methods for various string types?

No UTF8 method. But "string" is utf8string in the UTF8 RTL.

But filer is special I guess, these methods might not be just converting
aliases.  It's too soon to say how all this will work out in (as I see it)
2.8. The rough design lines are visible and would be introduced bottom to
top.
 
> And regardless of the answers above, I think we should start a new
> package in fpc called libutf8 which can add all kinds of classes and
> routines for utf8.

I don't see the point of that. I don't see the reason to move the
workarounds for Lazarus' manual UTF8 conventions into the FPC repository,
which doesn't support those conventions.  Especially since it is only for
the 2.6 series that is already branched, because after that new solutions
will be available.




