[fpc-devel] Unicode support (yet again)

Marco van de Voort marcov at stack.nl
Thu Sep 15 21:14:56 CEST 2011


In our previous episode, Felipe Monteiro de Carvalho said:
> And I say more, two RTLs will immediately cause problems in all kinds
> of libraries. 

Why?

> Will the FCL work with the Ansi RTL, with the Unicode
> RTL, with both?

Generally both. The problematic packages are not coded in "string" but in
"utf8string" or "unicodestring".

> Won't that duplicate the difficulty in maintaining it
> if it has to be tested to work in both Ansi and Unicode mode? 

No. The few cases where it really matters will be exempted. The whole idea
is that the _general_ code follows suit, not _all_ code.

> Will all packages from FPC work perfectly in both modes?  Is that really
> feasable?  How?  {$ifdef UNICODE} everywhere that uses strings?

This is totally ignorant FUD. Please, if you don't understand the proposal
or Delphi's Unicode implementation, refrain from commenting on it.

> And what if people start writing packages for the Unicode RTL, without
> IFDEFs to be compatible with Ansi, then LCL users cannot use them?
> What if you have a package for Ansi and another for Unicode, they
> cannot be used at the same time?

The compiler remains fully UTF8 and UTF16 capable. Using the default
"string" type only means statistically fewer automated conversions. Properly
tagged strings keep working, but are less optimal outside their natural realm
(read: UTF8 on Windows and vice versa).

RTL routines like assign(file) will probably take rawbytestring, which allows
passing utf8string and utf16string in a way that remains zero copy for utf16
on Windows and for utf8 on Unix (assuming that is the default encoding on Unix).
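
A minimal sketch of the rawbytestring idea, assuming a compiler that
implements Delphi-2009-style code-page-aware strings (the UTF8String and
RawByteString types and the StringCodePage function). ShowName is a made-up
stand-in, not an actual RTL routine, and the sketch only covers the utf8 side:

```pascal
program rawdemo;
{$mode objfpc}{$H+}

// Hypothetical stand-in for an RTL routine such as Assign();
// this is NOT the real RTL code.
procedure ShowName(const FileName: RawByteString);
begin
  // A RawByteString parameter accepts a code-page-tagged string as it is,
  // so the callee can inspect the tag and decide itself whether a
  // conversion is needed for the platform API.
  WriteLn('code page ', StringCodePage(FileName), ', ',
          Length(FileName), ' bytes');
end;

var
  U8: UTF8String;
begin
  U8 := 'test.txt';
  ShowName(U8);  // handed over with its UTF8 tag, no conversion at the call
end.
```

The point of the design is that the decision to convert moves into the
routine, instead of being forced on every caller by the parameter type.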

Only for the classes lib this is harder, due to the fact that rawbytestring
has a cumbersome implementation, and thus is probably not fit for virtual
methods.

> This is not a stable solution, for Lazarus I see only 2 stable
> solutions assuming that my proposal is completely off the table:
> 
> 1> Migrate to UnicodeString
> or
> 2> Implement everything that we need in the LCL to substitute RTL
> routines which include string or file handling (and we are not really
> that far away from finishing that at all...)

IMHO these are not the only options. Lazarus should simply flag everything
as UTF8 and use the UTF8 RTLs as a base, and later reconsider whether to stay
that way long term or migrate to a native encoding for each platform. The
whole idea of the dual RTLs is that I'm not arrogant enough to pretend to
know how everything will work out with Lazarus (or MSEGUI, or how long people
will keep wanting to port ansi code). And don't worry, I can actually muster
quite some arrogance if necessary, just not THAT much arrogance.

It is just plain common sense that we can't develop as quickly as Delphi and
carry the whole transition into every corner of the project within at most a
year (even IF we all faced the same direction, we all have real lives and
practical matters with existing codebases).

The dual RTL model allows us to do this in a careful and planned manner,
without a big-bang period where Lazarus is unusable for half a year.
 
> Porting all of the libraries and applications which I use for
> UnicodeString will likely be so huge of a task that I don't see any
> feasability in this kind of thing.

IMHO if we don't go for the dual RTL model (which is mainly conceived
because of Lazarus' needs), then full Delphi/Unicode compatibility with
string=unicodestring is the only long-term solution.

While I prefer to avoid rough breaks, like Embarcadero did by declaring all 
existing codebases legacy, it is the only route that is sustainable long
term if we want to keep being Delphi compatible.

Going our own way will only be painless at first, and will cost us dearly in
the end. Remaining ansi-only might actually be an easier course then.

> We need to port SynEdit, the LCL,
> fpvectorial, regexpr... lots of applications which I wrote and nearly
> all of them do some kind of text parsing or handling...

You will have to change all string types that MUST contain utf8 to utf8string
and kill all conversions from other types.

That was the original idea, but that will bring you into trouble with
inheritance on the UTF16 platforms (where the classes unit will be UTF16). Hence
the idea of a dual RTL, at least for the transition period.
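
The inheritance problem can be sketched roughly as follows. This is only an
illustration: TBaseList and TUtf8List are made-up names, with TBaseList
standing in for a classes-unit class on a platform where that unit is UTF16.
An override has to repeat the base signature exactly, so UTF8 code cannot
simply substitute utf8string and ends up converting at the boundary:

```pascal
program inheritdemo;
{$mode objfpc}{$H+}

type
  // Hypothetical base class, standing in for a classes-unit type
  // whose virtual methods take UTF16 strings.
  TBaseList = class
  public
    procedure Add(const S: UnicodeString); virtual;
  end;

  // UTF8-based descendant: the override must keep the UnicodeString
  // parameter, it cannot be redeclared with UTF8String.
  TUtf8List = class(TBaseList)
  public
    procedure Add(const S: UnicodeString); override;
  end;

procedure TBaseList.Add(const S: UnicodeString);
begin
  WriteLn('base received ', Length(S), ' UTF16 code units');
end;

procedure TUtf8List.Add(const S: UnicodeString);
var
  U8: UTF8String;
begin
  U8 := UTF8Encode(S);  // back to UTF8 for the descendant's own storage
  WriteLn('descendant stored ', Length(U8), ' bytes');
end;

var
  L: TBaseList;
  Name: UTF8String;
begin
  Name := 'example';
  L := TUtf8List.Create;
  try
    L.Add(UTF8Decode(Name));  // UTF8 -> UTF16 conversion at the call
  finally
    L.Free;
  end;
end.
```

Every call through the base class pays two conversions here, which is the
kind of cost that tagging strings as utf8string alone does not remove.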
 
The assignfile() etc routines are actually not the problem. The classes in
the classes unit are.


