[fpc-pascal] Unicode file routines proposal

Tue Jul 1 11:21:57 CEST 2008

> On Tue, 1 Jul 2008 10:33:28 +0200 (CEST)
> > > all platforms?
> > 
> > My proposition was: Two encodings, two stringtypes for all. 
> 
> Both at the same time?

Yes, utf8string and utf16string. Whatever Tiburon introduces would be aliased
to utf16string, so that will be compatible on non-Windows platforms too. And
the UTF-16 Tiburon code can easily communicate with the outside world.

> > Florian's stand was thinking about one stringtype that supports both
> > encodings. I don't like this, but we can only discuss that if Florian
> > has more details about his ideas.
> 
> I think, Marc had a similar idea. Adding an encoding field (e.g. in
> front of the length). But IMO it has some drawbacks.

Yes. Any manual string handling, which already gets more difficult, also gets
more expensive. Array dereference (which ignores surrogates, but is still a
building block for string routine implementations) becomes expensive too, or
needs to be done with pointers.
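
To make that cost concrete, here is a minimal sketch of indexing into a string
that carries an encoding field. It is in Python rather than Pascal, it is not
taken from any RTL, and the names TaggedString and code_unit_at are
hypothetical, made up for this illustration only.

```python
from dataclasses import dataclass


@dataclass
class TaggedString:
    """Hypothetical string with an encoding field in front of the payload."""
    encoding: str   # "utf-8" or "utf-16-le"
    data: bytes     # raw code units


def code_unit_at(s: TaggedString, i: int) -> int:
    """Return the i-th code unit (0-based); every access branches on the encoding."""
    if s.encoding == "utf-8":
        return s.data[i]                                          # 1 byte per unit
    if s.encoding == "utf-16-le":
        return int.from_bytes(s.data[2 * i:2 * i + 2], "little")  # 2 bytes per unit
    raise ValueError("unsupported encoding: " + s.encoding)


# U+1D11E (musical G clef) lies outside the BMP.
clef = "\U0001D11E"
s16 = TaggedString("utf-16-le", clef.encode("utf-16-le"))
s8 = TaggedString("utf-8", clef.encode("utf-8"))

# Indexing by code unit ignores surrogates: index 0 is only half a character.
high = code_unit_at(s16, 0)   # high surrogate
low = code_unit_at(s16, 1)    # low surrogate
lead = code_unit_at(s8, 0)    # first of the four UTF-8 bytes
```

With a single fixed encoding per string type the branch in code_unit_at
disappears and the access is a plain array dereference again.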

> > It will on every communication with the external world. IOW all my db
> > exports will generally be UTF-8 on Unix and UTF-16 on Windows.
> 
> Maybe you misunderstood me here. This section is about multiple encoding
> proposal. So I was proposing to use only one string type in
> RTL/FCL. 

> It can be a different one for each platform.

Ok. That is somewhat different. One size fits all (UTF-16 everywhere) is not
an option for me. It's the way of the least resistance, but is more for
languages that have an ivory tower concept and want to keep the real world
at arm's length.

So then different platforms, different encodings. Actually that was my first
thought/proposal too, but that precludes any possible solution for Tiburon
compatibility before we even start, and introduces a portability barrier.
(Want to recompile for Linux? First fix all your UTF-16 string routines so
that they support UTF-8 under ifdef. That is a hard sell.)

IMHO that is not a sustainable situation in the long term, which is why I
changed to the two stringtypes solution.

That has some disadvantages too, most notably adding even more string types
and possible auto-conversion pitfalls. But I think it is an experiment that
should at least have been tried.

Note that this is totally separate from what Lazarus should do. Lazarus can
IMHO happily use the UTF16 string type exclusively. I'm concerned with the
base system.

> As long as almost everywhere only one string is used no conversion can
> take place and you can therefore store UTF8 in widestrings or UTF-16 in
> strings or whatever binary data.

It still requires manual conversion at the borders (any input or output to
system, libraries, disk). But a lot less, since only sources in an encoding
"foreign" to the system need manual conversion code inserted.

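As a rough sketch of what "conversion only at the borders" means (again in
Python, not Pascal; to_native and NATIVE are hypothetical names, and the
choice of UTF-16 on Windows and UTF-8 elsewhere is an assumption of this
example):

```python
import sys

# Assumption for this sketch: UTF-16 is "native" on Windows, UTF-8 elsewhere.
NATIVE = "utf-16-le" if sys.platform == "win32" else "utf-8"


def to_native(data: bytes, source_encoding: str, native: str = NATIVE) -> bytes:
    """Convert incoming bytes to the native encoding; a no-op if they already match."""
    if source_encoding == native:
        return data                  # same encoding: pass through untouched
    return data.decode(source_encoding).encode(native)
```

Only input in a "foreign" encoding pays for the decode/encode round trip;
everything already in the native encoding is passed through as-is.
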
> Just as it is at the moment. Strings are not only text. I think this
> concept is very important in pascal and breaking this will create a bigger
> incompatibility than Codegear does with its string to widestring move.

???

> > See above. If we have to support two totally different OS api's (A
> > and W) they are two different targets. Period.
> > 
> > This also avoids the mess of changing all windows routines to be
> > dynloaded, and hopefully lessen the mutual breaking a bit.
> 
> Two different windows targets. Wow, a big step.

Yes, but unavoidable in the long term IMHO, to avoid the situation we had
with Dos in years past, where the port was always trailing the Tier 1 ports
(though I saw that Giulio and Tomas managed to get it working again, but only
after releases of it were postponed).

W9x support is being dropped on all sides. However, for me that is not
necessary if we split the stuff now, while the w9x support is still of
decent quality. Even though w9x and NT are both Windows, in some ways they
differ more than e.g. FreeBSD and Linux.

Doing the split before major changes that require NT (read: Unicode, but also
e.g. symlink support?) will make the change more evolutionary, and branching
at a moment where the codebase is still proven to work on w32 will ensure
that it will have decent quality for quite some time.

In the long term it will also save a lot of work, like crazy attempts to
maintain the status quo with insane workarounds like dynloading all API
routines etc.