[fpc-pascal] Unicode file routines proposal

Tue Jul 1 10:33:28 CEST 2008

> On Tue, 1 Jul 2008 09:23:52 +0200 (CEST)

(note that this is all IMHO, not necessarily core viewpoint)

> Are we talking about one encoding per platform or two encodings for
> all platforms?

My proposal was: two encodings, two string types for all platforms.

Florian's position was to consider one string type that supports both
encodings. I don't like this, but we can only discuss it once Florian gives
more details about his ideas.

To keep the RTL from getting unwieldy, maybe don't implement the more
exotic routines (like soundex and some of the more complicated formatting
routines) for all encodings. This would also allow optimizing binary size a
bit. (Don't want all the routines twofold? Add a define when compiling the
RTL, so that the relevant encoding's include files are not included and the
compiler inserts conversions instead.) Not that I think that is very
important (except for CE maybe).

Note that I want to do some (if not most) of the RTL string routine work. I
like doing them, they are like little puzzles. It will take some time to get
proficient in Unicode string coding, though.

> Under Unix the encoding preference is clear: UTF-8.
> Under Windows there are a lot of current code page texts and the
> UTF-16 W functions. So, what encoding is the preference under windows?
> UTF-16 plus Ansi like the A and W functions?

Split the win32 target into
- a win9x-compatible port with legacy codepages, using the -A functions
- an NT Unicode port that strictly uses the -W functions.

The ports can share nearly all code, and use/perfect the IFDEF UNICODE
remains that are already there. (IOW the NT/UNICODE port defines UNICODE,
the other does not.)
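
As a rough illustration of the split above (not actual RTL code), here is a
sketch of how one shared source could serve both ports. The names TSysChar
and ApiSuffix are made up for this example; only the UNICODE define comes
from the text above. Compiling with -dUNICODE would select the NT branch.

```pascal
program TargetSplit;

{$mode objfpc}{$H+}

// Hypothetical sketch: one shared source, two ports selected by UNICODE.
// TSysChar and ApiSuffix are illustrative names, not RTL identifiers.

type
{$IFDEF UNICODE}
  TSysChar = WideChar;   // NT port: strictly the -W entry points
{$ELSE}
  TSysChar = AnsiChar;   // win9x-compatible port: the -A entry points
{$ENDIF}

const
{$IFDEF UNICODE}
  ApiSuffix = 'W';
{$ELSE}
  ApiSuffix = 'A';
{$ENDIF}

begin
  // Everything outside the two IFDEF blocks is shared between the ports.
  WriteLn('API family used: CreateFile', ApiSuffix);
  WriteLn('Bytes per system char: ', SizeOf(TSysChar));
end.
```

The point of the sketch is only that the choice is made once, at compile
time, per port, so nothing has to be looked up or loaded at run time.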

> > > * Potentially will have a higher performance than a single encoding
> > > system, but only if you use this new special string type
> > 
> > Certainly. Can you imagine loading a non-trivial file in a
> > TStringList and saving it again and the heaps of conversions?
> 
> Auto conversion of the strings in a TStringList does not make much
> sense (and will break a lot of code). That's why I propose to keep one
> default string type.

> If almost everything uses one string type, then no
> conversion will take place. 

It will, on every communication with the external world. IOW all my db
exports will generally be UTF-8 on Unix and UTF-16 on Windows.
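
To make the cost concrete, a minimal sketch of the conversions that a
UTF-16-only string type would force on a Unix program at every such
boundary. It assumes a current FPC where UTF8Decode and UTF8Encode from the
system unit work on UnicodeString; the variable names and the sample bytes
are only for illustration.

```pascal
program BoundaryConversion;

{$mode objfpc}{$H+}

var
  Utf8Data: UTF8String;      // what a file or database hands us on Unix
  Utf16Data: UnicodeString;  // what a UTF-16-only RTL would work with
begin
  // Build the two UTF-8 bytes of U+00E9 by hand, so the example does not
  // depend on the encoding of this source file.
  SetLength(Utf8Data, 2);
  Utf8Data[1] := #$C3;
  Utf8Data[2] := #$A9;

  // Conversion on the way in.
  Utf16Data := UTF8Decode(Utf8Data);
  WriteLn('UTF-8 bytes:       ', Length(Utf8Data));
  WriteLn('UTF-16 code units: ', Length(Utf16Data));

  // Conversion on the way out again.
  Utf8Data := UTF8Encode(Utf16Data);
  WriteLn('UTF-8 bytes again: ', Length(Utf8Data));
end.
```

With a native UTF-8 string type on Unix, both conversions (and the extra
allocations that come with them) would simply not happen.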

This one-size-fits-all attitude works fine for Lazarus, which only has human
latency and small amounts of data to worry about (and even that is already a
challenge to keep performant), but not for FPC as a whole, since all
processing is hit severely by it.

Most notably, in the single string case, the only way to avoid the forced
encoding is to do everything OS-specific and manually. That is IMHO too poor.

> I think the main problem is that the RTL calls the Ansi functions
> under windows. Maybe we should not lose the focus.

This is not about losing focus but gaining it. Out with the evolutionary
workarounds and start making decisions.

> > * Does not make one of the two core platforms (Unix/windows)
> > effectively second rate.
> 
> Windows needs per se at least two encodings. So whatever is decided, the
> windows part needs some more work.

See above. If we have to support two totally different OS APIs (A and W),
they are two different targets. Period.

This also avoids the mess of changing all Windows routines to be dynamically
loaded, and hopefully lessens the mutual breakage a bit.