[fpc-pascal] Unicode file routines proposal

Mattias Gaertner nc-gaertnma at netcologne.de
Tue Jul 1 10:56:15 CEST 2008


On Tue, 1 Jul 2008 10:33:28 +0200 (CEST)
marcov at stack.nl (Marco van de Voort) wrote:

> > On Tue, 1 Jul 2008 09:23:52 +0200 (CEST)
> 
> (note that this is all IMHO, not necessarily core viewpoint)

Same for me: my views are not necessarily those of the Lazarus core.

 
> > Are we talking about one encoding per platform or two encodings for
> > all platforms?
> 
> My proposition was: Two encodings, two stringtypes for all. 

Both at the same time?


> Florian's stand was thinking about one stringtype that supports both
> encodings. I don't like this, but we can only discuss that if Florian
> has more details about his ideas.

I think Marc had a similar idea: adding an encoding field (e.g. in
front of the length). But IMO it has some drawbacks.
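
A minimal sketch of what such an encoding field could look like, written
in Python rather than Pascal and purely for illustration. The header
layout (a 2-byte encoding tag followed by a 4-byte length), the tag
values and the function names are all assumptions, not anything Marc or
Florian specified. One drawback it makes visible: every consumer has to
inspect the tag before it can interpret the payload.

```python
import struct

# Hypothetical encoding tags; the values are made up for this sketch.
ENC_UTF8 = 1
ENC_UTF16LE = 2

CODECS = {ENC_UTF8: "utf-8", ENC_UTF16LE: "utf-16-le"}

# Assumed header layout: 2-byte encoding tag, then 4-byte payload
# length in bytes, both little-endian and unpadded.
HEADER = struct.Struct("<HI")


def pack_string(text, enc):
    """Encode text and prepend the (encoding, length) header."""
    payload = text.encode(CODECS[enc])
    return HEADER.pack(enc, len(payload)) + payload


def unpack_string(blob):
    """Read the header, then decode the payload with the tagged codec."""
    enc, length = HEADER.unpack_from(blob, 0)
    payload = blob[HEADER.size:HEADER.size + length]
    return payload.decode(CODECS[enc])


def convert(blob, target_enc):
    """Re-encode only when the tag differs from the requested encoding."""
    enc, _ = HEADER.unpack_from(blob, 0)
    if enc == target_enc:
        return blob
    return pack_string(unpack_string(blob), target_enc)
```

A routine that needs UTF-16 would call convert(s, ENC_UTF16LE) on entry;
if the tag already matches, the same buffer is returned untouched.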

 
> Maybe to keep the RTL from getting unwieldy, don't implement the more
> exotic routines (like soundex and some of the more complicated
> formatting routines) for all encodings. This will also allow
> optimizing binary size a bit. (Don't want all the routines twofold?
> Add a define when compiling the RTL, and the relevant encoding's
> include files are not included and the compiler inserts conversions.)
> Not that I think that is that important (except for CE maybe).
> 
> Note that I want to do some (if not most) of the RTL string routine
> work. I like doing them, they are like little puzzles. Though it will
> take some time to get proficient in Unicode string coding.
> 
> > Under Unix the encoding preference is clear: UTF-8.
> > Under Windows there are a lot of current code page texts and the
> > UTF-16 W functions. So, what encoding is the preference under
> > Windows? UTF-16 plus Ansi like the A and W functions?
> 
> Split the win32 target into
> - a win9x-compatible port with legacy codepages, using -A
> - an NT Unicode port that is strictly -W.
> 
> The ports can share nearly all code, and use/perfect the IFDEF
> UNICODE remains that already exist. (IOW the NT/UNICODE port defines
> UNICODE, the other does not.)

I guess that means only one at a time.

 
>[...]
> > Auto conversion of the strings in a TStringList does not make much
> > sense (and will break a lot of code). That's why I propose to keep
> > one default string type.
> 
> > If almost everything uses one string type, then no
> > conversion will take place. 
> 
> It will on every communication with the external world. IOW all my db
> exports will generally be UTF-8 on Unix and UTF-16 on Windows.

Maybe you misunderstood me here. This section is about the
multiple-encoding proposal. I was proposing to use only one string type
in the RTL/FCL. It can be a different one for each platform.
As long as almost everything uses only one string type, no conversion
takes place, and you can therefore store UTF-8 in widestrings, UTF-16
in strings, or any binary data. Just as it is at the moment. Strings
are not only text. I think this concept is very important in Pascal, and
breaking it will create a bigger incompatibility than CodeGear does
with its string to widestring move.
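
To illustrate the point that strings are not only text, here is a small
sketch in Python (used only for illustration; the function and variable
names are hypothetical). It simulates an implicit UTF-8 to UTF-16
conversion and back, as a runtime with two string types might insert on
assignment, and checks whether the bytes survive.

```python
def auto_convert_roundtrip(data):
    """Simulate an implicit UTF-8 -> UTF-16 -> UTF-8 conversion.

    Invalid UTF-8 sequences are replaced with U+FFFD, which is one
    plausible behaviour for an automatic conversion (an assumption).
    """
    as_utf16 = data.decode("utf-8", errors="replace").encode("utf-16-le")
    return as_utf16.decode("utf-16-le").encode("utf-8")


text = "Grüße".encode("utf-8")            # valid UTF-8 text
binary = bytes([0x00, 0xFF, 0xFE, 0x41])  # arbitrary bytes, not valid UTF-8

# True when the round trip returned exactly the bytes that went in.
text_ok = auto_convert_roundtrip(text) == text
binary_ok = auto_convert_roundtrip(binary) == binary
```

The text survives the round trip, the binary payload does not. With a
single string type nothing is converted, so both pass through unchanged.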

 
> This one size fits all attitude works fine for Lazarus, with only
> human latency to worry about, and small amounts of data, (and that is
> already a challenge to keep performant) but not for FPC as a whole,
> as all processing is hit severely by it.
> 
> Most notably, in the single string case, the only way to avoid the
> forced encoding is to do everything OS specific and manual. That is
> IMHO too poor.
> 
> > I think the main problem is that the RTL calls the Ansi functions
> > under Windows. Maybe we should not lose the focus.
> 
> This is not about losing focus but gaining it. Out with the
> evolutionary workarounds and start making decisions.

ok

 
> > > * Does not make one of the two core platforms (Unix/windows)
> > > effectively second rate.
> > 
> > Windows per se needs at least two encodings. So whatever is decided,
> > the Windows part needs some more work.
> 
> See above. If we have to support two totally different OS APIs (A
> and W) they are two different targets. Period.
> 
> This also avoids the mess of changing all Windows routines to be
> dynloaded, and hopefully lessens the mutual breaking a bit.

Two different Windows targets. Wow, a big step.

Mattias


