cpstrnew branch (was Re: [fpc-devel] Freepascal 2.4.0rc1 released)

Thu Nov 12 11:47:47 CET 2009

On Thu, November 12, 2009 08:56, Marco van de Voort wrote:
> In our previous episode, Tomas Hajny said:
>> > > supported codepages in the next version of MS Windows (or that they
>> > > don't support a different list in some special version, like a
>> > > version for the Chinese market) breaking your selection of "50 free
>> > > values in Windows range"?
>> >
>> > In that unlikely case, change the range.
>>
>> That raises a question whether incompatibility between two FPC versions
>
> Incompatibility how exactly? Two different FPC versions are already not
> compatible.

If you need to change the used range between e.g. FPC 2.6.x and 2.8.x (due
to MS extending their use of the codepage values into the range we decided
to use in FPC), this makes 2.6.x and 2.8.x incompatible with each other,
right?

>> is better than incompatibility between FPC and Delphi (caused by tight
>> connection between Delphi and one particular platform)...
>
> That would be source incompatibility, and therefore much worse.

First, this may be the case for compatibility between two FPC versions
too. Second, the numeric values appearing in FPC sources are not
necessarily the same as the values the compiler uses in the internal
representation in memory (which is possibly only valid for the particular
platform); whether they have to match depends on the use cases, of course.

>> > Like about 50/280. That's the point of "most used". For the less
>> > likely ones, define constants to the windows codepages.
>>
>> I don't understand what you mean by "define constants to the windows
>> codepages".
>
> The 16-bit range is split between a short FPC range and a long
> Delphi/Windows range. Rarely used codepages use the windows codepage
> number, and if foreign OSes support that, they must implement a
> windows2local codepage number conversion.

As far as I'm concerned, I'm fine with providing a translation table
between Windows codepages and individual platforms (e.g. OS/2). However,
I'm less comfortable with having to use this translation at runtime on all
platforms except Windows, and I'm somewhat worried about having no
solution for character sets which may be used e.g. for the console on
non-Windows platforms but are not supported by Windows. Have a look at the
URL sent by Jonas yesterday for Mac OS X: without having performed a
complete comparison, it seemed to contain some character sets not listed
on the MSDN page for Windows.
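
To make the concern concrete, here is a minimal sketch of such a
translation table for a platform that identifies character sets by name.
It is written in Python purely for illustration and is not FPC code. The
table contents, the function names and the placeholder charset name used
in the note after the code are my own assumptions, not part of any
proposal in this thread.

```python
# Illustrative table: Windows codepage number -> native charset name.
# Only a few well-known pairs; a real table would be maintained per platform.
WINDOWS_TO_NATIVE = {
    1252: "windows-1252",
    20127: "US-ASCII",
    28591: "ISO-8859-1",
    65001: "UTF-8",
}

# Reverse table, built once, for the native -> Windows direction.
NATIVE_TO_WINDOWS = {name: cp for cp, name in WINDOWS_TO_NATIVE.items()}


def windows_to_native(codepage):
    """Return the native charset name, or None if the platform lacks it."""
    return WINDOWS_TO_NATIVE.get(codepage)


def native_to_windows(name):
    """Return the Windows codepage number, or None if Windows has no
    number for this charset (the case a Windows-based scheme cannot
    represent)."""
    return NATIVE_TO_WINDOWS.get(name)
```

With this table, windows_to_native(65001) yields "UTF-8", while
native_to_windows("X-HYPOTHETICAL-CHARSET") yields None: a character set
known only to the native platform has no Windows number to be stored
under, and both lookups happen at runtime on every non-Windows platform.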

>> of certain constants, I can imagine that we should be able to find a
>> gap in the windows character set numbering to cover at least all the
>> character sets registered by IANA.
>
> Implementing at all only makes sense if OSes implement them exactly.
> Several Windows codepages might map to corresponding IANA sets.

Do you have some examples of this case?

>> However, we need to provide mapping between the MS Windows character set
>> number and the native character set number for all character set numbers
>> defined in Windows and supported by the particular platform, otherwise
>> the compatibility argument doesn't hold any longer, does it?
>
> Just like that you must be able to map the IANA sets to actually supported
> sets on all platforms.

Yes, absolutely. The only potential advantage of IANA numbers would be
guaranteed compatibility across future FPC versions, without the risk that
we need to "remap" the codepage numbers in the future due to MS or some
other vendor changing the use of their platform-specific constants. I'm not
saying that this is a must or necessarily the best option, just an option
we may want to consider depending on the use cases (see below).
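
As a rough illustration of that option (Python again, only as a sketch;
the table and function names are hypothetical), the IANA MIBenum value
would be the stable identifier and each platform would supply its own
table. The four pairs shown are well-known ones; everything else about
the layout is an assumption.

```python
# IANA MIBenum -> Windows codepage number (a few well-known pairs only).
MIBENUM_TO_WINDOWS = {
    3: 20127,    # US-ASCII
    4: 28591,    # ISO-8859-1
    106: 65001,  # UTF-8
    2252: 1252,  # windows-1252
}


def to_platform_codepage(mibenum, table=MIBENUM_TO_WINDOWS):
    """Translate the stable identifier into a platform-specific number.

    Raises KeyError if the platform table has no entry for it.
    """
    return table[mibenum]
```

If a vendor ever changed its numbering, only the right-hand side of such
a table would change; the identifiers stored in sources or streams would
stay the same.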

>> > Note that is all just a guesstimate on the size of the free ranges.
>> > But I'd rather not expand that too much.
>>
>> I'm pretty sure that Windows actually supports fewer character sets
>> than what is defined in IANA. Since Windows already uses word values,
>> there should be fairly large gaps. Looking at the MSDN documentation
>> (http://msdn.microsoft.com/en-us/library/dd317756.aspx), there are
>> 152 values defined altogether and there's currently e.g. just a
>> single value used in the 3xxxx range, no value in 4xxxx, nothing
>> between 38 and 436 (probably rather unlikely to change, I'd expect
>> changes rather in other areas), nothing between 1362 and 9999, etc.
>
> If the ranges are large enough we can try to fit them in all somewhere.
> But this means the lesser used codepages are also in twice, blowing up
> lookup tables or codepages.

Yes. Either at compile time (where it makes no difference at all), or
possibly also at runtime where this means something like 1600 bytes on
32-bit platforms (assuming 200 records with 2 fields of 4 bytes each).
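
The arithmetic behind that figure, as a small sketch (Python used only as
a calculator here; the record layout of two unsigned 32-bit fields is the
assumption stated in the estimate, not an actual FPC structure):

```python
import struct

# One record: two 4-byte fields ("<II" = two unsigned 32-bit integers,
# little-endian, no padding).
RECORD_SIZE = struct.calcsize("<II")
RECORD_COUNT = 200  # the assumed number of records from the estimate

TABLE_BYTES = RECORD_COUNT * RECORD_SIZE
print(TABLE_BYTES)  # 200 * 8 = 1600
```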

>> > > as I understand it, at least console character set information is
>> > > provided using charset name provided in an environment variable
>> > > there)?
>> >
>> > Put them in the table too, for Unix.
>>
>> From a certain perspective, these text versions may be useful for all
>> platforms (imagine HTML character set declarations).
>
>> However, there's a risk that they may not be used completely
>> consistently across all platforms (IANA definitions allow quite a few
>> alternative versions for the character set names). BTW, the above
>> mentioned MSDN page also refers to some string identifier supposedly
>> used for .NET, so I suspect that these will sooner or later become
>> supported by Delphi too somehow. ;-)
>
> I'd wait till this is entirely sure before exposing these names, and only
> on platforms that need them. Otherwise we find ourselves with 3 strings
> per codepage on all platforms before long in any library.
>
> Moreover, many OSes might already provide a way to resolve numbers to
> names.

Could be. If we need to maintain them anyway, we might also provide them
as platform-independent functionality (possibly as an optional additional
unit, "just" based on the same include file that defines the mapping for
platforms which need it at runtime anyway because they have no numeric
values associated with the supported character sets).

>> > But what is the alternative? Delphi incompatibility? Everything
>> > homemade and incompatible?
>>
>> We could e.g. use the MIBENUM number defined by IANA as our primary
>> identifier, that is not homemade. But the main point is IMHO
>> understanding how these values are used (in FPC). If they're mainly
>> used for checking whether the string stored in memory in some
>> character set needs to be converted before e.g. I/O operations via
>> console then we may actually prefer using platform specific constants
>> (i.e. different values for the same character set on different
>> platforms) because that doesn't require any conversion (well, at
>> least on platforms defining console codepages using numeric values).
>> If we want/need to store these constants when storing strings to file
>> streams and make the resulting files portable across platforms then
>> we obviously need to use the same constants for all platforms. If we
>> assume a need to use the same stored streams in both Delphi and FPC
>> programs then this needs to be compatible between Delphi and FPC (are
>> they compatible in other aspects?). As you can see, I'm still not
>> that clear on the use cases at the moment.
>
> It would greatly confuse FPC-Delphi projects for a nearly sterile benefit.
> The problem is not even the change itself, but actually hunting them down,
> ifdefing them, getting the changes accepted etc.

I'm afraid that you haven't helped me too much with my questions regarding
the use cases. I'm still convinced that we should understand them first
before deciding on the FPC implementation (e.g. whether we translate some
Windows/Delphi constants to the platform specific codepage numbers at
compile time or at runtime).

Tomas