[fpc-devel] cpstrrtl/unicode branch merged to trunk

Sat Sep 7 08:38:12 CEST 2013

On 07 Sep 2013, at 01:39, Hans-Peter Diettrich wrote:

> Jonas Maebe schrieb:
>> On 06 Sep 2013, at 13:54, Hans-Peter Diettrich wrote:
>>> That conversion IMO is done by every concatenation, apart from
>>> subroutine considerations.
>> I think you mean "afaik" rather than "IMO".
> 
> I don't talk about concrete code, so I cannot know anything.

"In my opinion" means "I know the facts, and this is my opinion about those facts". "As far as I know" means "to the extent of my (possibly limited/incomplete/wrong) knowledge" or, more or less, "I think/believe".

>> Anyway, the resulting code page of a concatenation is normally the
>> code page of whatever you assign the string to (so if you assign to
>> an utf8string, the resulting code page will be CP_UTF8).
> 
> Maybe. I don't know how a concrete compiler handles string concatenations.

Now you do. It's like that both in FPC and in Delphi.
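
A minimal sketch to check this yourself, assuming a compiler with code page aware ansistrings (current FPC trunk). The type name cp1252string is made up for this example. If the rule above holds, the program should report 65001 (CP_UTF8) regardless of the code pages of the operands:

```pascal
program concatcp;
{$mode objfpc}{$H+}

type
  // cp1252string is a hypothetical name, used only for illustration
  cp1252string = type AnsiString(1252);

var
  a: cp1252string;
  b: AnsiString;
  u: UTF8String;
begin
  a := 'abc';
  b := 'def';
  // The destination is a UTF8String, so the concatenation result
  // is expected to get the destination's code page
  u := a + b;
  writeln(StringCodePage(u));
end.
```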

>>> Delphi has overloaded functions for RawByteString and
>>> AnsiString(0).
>> Does this really compile in Delphi?
> 
> It compiles, of course.

I think that a "can't decide which overloaded function to call" error would be just as logical.

> Unfortunately I cannot test the outcome in detail, due to the broken UTF-8 implementation in my Delphi XE.

The example program I posted does not contain any actual UTF-8 strings (only an utf8string parameter, but if even that causes problems, you can replace it with any other custom ansistring type).

>> Besides, utf8 *overloads* would be useless since even if the above
>> compiles and has some sensible behaviour, such overloads would only
>> be called if you pass in an utf8string. That already works correctly
>> today. The problem is when passing in concatentations of strings.
> 
> The overloads are required to prefer existing implementations, in favor of clumsy RawByteString or overkill UTF-16 strings.

There is nothing particularly clumsy about the RawByteString implementations in our RTL.

> But there exist special situations, essentially defeating the use of RawByteString arguments at all. Delphi (XE) has several broken functions, which return wrong indices when passing in *other* than the explicitly handled CP_ACP strings. All functions, returning indices (lengths...) for the *original* string must *never* convert its encoding. I discussed this already a long time ago, when reporting my first experiences with Delphi XE.

Please look at and/or test the FPC code, then present your findings about what is actually implemented for FPC.

> Please tell me how a RawByteString can be handled *efficiently*, as soon as its encoding is *not* CP_ACP [or cpUTF8 if implemented explicitly].

By ignoring the encoding, because most routines don't care about the actual character values. Those that do (at least the ones mentioned in the commit message) only care about non-control characters in the ASCII range, which are identical in all encodings (except for EBCDIC, which is not yet supported; I forgot to mention that) and hence those routines don't need special code either.
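
As an illustration of that approach (a sketch only, not the actual RTL code): LastDelimPos below is a hypothetical helper that searches for an ASCII delimiter byte by byte and never looks at the code page. It relies on the assumption stated above, namely that ASCII-range characters are encoded identically in the supported code pages. Because nothing is converted, the returned index refers to the original string:

```pascal
program asciiscan;
{$mode objfpc}{$H+}

// Hypothetical helper for illustration, not an RTL routine.
// Returns the index of the last occurrence of delim in s, or 0 if absent.
function LastDelimPos(const s: RawByteString; delim: AnsiChar): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  // Plain byte comparison; the code page of s is never inspected
  for i := Length(s) downto 1 do
    if s[i] = delim then
    begin
      Result := i;
      exit;
    end;
end;

var
  u: UTF8String;
begin
  u := 'dir/file.txt';
  // A UTF8String can be passed to a RawByteString parameter as-is
  writeln(LastDelimPos(u, '/'));  // prints 4
end.
```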

>> The DefaultFileSystemCodePage only exists to ensure
>> that a) no data is lost between you giving a string to an RTL routine
>> and calling a single byte OS API b) file names returned by single
>> byte OS APIs are correctly handled/interpreted by the RTL and get
>> their code page set correctly
> 
> So why do you only specify a codepage, without also declaring an accordingly encoded string type? Such a type would allow for automatic conversion (if ever required) of file/directory name arguments in RTL functions. Which string type would you use for such arguments?

All of those routines return a "RawByteString". You can assign such a string to any ansistring type (including a plain ansistring) and the compiler will not insert any kind of code page conversion (and the string's code page will also remain the same, even if it is different from the declared code page of the string you assigned the value to). You can also pass those strings again as parameters to those routines without any data loss (or implicit conversion) because the routines have RawByteString parameters. If you have other string arguments you want to pass to those routines, their initial encoding will presumably depend on where those strings came from. The routines will convert them to the correct code page if necessary.
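
A small sketch of the assignment behaviour described above. GetName is a hypothetical stand-in for an RTL routine that returns a RawByteString, here one that happens to carry CP_UTF8. If the description holds, the program should report the code page of the returned data (65001) rather than the declared code page of the plain ansistring:

```pascal
program rawassign;
{$mode objfpc}{$H+}

// Hypothetical stand-in for an RTL routine returning a RawByteString
function GetName: RawByteString;
var
  u: UTF8String;
begin
  u := 'file.txt';
  // A RawByteString keeps whatever code page the source string has
  Result := u;
end;

var
  s: AnsiString;
begin
  // No code page conversion is expected on this assignment
  s := GetName;
  writeln(StringCodePage(s));
end.
```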

If you want to perform intermediate operations on such strings yourself rather than using standard RTL routines that take rawbytestring parameters (which all should preserve the data, otherwise file a bug report; and where possible they do so without conversions), use utf8string or unicodestring if you don't want to risk data loss. E.g., we also don't have special string types corresponding to the console input or output code pages, but when using readln(ansistringvar) the code page of ansistringvar will be set to the input code page of the console and it will contain data encoded in that code page regardless of its declared code page. And when using writeln(ansistringvar), the data will also be converted to the output code page of the console if it turns out to be different. If you perform intermediate operations on such data, use utf8string or unicodestring if you don't want data loss, or use RTL routines that take/return rawbytestring parameters.

Jonas