[fpc-devel] cpstrrtl/unicode branch merged to trunk

Sat Sep 7 01:39:07 CEST 2013

Jonas Maebe wrote:
> On 06 Sep 2013, at 13:54, Hans-Peter Diettrich wrote:
> 
>> Jonas Maebe wrote:
>>> o merged cpstrrtl branch (includes unicode branch). In general,
>>> this adds support for arbitrarily encoded ansistrings to many
>>> routines related to file system access (and some others). 
>>> WARNING: while the parameters of many routines have been changed
>>> from "ansistring" to "rawbytestring" to avoid data loss due to
>>> conversions, this is not a panacea. If you pass a string
>>> concatenation to such a parameter and not all strings in this
>>> concatenation have the same code page, all strings and the result
>>> will be converted to DefaultSystemCodePage (= ansi code page by
>>> default).
>> That conversion IMO is done by every concatenation, apart from
>> subroutine considerations.
> 
> I think you mean "afaik" rather than "IMO".

I'm not talking about concrete code, so I cannot know anything.

> Anyway, the resulting code page of a concatenation is normally the
> code page of whatever you assign the string to (so if you assign to
> an utf8string, the resulting code page will be CP_UTF8).

Maybe. I don't know how a concrete compiler handles string 
concatenations. When using subroutines for that purpose, the Result type 
specifies the target encoding, not the variable to which the result is 
assigned subsequently.
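
A minimal sketch of what I mean, assuming a compiler with the merged cpstrrtl branch. JoinUtf8 and TCp1252String are names made up for this example, not RTL identifiers:

```pascal
program ResultEncoding;
{$mode objfpc}{$H+}

type
  // Hypothetical type for this sketch: an ansistring tagged with code page 1252
  TCp1252String = type AnsiString(1252);

// Hypothetical helper: the declared result type fixes the encoding of the result
function JoinUtf8(const A, B: UTF8String): UTF8String;
begin
  Result := A + B;
end;

var
  Target: TCp1252String;
begin
  // The function hands back UTF-8; storing it in Target needs a second conversion
  Target := JoinUtf8('abc', 'def');
  WriteLn('function result: ', StringCodePage(JoinUtf8('abc', 'def')));
  WriteLn('target variable: ', StringCodePage(Target));
end.
```

The two reported code pages should differ, which is the extra conversion step I am concerned about.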

> RawByteString is different in two ways: a) if all concatenated
> strings have the same code page, the result also gets that code page 
> b) if there are different code pages involved, the result gets
> DefaultSystemCodePage

ACK. This means that the result may have to be converted once again, 
before assigning it to the final target.
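
The two cases a) and b) can be observed with a small test program. This is only a sketch, assuming the behaviour described above; ShowCodePage and TCp1252String are hypothetical names:

```pascal
program ConcatCodePage;
{$mode objfpc}{$H+}

type
  // Hypothetical type for this sketch: an ansistring tagged with code page 1252
  TCp1252String = type AnsiString(1252);

// Hypothetical helper with a RawByteString parameter, like the changed RTL routines
procedure ShowCodePage(const Name: string; const S: RawByteString);
begin
  // StringCodePage reports the code page stored in the string header
  WriteLn(Name, ': code page ', StringCodePage(S));
end;

var
  U1, U2: UTF8String;
  W: TCp1252String;
begin
  U1 := 'abc';
  U2 := 'def';
  W := 'ghi';
  // Case a): same code page on both sides, so the result should keep it
  ShowCodePage('utf8 + utf8', U1 + U2);
  // Case b): mixed code pages, so the result should arrive as DefaultSystemCodePage
  ShowCodePage('utf8 + cp1252', U1 + W);
  WriteLn('DefaultSystemCodePage: ', DefaultSystemCodePage);
end.
```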

> I think we could actually introduce a global variable in the system
> unit that changes the behaviour of b) to "the result will have a code
> page that can represent all characters from the concatenated
> strings", which by default is off. Turning it on shouldn't even break
> most Delphi code, since when a parameter or variable is RawByteString
> then the code should be able to deal with any possible single byte
> code page anyway.

That makes no difference when CP_ACP already is UTF-8.

The only break can be increased accuracy, when FPC returns a lossless 
UTF-8 string, where Delphi would return a (lossy) CP_ACP string. Until 
that point all internal operations are lossless, using either the unique 
encoding of a single string parameter, or Unicode (UTF-8/16).

>> Delphi has overloaded functions for RawByteString and
>> AnsiString(0).
> 
> Does this really compile in Delphi?

It compiles, of course. Unfortunately I cannot test the outcome in 
detail, due to the broken UTF-8 implementation in my Delphi XE.

> Besides, utf8 *overloads* would be useless since even if the above
> compiles and has some sensible behaviour, such overloads would only
> be called if you pass in an utf8string. That already works correctly
> today. The problem is when passing in concatenations of strings.

The overloads are required so that the existing implementations are 
preferred over clumsy RawByteString or overkill UTF-16 versions.

> 
>> I'm not sure how efficient a RawByteString version ever can be. By
>> default it has to convert the string into Unicode (Delphi: UTF-16),
>> 
> 
> No, see Sven's answer.

You can safely remove the RawByteString versions if the 
compiler-generated conversions into UnicodeString and back afterwards 
do exactly the same.

But there exist special situations which essentially defeat the use of 
RawByteString arguments altogether. Delphi (XE) has several broken 
functions, which return wrong indices when passed strings *other* than 
the explicitly handled CP_ACP strings. All functions that return indices 
(lengths...) for the *original* string must *never* convert its 
encoding. I discussed this already a long time ago, when reporting my 
first experiences with Delphi XE.
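
To illustrate why an index computed on a converted copy is wrong for the original string, here is a small sketch. It only uses UTF8Encode and Pos from the system unit; the sample string is my own choice:

```pascal
program IndexMismatch;
{$mode objfpc}{$H+}

var
  U16: UnicodeString;
  U8: UTF8String;
begin
  // Build U+00E4 followed by '=1' without non-ASCII characters in the source
  U16 := WideChar($E4);
  U16 := U16 + '=1';
  // Same text, now encoded as UTF-8
  U8 := UTF8Encode(U16);
  // Pos counts code units: U+00E4 takes two bytes in UTF-8
  // but only one 16-bit unit in UTF-16, so the two indices differ
  WriteLn('index of "=" in the UTF-8 string:  ', Pos('=', U8));
  WriteLn('index of "=" in the UTF-16 string: ', Pos('=', U16));
end.
```

A routine that converts its argument internally and then returns the index from the converted copy hands the caller a position that does not match the string that was passed in.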

>> and the result back to CP_ACP. In these cases it looks more
>> efficient to call the Unicode version immediately, and leave
>> *eventual* further conversions to the compiler. Some routines may
>> implement common processing of true SBCS, but I'm not sure how many
>> these are.
> 
> Even if you are on a platform with UTF-16 system interfaces, if you
> call the routine with a single byte string, then calling the
> rawbytestring version will always be more efficient than the
> unicodestring version because the code size will be smaller (the
> conversions are inside the wrapper rather than everywhere in your
> code where you call this routine). These routines do not convert more
> than necessary.

Please tell me how a RawByteString can be handled *efficiently*, as soon 
as its encoding is *not* CP_ACP [or cpUTF8 if implemented explicitly].

>> Does there exist a FileSystemString type, for easy use in RTL and
>> application code?
> 
> No, and you should be completely oblivious to it. Adding new magic
> code pages or string types to the existing mess would not be helpful.

IMO it can make a big difference, when all filename strings in a program 
(variables, StringLists...) can have the exact type as used in API calls.

> Use utf8string or unicodestring if you want to ensure that you don't
> have data loss.

My point is the number of possible implicit conversions, not data loss.

> The DefaultFileSystemCodePage only exists to ensure
> that a) no data is lost between you giving a string to an RTL routine
> and calling a single byte OS API b) file names returned by single
> byte OS APIs are correctly handled/interpreted by the RTL and get
> their code page set correctly

So why do you only specify a codepage, without also declaring an 
accordingly encoded string type? Such a type would allow for automatic 
conversion (if ever required) of file/directory name arguments in RTL 
functions. Which string type would you use for such arguments?
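
Roughly what I have in mind, as a sketch only. TFileSystemString and OpenByName do not exist in the RTL, and the fixed CP_UTF8 is an assumption for a platform whose file system API takes UTF-8; a real type would have to follow DefaultFileSystemCodePage:

```pascal
program FileNameType;
{$mode objfpc}{$H+}

type
  // Hypothetical, not part of the RTL; assumes the file system API expects UTF-8
  TFileSystemString = type AnsiString(CP_UTF8);
  // Hypothetical legacy type for this sketch
  TCp1252String = type AnsiString(1252);

// Hypothetical stand-in for an RTL routine with a file name argument
procedure OpenByName(const FileName: TFileSystemString);
begin
  WriteLn('code page inside the routine: ', StringCodePage(FileName));
end;

var
  Legacy: TCp1252String;
  Native: TFileSystemString;
begin
  Legacy := 'data.txt';
  Native := 'data.txt';
  OpenByName(Legacy); // the compiler should insert the conversion at the call
  OpenByName(Native); // already the declared type, nothing to convert
end.
```

With such a type, application code that stores its file names as TFileSystemString would pass them through without any implicit conversion.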

DoDi