[fpc-devel] cpstrrtl/unicode branch merged to trunk

Fri Sep 6 23:13:58 CEST 2013

On 06 Sep 2013, at 13:54, Hans-Peter Diettrich wrote:

> Jonas Maebe schrieb:
>>  o merged cpstrrtl branch (includes unicode branch). In general, this adds
>>    support for arbitrarily encoded ansistrings to many routines related to
>>    file system access (and some others).
>>      WARNING: while the parameters of many routines have been changed from
>>    "ansistring" to "rawbytestring" to avoid data loss due to conversions,
>>    this is not a panacea. If you pass a string concatenation to such a
>>    parameter and not all strings in this concatenation have the same
>>    code page, all strings and the result will be converted to
>>    DefaultSystemCodePage (= ansi code page by default).
> 
> That conversion IMO is done by every concatenation, apart from subroutine considerations.

I think you mean "afaik" rather than "IMO". Anyway, the resulting code page of a concatenation is normally the code page of whatever you assign the string to (so if you assign to a utf8string, the resulting code page will be CP_UTF8). RawByteString is different in two ways:
a) if all concatenated strings have the same code page, the result also gets that code page
b) if there are different code pages involved, the result gets DefaultSystemCodePage
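
As a rough illustration of the rules above, here is a minimal sketch for FPC trunk (2.7.1 or later). The program name, the helper "report" and the variable names are made up for this example. All literals are plain ASCII, so no actual character data is at stake; on Unix you would additionally need a widestring manager (e.g. the cwstring unit) to convert non-ASCII data between code pages. It only prints the dynamic code pages so they can be compared against a), b) and DefaultSystemCodePage:

```pascal
program concatcp;
{$mode objfpc}{$h+}

{ Sketch only: "report" and all variable names are hypothetical. }

type
  tcp866 = type ansistring(866);

{ The rawbytestring parameter accepts any code page without converting
  a plain variable that is passed in. }
procedure report(const what: string; const s: rawbytestring);
begin
  writeln(what, ': code page ', StringCodePage(s));
end;

var
  u1, u2, utarget: utf8string;
  c: tcp866;
begin
  u1 := 'ab';
  u2 := 'cd';
  c := 'e';

  { Typed target: the result takes the declared code page of the target. }
  utarget := u1 + c;
  report('assigned to utf8string', utarget);

  { RawByteString target, case a): both operands have the same code page. }
  report('same code page', u1 + u2);

  { RawByteString target, case b): mixed code pages. }
  report('mixed code pages', u1 + c);

  writeln('DefaultSystemCodePage: ', DefaultSystemCodePage);
end.
```

If the last concatenation contained characters that DefaultSystemCodePage cannot represent, case b) is where they would get lost, which is what the WARNING in the commit message is about.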

I think we could actually introduce a global variable in the system unit, off by default, that changes the behaviour of b) to "the result will have a code page that can represent all characters from the concatenated strings". Turning it on should not even break most Delphi code, since when a parameter or variable is RawByteString, the code should be able to deal with any possible single byte code page anyway.

> Delphi has overloaded functions for RawByteString and AnsiString(0).

Does this really compile in Delphi?

type
  tcp866 = type ansistring(866);
  tutf8string = type ansistring(65001);

procedure test(const a: ansistring); overload;
begin
  writeln('ansistring');
end;

procedure test(const a: rawbytestring); overload;
begin
  writeln('rawbytestring');
end;

procedure test(const a: tutf8string); overload;
begin
  writeln('utf8string');
end;

var
  a: ansistring;
  b: tcp866;
begin
  test('abc');
  a:='ab';
  b:='c';
  test(a);
  test(b);
  test(a+b);
end.

Besides, utf8 *overloads* would be useless since even if the above compiles and has some sensible behaviour, such overloads would only be called if you pass in a utf8string. That already works correctly today. The problem is when passing in concatenations of strings.

> I'm not sure how efficient a RawByteString version ever can be. By default it has to convert the string into Unicode (Delphi: UTF-16),

No, see Sven's answer.

> and the result back to CP_ACP. In these cases it looks more efficient to call the Unicode version immediately, and leave *eventual* further conversions to the compiler. Some routines may implement common processing of true SBCS, but I'm not sure how many these are.

Even if you are on a platform with UTF-16 system interfaces, if you call the routine with a single byte string, then calling the rawbytestring version will always be more efficient than the unicodestring version because the code size will be smaller (the conversions are inside the wrapper rather than everywhere in your code where you call this routine). These routines do not convert more than necessary.

>>  + SetMultiByteFileSystemCodePage() procedure to override the value of
>>    DefaultFileSystemCodePage
>>  + ToSingleByteFileSystemEncodedFileName() function to convert a string to
>>    DefaultFileSystemCodePage (does *not* take care of OS-specific quirks like
>>    Darwin always returning file names in decomposed UTF-8)
>>  + support for CP_OEMCP
>>  * textrec/filerec now store the filename by default using widechar. It is
>>    possible to switch back to ansichars using the FPC_ANSI_TEXTFILEREC define.
>>    In that case, from now on the filename will always be stored in
>>    DefaultFileSystemCodePage
> 
> Does there exist a FileSystemString type, for easy use in RTL and application code?

No, and you should be completely oblivious to it. Adding new magic code pages or string types to the existing mess would not be helpful. Use utf8string or unicodestring if you want to ensure that you don't have data loss. The DefaultFileSystemCodePage only exists to ensure that
a) no data is lost between you giving a string to an RTL routine and calling a single byte OS API
b) file names returned by single byte OS APIs are correctly handled/interpreted by the RTL and get their code page set correctly
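
To make that concrete, here is a small sketch of what application code might look like under these assumptions: FPC trunk (2.7.1 or later) with the unicodestring overload of FileExists in sysutils, and 'example.txt' as a purely hypothetical file name. The call to ToSingleByteFileSystemEncodedFileName is only there to show what the RTL does on the way to a single byte OS API; normal code should not need it:

```pascal
program fscpsketch;
{$mode objfpc}{$h+}

uses
  sysutils;

var
  uname: unicodestring;
  raw: rawbytestring;
begin
  { Hypothetical file name; keeping it in a unicodestring (or utf8string)
    means nothing is lost before the RTL sees it. }
  uname := 'example.txt';

  { Just pass the string on; any conversion needed for the OS interface
    is the RTL's business. }
  if FileExists(uname) then
    writeln('found')
  else
    writeln('not found');

  { For illustration only: convert to the file system code page by hand
    and show which code page the result is tagged with. }
  raw := ToSingleByteFileSystemEncodedFileName(uname);
  writeln('converted name has code page ', StringCodePage(raw));
  writeln('DefaultFileSystemCodePage:   ', DefaultFileSystemCodePage);
  writeln('DefaultSystemCodePage:       ', DefaultSystemCodePage);
end.
```

As noted in the commit message, this conversion does not take care of OS-specific quirks such as Darwin returning file names in decomposed UTF-8.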

Jonas