[fpc-devel] cpstrrtl/unicode branch merged to trunk

Fri Sep 6 13:54:06 CEST 2013

Jonas Maebe wrote:
> Hi,
> 
> I've just merged the cpstrrtl/unicode branch into trunk. Below you can find the commit message, which describes most changes, the added features and also a very important warning.
> 
> 
> Jonas
> 
>   o merged cpstrrtl branch (includes unicode branch). In general, this adds
>     support for arbitrarily encoded ansistrings to many routines related to
>     file system access (and some others).
>     
>   WARNING: while the parameters of many routines have been changed from
>     "ansistring" to "rawbytestring" to avoid data loss due to conversions,
>     this is not a panacea. If you pass a string concatenation to such a
>     parameter and not all strings in this concatenation have the same
>     code page, all strings and the result will be converted to
>     DefaultSystemCodePage (= ansi code page by default).

IMO that conversion is done by every concatenation, regardless of 
whether the result is passed to a subroutine.

> In particular,
>     concatenating e.g. an Utf8String with a constant string and passing
>     the result to a RawByteString parameter will convert the result into
>     the DefaultSystemCodePage (unless the source code is compiled with
>     {$modeswitch systemcodepage} or {$mode delphiunicode} *and* the ansi
>     code page on the system you are compiling *on* happens to be UTF-8)
>     
>     You can define and use alternative routines that explicitly accept
>     Utf8String parameters to avoid this pitfall. Internally, all of these
>     routines ensure that they never trigger this condition and ensure that
>     no unnecessary/unwanted code page conversions occur.
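
A minimal sketch of the pitfall described in the warning, assuming an FPC 
version that includes this merge. Report is a hypothetical helper, not an 
RTL routine; it only prints the dynamic code page of its RawByteString 
argument. The first call passes the Utf8String unchanged; for the second 
call, the warning above says the concatenation may arrive converted to 
DefaultSystemCodePage.

```pascal
program CodePageSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: prints the dynamic code page of the string it receives.
procedure Report(const What: string; const S: RawByteString);
begin
  WriteLn(What, ': code page ', StringCodePage(S));
end;

var
  U: Utf8String;
begin
  U := 'data';
  WriteLn('DefaultSystemCodePage: ', DefaultSystemCodePage);
  // Passed as-is: a RawByteString parameter does not force a conversion.
  Report('plain Utf8String', U);
  // Mixed concatenation: compare the reported value with the line above.
  Report('concatenation', U + '.txt');
end.
```

Comparing the last two lines of output on a system whose ansi code page 
is not UTF-8 should show whether the concatenation was converted.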

Delphi has overloaded functions for RawByteString and AnsiString(0). FPC 
could add another Utf8String overload.

I'm not sure how efficient a RawByteString version can ever be. By 
default it has to convert the string into Unicode (Delphi: UTF-16), and 
the result back to CP_ACP. In these cases it looks more efficient to 
call the Unicode version immediately, and leave any further 
conversions to the compiler. Some routines may implement common 
processing of true SBCS, but I'm not sure how many of these there are.

>   + SetMultiByteFileSystemCodePage() procedure to override the value of
>     DefaultFileSystemCodePage
>   + ToSingleByteFileSystemEncodedFileName() function to convert a string to
>     DefaultFileSystemCodePage (does *not* take care of OS-specific quirks like
>     Darwin always returning file names in decomposed UTF-8)
>   + support for CP_OEMCP
>   * textrec/filerec now store the filename by default using widechar. It is
>     possible to switch back to ansichars using the FPC_ANSI_TEXTFILEREC define.
>     In that case, from now on the filename will always be stored in
>     DefaultFileSystemEncoding

Is there a FileSystemString type, for easy use in RTL and 
application code?

DoDi