[fpc-devel] UTF8 RTL
nc-gaertnma at netcologne.de
Wed Nov 19 11:39:29 CET 2014
On Wed, 19 Nov 2014 09:22:21 +0100
Jonas Maebe <jonas.maebe at elis.ugent.be> wrote:
> On 19/11/14 09:12, Marco van de Voort wrote:
> > In our previous episode, Jonas Maebe said:
> >>> As Jonas said, not using utf8 on Windows.
> >> No, that's not what I said. There is no problem with using UTF-8 on Windows.
> > As long as you explicitly use utf8string.
The RTL does not use UTF8String, so this would create a lot of codepage
checks and/or conversions.
For Lazarus there is a better solution:
The RTL on Windows now uses the "W" functions, and AnsiString and
ShortString are encoded in CP_ACP. Changing DefaultSystemCodePage
to CP_UTF8 does the trick: all UnicodeString to AnsiString assignments
then convert to UTF-8, and vice versa.
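A minimal sketch of that switch, assuming FPC 3.0 or later, where
SetMultiByteConversionCodePage and StringCodePage are part of the
System unit. The program name and the sample character are only
illustrative, and on Unix a widestring manager such as cwstring is
assumed to be needed for the conversion:

```pascal
program cpswitch;
{$mode objfpc}{$H+}
{$ifdef unix}
uses
  cwstring; // assumption: widestring manager needed for conversions on Unix
{$endif}

var
  U: UnicodeString;
  A: AnsiString;
begin
  // Sets DefaultSystemCodePage, so CP_ACP strings now mean UTF-8.
  SetMultiByteConversionCodePage(CP_UTF8);

  U := WideChar($20AC);   // EURO SIGN, one UTF-16 code unit
  A := AnsiString(U);     // UnicodeString -> AnsiString conversion

  WriteLn('DefaultSystemCodePage: ', DefaultSystemCodePage);
  WriteLn('Code page of A:        ', StringCodePage(A));
  WriteLn('Bytes in A:            ', Length(A)); // 3 if A holds UTF-8

  U := UnicodeString(A);  // and back again
  WriteLn('Code units in U:       ', Length(U));
end.
```

If the conversion really targets UTF-8, A should hold three bytes and
the round trip should give back a single code unit.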
This breaks code that still uses the non-Unicode functions or
reads/writes non-ASCII strings in the system code page, which is why
this change will be optional in Lazarus.
The Lazarus code itself didn't need any changes for this, so I guess
large parts of users' code will happily run as well. The one visible
difference is that aStringList.LoadFromFile now works with Unicode
paths, instead of only with characters from the system code page.
Users can then get rid of many UTF8 calls like UTF8ToAnsi or
FileExistsUTF8. This will make porting to the coming
UnicodeString RTL easier as well.
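A rough sketch of what that simplification could look like in user
code, assuming FPC 3.0 or later. The file name is hypothetical, and the
"old style" shown in the comment is only an approximation of typical
Lazarus code, not taken from any particular project:

```pascal
program loadunicode;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cwstring,{$endif} // assumption: needed for conversions on Unix
  Classes, SysUtils;

var
  W: UnicodeString;
  FileName: AnsiString;
  Lines: TStringList;
begin
  SetMultiByteConversionCodePage(CP_UTF8);

  // Hypothetical file name containing GREEK CAPITAL LETTER OMEGA.
  W := 'data_';
  W := W + WideChar($03A9);
  W := W + '.txt';
  FileName := AnsiString(W); // stored as UTF-8 because of the line above

  Lines := TStringList.Create;
  try
    // Old style, roughly:
    //   if FileExistsUTF8(FileName) then
    //     Lines.LoadFromFile(UTF8ToAnsi(FileName));
    if FileExists(FileName) then
      Lines.LoadFromFile(FileName);
    WriteLn('Lines read: ', Lines.Count);
  finally
    Lines.Free;
  end;
end.
```

The plain RTL calls take the same string that previously had to go
through the UTF8 wrappers.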
> An ansistring with a dynamic code page of UTF-8 will also work fine with
> the adapted RTL routines. What will of course not work is an ansistring
> containing UTF-8 data while its dynamic codepage is different from
> CP_UTF8 (that includes CP_ACP in case DefaultSystemCodePage happens to
> be CP_UTF8), such as the current Lazarus convention.
> That is however wrong on all platforms, not just on Windows (no one must
> ever assume that the system code page on a unix platform is UTF-8
> either; it's the same as assuming that the keyboard layout is qwerty or
> that the processor is little endian).
True. Although it is rare nowadays that someone uses non-ASCII
characters on a non-UTF-8 Unix, especially on graphical systems.
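The mismatch described in the quoted text above (UTF-8 bytes in a
string whose dynamic code page says something else) can be sketched
roughly like this. It assumes FPC 3.0 or later, and Windows-1252 is
picked only as an example of a non-UTF-8 code page:

```pascal
program cpmismatch;
{$mode objfpc}{$H+}
{$ifdef unix}
uses
  cwstring; // assumption: widestring manager needed for conversions on Unix
{$endif}

var
  S: RawByteString;
  U: UnicodeString;
begin
  // The three UTF-8 bytes of U+20AC (EURO SIGN).
  SetLength(S, 3);
  S[1] := #$E2;
  S[2] := #$82;
  S[3] := #$AC;

  // Wrong label: claim the bytes are Windows-1252, without converting.
  SetCodePage(S, 1252, False);
  U := UnicodeString(S); // each byte is decoded as its own character
  WriteLn('Labelled 1252:  ', Length(U), ' code unit(s)');

  // Correct label: same bytes, now declared as UTF-8.
  SetCodePage(S, CP_UTF8, False);
  U := UnicodeString(S);
  WriteLn('Labelled UTF-8: ', Length(U), ' code unit(s)');
end.
```

With the wrong label the conversion should yield three unrelated
characters; with the correct label it should yield the single euro
sign.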