[fpc-devel] Unicode resource strings
Jonas Maebe
jonas.maebe at elis.ugent.be
Tue Aug 21 12:00:51 CEST 2012
marcov wrote on Tue, 21 Aug 2012:
> In our previous episode, Mattias Gaertner said:
>>
>> For example under Linux file names are treated as UTF-8 but are only
>> bytes. They can and they do contain invalid UTF-8 characters.
>> If your program should support this, you must use a FindFirst
>> with UTF-8. To be clear: I don't say the default FindFirst under Linux
>> must be UTF-8, I only say, there must be one version with UTF-8, e.g.
>> FindFirstU8 and that must directly use the Linux file functions
>> without conversions.
>
> That's ugly indeed. Since that doesn't mean just an utf8 overload,

Since it's just raw bytes, it's actually as much UTF-8 as it is
Windows Latin-1.

> but that
> the entire internal trajectory behind that (searchrec inclusive) must be
> 1-byte without conversion. Or the 1-byte to utf16 and back conversion must
> be stable (invF(F(x)) = x).

Other frameworks also have to deal with this, and generally have a
particular default and allow the programmer (and sometimes the end
user) to override the default behaviour. E.g., glib assumes all file
names are UTF-8, but you can change this to "assume file names are
encoded in the current user's locale" or to "assume file names are
encoded using encoding XYZ" (either programmatically or via an
environment variable). Qt assumes they are encoded in the current
user's locale, but the programmer can change this to a different code
page (no environment variable). In practice, the default Qt and glib
behaviour is almost always the same on Linux nowadays, since UTF-8
locales are the default.
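
To make the "default plus override" pattern concrete, here is a minimal
sketch in Python (used only for illustration, this is not FPC, glib or Qt
code). The variable name FILENAME_ENCODING, the "@locale" token as used
here, and both function names are assumptions made up for this example.

```python
import locale


def pick_filename_encoding(environ, default="utf-8"):
    """Return the encoding to assume for file names.

    FILENAME_ENCODING is a hypothetical override variable; when it is
    absent, the framework-style default (UTF-8 here) is used.
    """
    value = environ.get("FILENAME_ENCODING")
    if not value:
        return default
    if value == "@locale":
        # "Assume file names are encoded in the current user's locale".
        return locale.getpreferredencoding(False)
    return value


def decode_name(raw, environ):
    """Decode a raw file name (bytes) using the selected encoding."""
    return raw.decode(pick_filename_encoding(environ))
```

With an empty environment, decode_name(b"caf\xe9", {}) raises
UnicodeDecodeError, because the byte sequence is not valid UTF-8. Passing
{"FILENAME_ENCODING": "latin-1"} makes the same call succeed. That is the
kind of failure the default alone cannot avoid.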

I'm not aware of a framework that allows you to say that file names
are just random bytes. It would probably be possible to implement this
in FPC by adding "support" for the invalid $FFFF code page (both in
ansistring and in unicodestring) and never converting anything if that
one is used (basically overwrite the destination string's codepage
with $FFFF if it's used by the source). Another option is not
supporting invalid file names in the cross-platform RTL interface
at all, so that you have to use platform-specific APIs to deal with
them on platforms that "support" such file names (as with glib and
Qt). A further option is adding "raw" overloads of such functions,
which could even accept and return arrays of byte rather than strings,
in order to avoid any accidental conversions and to make it clear
what you're dealing with.
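
The round-trip concern (invF(F(x)) = x) and the case for keeping names as
raw bytes can be sketched as follows. This is again Python used purely as
an illustration, not a proposal for the RTL; the helper name and the
sample file name are made up, and "surrogateescape" is Python's own
mechanism for carrying undecodable bytes through a string, mentioned only
as one example of a stable conversion.

```python
def roundtrip(raw, encoding, errors="strict"):
    """Decode raw bytes to a string and encode them back again."""
    return raw.decode(encoding, errors).encode(encoding, errors)


# A hypothetical file name containing bytes that are not valid UTF-8.
raw_name = b"report-\xff\xfe.txt"

# Replacing invalid bytes during decoding loses the original name.
lossy = roundtrip(raw_name, "utf-8", "replace")

# Latin-1 maps every byte to a code point, so the round trip is stable.
via_latin1 = roundtrip(raw_name, "latin-1")

# Python's escape mechanism also preserves the bytes through a string.
via_escape = roundtrip(raw_name, "utf-8", "surrogateescape")

print(lossy == raw_name)
print(via_latin1 == raw_name)
print(via_escape == raw_name)
```

Only the first conversion changes the name. A "raw" overload that takes
and returns the bytes unchanged avoids the question entirely, since no
conversion happens at all.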
Jonas