[fpc-devel] Unicode resource strings

Tue Aug 21 12:00:51 CEST 2012

marcov wrote on Tue, 21 Aug 2012:

> In our previous episode, Mattias Gaertner said:
>>
>> For example under Linux file names are treated as UTF-8 but are only
>> bytes. They can and they do contain invalid UTF-8 characters.
>> If your program should support this, you must use a FindFirst
>> with UTF-8. To be clear: I don't say the default FindFirst under Linux
>> must be UTF-8, I only say, there must be one version with UTF-8, e.g.
>> FindFirstU8 and that must directly use the Linux file functions
>> without conversions.
>
> That's ugly indeed. Since that doesn't mean just an utf8 overload,

Since it's just raw bytes, it's actually as much UTF-8 as it is
Windows Latin-1.

> but that
> the entire internal trajectory behind that (searchrec inclusive) must be
> 1-byte without conversion. Or the 1-byte to utf16 and back conversion must
> be stable (invF(F(x)) = x).

Other frameworks also have to deal with this, and generally have a  
particular default and allow the programmer (and sometimes the end  
user) to override the default behaviour. E.g., glib assumes all file  
names are UTF-8, but you can change this to "assume file names are  
encoded in the current user's locale" or to "assume file names are  
encoded using encoding XYZ" (either programmatically or via an  
environment variable). Qt assumes they are encoded in the current  
user's locale, but the programmer can change this to a different code  
page (no environment variable). In practice, the default Qt and glib  
behaviour is almost always the same on Linux nowadays, since UTF-8  
locales are the default.
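
As a rough illustration of such a "default plus override" policy, here
is a minimal sketch in Python (not FPC, chosen only because it is short
and easy to run). The helper names filename_encoding and decode_filename
are made up for this example and do not correspond to any glib, Qt or
FPC API. Invalid input is decoded with errors="replace" so that the
lossy case shows up as U+FFFD.

```python
import locale


def filename_encoding(override=None, use_locale=False):
    """Pick the encoding used to interpret file name bytes.

    Hypothetical policy: UTF-8 by default, the user's locale encoding
    on request, or an explicitly given encoding.
    """
    if override is not None:
        return override
    if use_locale:
        return locale.getpreferredencoding(False)
    return "utf-8"


def decode_filename(raw, override=None, use_locale=False):
    # errors="replace" turns undecodable bytes into U+FFFD, which makes
    # the information loss visible instead of raising an exception.
    return raw.decode(filename_encoding(override, use_locale), errors="replace")


raw = b"caf\xe9.txt"  # Latin-1 bytes; not valid UTF-8

print(ascii(decode_filename(raw)))                      # 'caf\ufffd.txt'
print(ascii(decode_filename(raw, override="latin-1")))  # 'caf\xe9.txt'
```

The same bytes give a different string depending on the policy, and
with the UTF-8 default the original byte can no longer be recovered
from the result.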

I'm not aware of a framework that allows you to say that file names  
are just random bytes. It would probably be possible to implement this  
in FPC by adding "support" for the invalid $FFFF code page (both in  
ansistring and in unicodestring) and never converting anything if that  
one is used (basically overwrite the destination string's codepage  
with $FFFF if it's used by the source). The other options are to not
support invalid file names in the cross-platform RTL interface at all
(so you have to use platform-specific APIs to deal with them on
platforms that "support" such file names, as with glib and Qt) and,
optionally, to add "raw" overloads of such functions, which could even
accept and return arrays of byte rather than strings in order to avoid
any accidental conversions and to make it clear what you're dealing
with.
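
To make the round-trip requirement quoted above concrete, here is a
minimal sketch, again in Python rather than FPC. The helper name
roundtrips is made up for this example. It checks whether converting
file name bytes to a string and back reproduces the original bytes,
which is the property a converting interface would need and which a raw
or pass-through variant gets for free.

```python
def roundtrips(raw, encoding, errors):
    """Return True if bytes -> str -> bytes reproduces the input exactly."""
    try:
        return raw.decode(encoding, errors).encode(encoding, errors) == raw
    except UnicodeError:
        # The conversion failed outright, so there is no round trip at all.
        return False


raw = b"caf\xe9.txt"  # a legal Linux file name, but not valid UTF-8

print(roundtrips(raw, "utf-8", "strict"))           # False: decoding fails
print(roundtrips(raw, "utf-8", "replace"))          # False: byte became U+FFFD
print(roundtrips(raw, "utf-8", "surrogateescape"))  # True: byte is preserved
print(roundtrips(raw, "latin-1", "strict"))         # True: every byte maps to a code point
```

Only the last two conversions are stable in the invF(F(x)) = x sense:
one smuggles the undecodable byte through the string unchanged, the
other treats every byte as a character of a 1-byte encoding.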

Jonas