[fpc-devel] Unicode resource strings

Marco van de Voort marcov at stack.nl
Tue Aug 21 11:39:35 CEST 2012


In our previous episode, Mattias Gaertner said:
> 
> IMO unicodestring should be the same on all platforms, because
> otherwise the character size switches per platform, which is hard to
> test and asking for trouble.

I think the bigger issue is what "string" will be when FPC code is
compiled in the modes that are now objfpc h+ and delphi.

And then especially anything you override or pass VAR strings to.

> The compiler already supports an UTF8String, right?
> If yes, then some functions can use UTF8String, some UnicodeString
> (=UTF-16) and the compiler magic will convert automatically.

With rawbytestring and unicodestring overloads. See the thread on fpc-pascal
from a few days back with the subject "rawbytestring".

> The difficult decision is what functions and types should use UTF-8
> and what UTF-16. This may depend on the platform.

The question is: if you fix the class hierarchy to one string type on all
platforms (to avoid problems with virtual/override and VAR), does it make
sense to divide the RTL in a fine-grained way over both encoding types?

That string type will be so dominant in practice that doing the RTL in a
different string type depending on platform won't be as useful.
 
> One problem is that an UTF-8/16 string can contain invalid characters
> making it impossible to convert.
> For example under Linux file names are treated as UTF-8 but are only
> bytes. They can and they do contain invalid UTF-8 characters.
> If your program should support this, you must use a FindFirst
> with UTF-8. To be clear: I don't say the default FindFirst under Linux
> must be UTF-8, I only say, there must be one version with UTF-8, e.g.
> FindFirstU8 and that must directly use the Linux file functions
> without conversions.

That's ugly indeed, since it doesn't mean just a utf8 overload, but that
the entire internal path behind it (searchrec included) must be 1-byte
without conversion. Or the conversion from 1-byte to utf16 and back must
be stable, i.e. invF(F(x)) = x.
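
The round-trip requirement can be sketched in a few lines. The example is in
Python rather than Pascal, purely for illustration, and says nothing about how
the FPC RTL would implement it. The byte string `raw_name` is a made-up file
name, and Python's `surrogateescape` error handler merely stands in for "a
conversion that is stable". A lossy conversion (`replace`) fails the
invF(F(x)) = x check, while the escaping one passes it.

```python
# Hypothetical Linux file name: the single byte 0xE9 (Latin-1 "e acute")
# is not a valid UTF-8 sequence on its own.
raw_name = b"caf\xe9.txt"


def roundtrip(data: bytes, errors: str) -> bytes:
    """Apply F (bytes -> text) and then invF (text -> bytes)."""
    text = data.decode("utf-8", errors)   # F
    return text.encode("utf-8", errors)   # invF


def strict_decode_fails(data: bytes) -> bool:
    """Return True when the bytes cannot be decoded as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# Strict decoding rejects the name outright.
print(strict_decode_fails(raw_name))

# "replace" substitutes U+FFFD, so the original byte is lost for good.
print(roundtrip(raw_name, "replace") == raw_name)

# "surrogateescape" smuggles the bad byte through and restores it.
print(roundtrip(raw_name, "surrogateescape") == raw_name)
```

Only the last conversion would let a FindFirst result be passed back to the
file system unchanged.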

> I guess there is no good solution for TStrings. Whatever string type is
> chosen, some programs will suffer.

tstrings will be "string". So whatever "string" is chosen for the OOP FPC
code (see first paragraph), that will be the declaration of tstrings.

But D2009 changed many streaming-related routines (load/save file/stream) to
add an encoding parameter with some default value. This decouples the tstrings
disk format from the memory format. Maybe that addresses your worry?
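
A minimal sketch of that decoupling, in Python purely for illustration. The
names `save_lines` and `load_lines` and the UTF-8 default are assumptions of
this sketch, not the Delphi 2009 TStrings API. The point is only that the
in-memory list stays the same while the bytes on disk depend on the encoding
argument.

```python
import os
import tempfile


def save_lines(lines, path, encoding="utf-8"):
    """Write one string per line; the encoding only affects the disk format."""
    with open(path, "w", encoding=encoding, newline="\n") as f:
        for line in lines:
            f.write(line + "\n")


def load_lines(path, encoding="utf-8"):
    """Read the file back into the same in-memory string type."""
    with open(path, "r", encoding=encoding, newline="\n") as f:
        return f.read().splitlines()


def read_bytes(path):
    with open(path, "rb") as f:
        return f.read()


# Sample data with non-ASCII characters, written as escapes.
lines = ["Gr\u00fc\u00dfe", "na\u00efve"]

with tempfile.TemporaryDirectory() as tmp:
    p8 = os.path.join(tmp, "utf8.txt")
    p16 = os.path.join(tmp, "utf16.txt")

    save_lines(lines, p8)                        # default encoding
    save_lines(lines, p16, encoding="utf-16-le")  # explicit encoding

    different_on_disk = read_bytes(p8) != read_bytes(p16)
    same_in_memory = (
        load_lines(p8) == lines
        and load_lines(p16, encoding="utf-16-le") == lines
    )

print(different_on_disk, same_in_memory)
```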




