Summary on Re: [fpc-pascal] Unicode file routines proposal

Tue Jul 1 13:28:32 CEST 2008

I read most of the discussion and I think there is no way around a
string type containing an encoding field. First, it also allows us to
support non-UTF encodings or UTF-32. Having the encoding field does not
mean that every target supports every encoding. If an encoding is not
supported, the target might either fall back to some default operation,
as the current widestring manager does, or raise an exception. Such a
string type requires a manager which not only stores the procedures
that handle this string type but also records which encoding to prefer
or even to use exclusively. Combined with a few ifdefs and compiler
switches, this makes the approach very flexible and fast and allows
everybody (FPC people, Lazarus, MSE) to adapt things to their needs.
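
To make the manager idea a bit more concrete, here is a rough sketch in
Python rather than Pascal, just to show the shape. All names
(StringManager, register, length, utf8_length) are made up for
illustration and are not part of any FPC API. The manager keeps one
routine per encoding, remembers the preferred encoding, and either
falls back to a default routine or raises when an encoding is missing.

```python
class StringManager:
    """Hypothetical manager: routines per encoding plus a preferred encoding."""

    def __init__(self, preferred, default=None):
        self.preferred = preferred   # encoding this target favours
        self.default = default       # fallback routine, or None to raise
        self.routines = {}           # encoding name -> length routine

    def register(self, encoding, routine):
        self.routines[encoding] = routine

    def length(self, data, encoding):
        # Use the routine for this encoding, else the default operation.
        routine = self.routines.get(encoding, self.default)
        if routine is None:
            raise NotImplementedError(f"encoding {encoding!r} is not supported")
        return routine(data)


def utf8_length(data):
    # Count code points: every byte that is not a continuation byte (10xxxxxx).
    return sum(1 for b in data if b & 0xC0 != 0x80)
```

A target that wants the strict behaviour leaves default unset; a target
that prefers the lenient behaviour passes some fallback routine.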

Just an example: to overcome the indexing problem efficiently when using
an encoding field (this is not about surrogates), we could do the
following: introduce a compiler switch {$unicodestringindex
default,byte,word,dword}. In default mode the compiler gets a shift
value from the encoding field (this is 4 bytes anyway and could be
split into 1 byte shift, 2 bytes encoding, 1 byte reserved). In the
other modes the compiler uses the given size when indexing. For example,
a Tiburon switch could set this to word.
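
The indexing scheme can be sketched as follows, again in Python only to
illustrate the arithmetic. The encoding ids, the byte order of the
field and the function names are assumptions; the proposal fixes only
the 1/2/1 byte split. Indexes are 0-based here, unlike Pascal strings.

```python
import struct

# Hypothetical encoding ids; the proposal does not define any numbering.
ENC_UTF8, ENC_UTF16, ENC_UTF32 = 1, 2, 3
# Shift per encoding: log2 of the code unit size in bytes.
SHIFT = {ENC_UTF8: 0, ENC_UTF16: 1, ENC_UTF32: 2}


def pack_field(encoding):
    """Pack the 4-byte field: 1 byte shift, 2 bytes encoding, 1 byte reserved."""
    return struct.pack("<BHB", SHIFT[encoding], encoding, 0)


def unit_at(field, payload, index, mode="default"):
    """Return code unit number `index` (0-based) from the raw payload bytes."""
    if mode == "default":
        shift = struct.unpack("<BHB", field)[0]            # read at run time
    else:
        shift = {"byte": 0, "word": 1, "dword": 2}[mode]   # fixed by the switch
    start = index << shift
    return int.from_bytes(payload[start:start + (1 << shift)], "little")
```

In default mode the shift costs one extra load per index operation; in
the byte, word and dword modes it is a compile-time constant.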

The approach has the big advantage that you really need every procedure
only once if desired. For example, Linux would get only UTF-8 routines
by default; UTF-16 is converted to UTF-8 at the entry of the helper
procedures if needed. Usually no conversion would be necessary because
you seldom see UTF-16 in Linux applications, so only the check whether
the input strings are really UTF-8 is needed. This is very cheap
because the data is already in a cache line anyway.
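
A sketch of such a helper entry, in Python for illustration only. It
assumes the check is a plain comparison of the encoding tag, and the
names to_utf8_at_entry and utf8_pos are hypothetical.

```python
def to_utf8_at_entry(data, encoding):
    """Return UTF-8 bytes; convert only when the tag says it is not UTF-8."""
    if encoding == "utf-8":
        return data                      # common case: tag check only
    return data.decode(encoding).encode("utf-8")


def utf8_pos(haystack, hay_enc, needle, needle_enc):
    """Hypothetical helper: byte offset of needle in haystack, or -1."""
    hay = to_utf8_at_entry(haystack, hay_enc)
    ndl = to_utf8_at_entry(needle, needle_enc)
    return hay.find(ndl)                 # the only search routine needed
```

The search itself exists only once, for UTF-8; everything else is
normalised before it is reached.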

Even more, this variable encoding approach also allows people using
languages where UTF-8 is more memory expensive than UTF-16 (in numbers,
this is the majority of mankind) to use UTF-8 or UTF-16 as needed to
save memory, with only a few modifications.
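
The memory difference is easy to check. The small Python sketch below
(the function name cheaper_encoding is made up) compares the encoded
sizes of a text in both encodings.

```python
def cheaper_encoding(text):
    """Return the smaller encoding and both sizes in bytes (no BOM)."""
    u8 = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-le"))
    return ("utf-8" if u8 <= u16 else "utf-16"), u8, u16
```

For "hello" this gives 5 bytes in UTF-8 against 10 in UTF-16, while
for "日本語" it gives 9 bytes in UTF-8 against 6 in UTF-16.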

I know this approach contains some hacks and requires some work, but I
think this is the only way to solve things once and for all.