Summary on Re: [fpc-pascal] Unicode file routines proposal

Marco van de Voort marcov at stack.nl
Tue Jul 1 14:42:58 CEST 2008


OK, a quick pointwise comment then.

> I read most of the discussion and I think there is no way around a
> string type containing an encoding field. First, it also allows
> supporting non-utf encodings or utf-32 encoding. Having the encoding
> field does not mean that all targets support all encodings. In case an
> encoding is not supported, the target might either use some default
> operation as the current widestring manager does, or it might spit out
> an exception. Such a string type requires a manager which not only
> stores the procedures to handle this string type but also contains
> information about which encoding to prefer or even use exclusively.
> Combining this with several ifdefs and compiler switches makes this
> approach very flexible and fast and allows everybody (FPC people,
> Lazarus, MSE) to adapt things to their needs.

I don't like the runtime nature. At all. I want to be able to say "hey look,
I have a bunch of units here, and they only accept utf-16 (e.g. because they
were ported from Tiburon code). Convert if necessary."

So we need at least one directive in that case, one that says "all
unicodestrings under this directive are in encoding type <n>, convert if
necessary".
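A sketch of what such a directive could look like. None of this exists in FPC;
the directive name and the implicit-conversion behaviour are invented purely to
illustrate the "fixed encoding per unit, convert at the boundary" idea:

```pascal
{ Hypothetical directive: every unicodestring in this unit is utf-16. }
{$UNICODESTRINGENCODING UTF16}

unit PortedTiburonUnit;

interface

// Inside this unit the compiler would treat every unicodestring as
// utf-16, so indexing is a fixed 2-byte element size at compile time.
// A caller passing a string in another encoding would get an implicit
// conversion inserted at the call site.
procedure ProcessText(const s: UnicodeString);

implementation

procedure ProcessText(const s: UnicodeString);
begin
  // utf-16-only code here, no runtime encoding checks needed.
end;

end.
```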

> Just an example: to overcome the indexing problem efficiently when using
> an encoding field (this is not about surrogates), we could do the
> following: introduce a compiler switch {$unicodestringindex
> default,byte,word,dword}. In default mode the compiler gets a shifting
> value from the encoding field (this is 4 bytes anyway and could be
> split into 1 byte shifting, 2 bytes encoding, 1 byte reserved). In the
> other modes the compiler uses the given size when indexing. For example,
> a Tiburon (or whatever it is called) switch could set this to word.
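The 4-byte field split described in the quote might be sketched as a record
like the following (the type and field names are invented; this is only an
illustration of the proposed layout):

```pascal
type
  // Hypothetical per-string header field: 4 bytes, split as proposed.
  TUnicodeStringEncodingField = packed record
    Shift: Byte;      // log2 of the element size: 0 = byte (utf-8),
                      // 1 = word (utf-16), 2 = dword (utf-32)
    Encoding: Word;   // encoding / code-page identifier
    Reserved: Byte;   // unused for now
  end;
```

In "default" mode, indexing element i would then compute the address as
base + (i shl Shift), i.e. one extra load and shift per index operation.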

I don't understand how this can work: how can I have a compile-time solution
for a runtime problem?

procedure mystringproc(s: FlorianUnicodeString);

begin
  if encodingof(s) = utf16 then
    begin
      // utf-16 code here, [] needs shift size 2
    end
  else
    begin
      // utf-8 code here, [] needs shift size 1
    end;
end;

> The approach has the big advantage that you really need all procedures
> only once if desired. For example, linux would get only utf-8
> routines by default; utf-16 is converted to utf-8 at the entry of the
> helper procedures if needed. Usually no conversion would be necessary,
> because you seldom see utf-16 in linux applications, so only the check
> whether the input strings are really utf-8 is needed; this is very
> cheap because the data is already in a cache line anyway.

> Even more, this variable encoding approach also allows people using
> languages where utf-8 is more memory-expensive than utf-16 (numerically
> the majority of mankind) to use utf-8/utf-16 as needed to save
> memory, with only a few modifications.
> 
> I know this approach contains some hacks and requires some work, but I
> think this is the only way to solve things once and for all.
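The entry-point check described in the quote could look roughly like this
(a sketch only: FlorianUnicodeString, EncodingOf, CP_UTF8 and ConvertToUtf8
are all invented names for the proposed machinery):

```pascal
procedure AppendHelper(var dest: FlorianUnicodeString;
                       src: FlorianUnicodeString);
begin
  // Fast path: the encoding field lives next to the string data, so
  // this comparison normally reads an already-loaded cache line.
  if EncodingOf(src) <> CP_UTF8 then
    src := ConvertToUtf8(src);  // slow path, rarely taken on linux
  // ... actual utf-8 append code follows, written only once ...
end;
```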

I wonder if having 2 or 3 (utf-8, utf-16 and maybe utf-32) straight, simple
unicode types isn't easier than this polymorphic beast.

At least then you have one procedure, one encoding, and since they are all
guaranteed to convert to each other (and to COM strings too), the conversion
code might not be as painful as when ansistring and widestring were
introduced. It could be parameterisable in the compiler. With the added
advantage of compile-time decisions.
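With distinct types, the dispatch that the FlorianUnicodeString example above
does at runtime moves to compile time: the compiler picks the overload, or
inserts a conversion, from the declared type alone. A sketch with invented
type names (FPC's actual declarations differ):

```pascal
type
  Utf8String  = type AnsiString;    // illustrative distinct types;
  Utf16String = type WideString;    // names and bases are assumptions

procedure mystringproc(const s: Utf8String); overload;
begin
  // utf-8-only code; [] always has element size 1, no runtime check
end;

procedure mystringproc(const s: Utf16String); overload;
begin
  // utf-16-only code; [] always has element size 2, no runtime check
end;

// A Utf16String passed where only the utf-8 overload exists would be
// converted by compiler-inserted code at the call site, much as happens
// today between ansistring and widestring.
```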


