[fpc-devel] Unicode and UTF8String

Mon Dec 1 15:47:49 CET 2008

In our previous episode, Martin Friebe said:
> > most cases, it is slowly time to abandon too simplistic thinking about
> > strings. The best solution is to minimize editing, and localize them in
> > certain parts of the code, keeping most of the code encoding agnostic.
> >   
> True, too. But we are talking Pascal, not some other language. 
> string[index], copy, pos, length have always been part of Pascal.

So keep using ansistring? It doesn't change.

> Of course they are still there, to be used in the few parts of your 
> code, where you specialize on whatever string type you deal with.
> But otherwise, using  RTLString  IMHO will abandon this part of pascal 
> syntax.

It removes ASCII legacy. I don't see you complaining about the fact that
char is not 8 bit anymore, even though that also abandons part of the Pascal
syntax.

> A function of which the result can not be used, as it can 
> change at compile time => such a function can not be used. (or we will 
> have buffer overflows, code injection and more ...)

Hence my suggestion to minimize this functionality.

> I admit that the problem (and that has been discussed more than 
> enough) starts with UTF8string (yes, even with a UTF-16 string). But in this 
> case those functions took on a new, but predictable meaning. I can do 
> utf8string[1], and I can use the result. Only I have to be aware what it 
> means.

Yes. As widestring[1] also requires interpretation. That's unicode.
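
A minimal sketch of what "requires interpretation" means here, assuming a
current FPC in objfpc mode. The helper names Sample and CodePointCount are
made up for this illustration and are not RTL routines. Indexing returns
code units in both cases: bytes for UTF8String, 16-bit units for WideString.

```pascal
program CodeUnits;
{$mode objfpc}{$H+}

// Hypothetical helper: builds "é" (U+00E9) followed by "a" from raw UTF-8
// bytes, so the example does not depend on the source file's code page.
function Sample: UTF8String;
begin
  SetLength(Result, 3);
  Result[1] := #$C3;
  Result[2] := #$A9;
  Result[3] := 'a';
end;

// Hypothetical helper: counts code points by skipping UTF-8 continuation
// bytes (bit pattern 10xxxxxx). Assumes the input is valid UTF-8.
function CodePointCount(const S: UTF8String): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  U8: UTF8String;
  W: WideString;
begin
  U8 := Sample;
  WriteLn(Length(U8));          // 3: bytes (code units), not characters
  WriteLn(Ord(U8[1]));          // 195 ($C3): lead byte of a two-byte sequence
  WriteLn(CodePointCount(U8));  // 2: code points

  // U+1F600 as a UTF-16 surrogate pair: one code point, two code units.
  SetLength(W, 2);
  W[1] := WideChar($D83D);
  W[2] := WideChar($DE00);
  WriteLn(Length(W));           // 2
  WriteLn(Ord(W[1]));           // 55357 ($D83D): a high surrogate
end.
```

Either way, [1] gives you a code unit, and what it means depends on what
follows it.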

> I can *not* do rtlString[1], as at the time of code writing I can not be 
> aware what it means.

You don't have to. You carry it around as long as you can, and when you
can't, you assign it to your type of choice and pay the penalty.
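
A rough sketch of that pattern, under the assumption that the build-dependent
type behaves like an ordinary string alias. RTLString is only a proposal in
this thread, so the alias TOpaqueText and the functions Greeting and
FirstCodeUnit below are hypothetical stand-ins, not existing RTL names.

```pascal
program ConvertAtEdge;
{$mode objfpc}{$H+}

type
  // Hypothetical stand-in for the build-dependent string type; in this
  // sketch it is simply an alias that could be changed in one place.
  TOpaqueText = AnsiString;

// Encoding-agnostic code: it only passes the value along and concatenates,
// and never looks at individual elements.
function Greeting(const Name: TOpaqueText): TOpaqueText;
begin
  Result := 'Hello, ' + Name;
end;

// The one routine that needs per-element access converts to a known type
// first. This conversion is where the penalty is paid.
function FirstCodeUnit(const S: TOpaqueText): Word;
var
  W: WideString;
begin
  W := WideString(S);
  if Length(W) > 0 then
    Result := Ord(W[1])
  else
    Result := 0;
end;

begin
  WriteLn(Greeting('world'));
  WriteLn(FirstCodeUnit(Greeting('world')));
end.
```

Only FirstCodeUnit has to know a concrete string type; everything else keeps
compiling unchanged if the alias is pointed at something else.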

Delaying that as long as possible avoids excessive penalties, and avoiding
those is IMHO as much a part of the Pascal language. Paying them everywhere
would hurt its general-purpose nature by turning it into Basic (and then I
mean the real Basics, not the C-with-Basic-syntax that is FreeBasic), or
worse: Excel.

> It is only decided at compilation time. IFDEFs won't help either,
> because they can only cope with the set of string types known at the time
> the code is written.  This breaks each time FPC is extended.

Any transition as big as ASCII -> Unicode will break things. We have had
these discussions before, but avoiding all pitfalls is simply too costly,
and doing so would break other Pascal traditions.

>  > and localize them in
>  > certain parts of the code, keeping most of the code encoding agnostic.
> Sorry I can't help taking that into another direction, (which also has 
> been discussed before). The above quote sounds like a sentence from a 
> introduction into  "object orientation". 

It is an introduction to abstraction maybe. I don't see the OO in there.

> It is right for OO. So it should be right for strings as well.
> Just again, it simply will be a new language, with a string object, 
> instead of Pascal.

This is all gibberish to me. I never said OO, and never will.

> > And yes, if you are lazy, you lose performance due to automatic conversions. It
> > has always been that way (also when mixing short and ansistring)
> >   
> In other words, write pascal code, just do not use some of the (imho) 
> most common elements of pascal syntax?

There is no "just". Strings simply get more complicated if you go unicode,
and that can't be hidden. Either you stay with safe ASCII strings, or you
use Unicode. If you do the latter, you will have to adapt anyway.

And top-heavy emulation layers are not Pascal-like either.

> I acknowledge a language is a living thing, and needs to be adjusted to 
> the new things, that come up over time. I only ask, if this is the best way?

IMHO there is not even a choice, since there simply is no viable
alternative.