[fpc-devel] Unicode and UTF8String

Martin Friebe fpc at mfriebe.de
Mon Dec 1 15:30:28 CET 2008


Marco van de Voort wrote:
> In our previous episode, Martin Friebe said: 
>   
>> I agree, using RTLString will probably help fpc to optimize your exe for 
>> each OS.
>>
>> But, using RTLString means you do not know, if you have UTF8 or not. 
>>     
> Correct.
>   
>> Because UTF8 behaves slightly differently from other strings, many 
>> operations can not be performed on RTLString
>>
>> foo[1], copy, pos ... simply because you do not know, if the result is a 
>> char, a codepoint or a subcodepoint (single utf8 byte)
>>     
> You don't know that about UTF-16 either. Even though that is no problem in
>   
True, good point
> most cases, it is slowly time to abandon too simplistic thinking about
> strings. The best solution is to minimize editing, and localize them in
> certain parts of the code, keeping most of the code encoding agnostic.
>   
True, too. But we are talking Pascal, not some other language. 
string[index], copy, pos, length have always been part of Pascal.

Of course they are still there, to be used in the few parts of your 
code where you specialize on whatever string type you deal with.
But otherwise, using RTLString will IMHO mean abandoning this part of 
Pascal syntax. If the meaning of a function's result can change at 
compile time, that function can not be used safely (or we will have 
buffer overflows, code injection and more ...).

I admit that the problem (which has been discussed more than enough) 
starts with UTF8String (yes, even with a UTF-16 string). But in that 
case those functions take on a new, but predictable, meaning. I can do 
utf8string[1], and I can use the result. I only have to be aware of what 
it means.
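
As a minimal sketch of that point (not taken from the thread), the
program below shows what indexing returns for the two fixed encodings.
It assumes a Free Pascal version that provides both UTF8String and
UnicodeString; the program and variable names are made up for
illustration. The strings are filled element by element so that no code
page conversion of a literal is involved.

```pascal
program IndexMeaningDemo;
{$mode objfpc}{$H+}

var
  U8: UTF8String;
  U16: UnicodeString;
begin
  // U+00E4 (a-umlaut) followed by 'b': two UTF-8 bytes plus one.
  SetLength(U8, 3);
  U8[1] := #$C3;
  U8[2] := #$A4;
  U8[3] := 'b';
  // Length counts bytes; U8[1] is only the first byte of the umlaut.
  WriteLn('UTF-8 length in bytes:  ', Length(U8));
  WriteLn('U8[1] as a number:      ', Ord(U8[1]));

  // U+1F600 as a UTF-16 surrogate pair: one code point, two code units.
  SetLength(U16, 2);
  U16[1] := WideChar($D83D);
  U16[2] := WideChar($DE00);
  // Length counts 16-bit units; U16[1] is only the high surrogate.
  WriteLn('UTF-16 length in units: ', Length(U16));
  WriteLn('U16[1] as a number:     ', Ord(U16[1]));
end.
```

In both cases the indexed element is not a full character, but it is
always the same kind of thing: a byte in the first case, a 16-bit code
unit in the second.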

I can *not* do rtlString[1], as at the time of writing the code I can 
not be aware what it means. It is only decided at compilation time. 
IFDEFs won't help either, because they can only cope with the set of 
string types known at the time the code is written. This breaks each 
time FPC is extended.
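
A rough sketch of the situation, under loud assumptions: RTLString is
only a proposal in this thread, so the alias below is hypothetical, and
the choice of UnicodeString on Windows and UTF8String elsewhere is
invented for illustration. It also assumes a Free Pascal version that
provides UnicodeString.

```pascal
program RtlStringSketch;
{$mode objfpc}{$H+}

type
  // Hypothetical: stands in for a string type the compiler picks per OS.
  {$IFDEF WINDOWS}
  RTLString = UnicodeString;   // elements are UTF-16 code units
  {$ELSE}
  RTLString = UTF8String;      // elements are UTF-8 bytes
  {$ENDIF}

var
  S: RTLString;
begin
  S := 'abc';
  // The element size, and so the meaning of S[1], depends on the target.
  WriteLn('Bytes per element of S: ', SizeOf(S[1]));
  WriteLn('Length(S) in elements:  ', Length(S));
end.
```

The same source line S[1] yields a byte on one target and a 16-bit unit
on the other, and the IFDEF above covers only the two cases known when
it was written. A third string type added later would need every such
IFDEF to be revisited.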

 > and localize them in
 > certain parts of the code, keeping most of the code encoding agnostic.
Sorry, I can't help taking that in another direction (which has also 
been discussed before). The above quote sounds like a sentence from an 
introduction to "object orientation". Sure, it is the right thing. 
It is right for OO, so it should be right for strings as well.
But again, it will simply be a new language, with a string object, 
instead of Pascal.

> And yes, if you are lazy, you lose performance due to automatic conversions. It
> has always been that way (also when mixing short and ansistring)
>   
In other words, write Pascal code, just do not use some of the (IMHO) 
most common elements of Pascal syntax?
I acknowledge that a language is a living thing and needs to be adjusted 
to the new things that come up over time. I only ask whether this is the 
best way.

> This is not just a good thing for OS interfacing code, but a good thing in
> general.