[fpc-devel] assign constant text to widestring

Michael Schnell mschnell at lumino.de
Wed Oct 22 15:36:22 CEST 2008


Hi Experts,

There has been a long-winded discussion on this in the "German Lazarus 
Forum", and I have been very dissatisfied with the result.

Maybe this has already been discussed in one of the "Unicode" threads 
here, but I did not follow all of them down to the last twig and leaf, 
so I am starting a new thread in the hope of a more comprehensive result.

When using UTF8String I found that if s is a UTF8String containing 
"ö2", length(s) is 3 and s[3] is "2". Obviously, a UTF8String's content 
is counted in 8-bit code units and not in "visible" characters. 
While I don't like this "un-String-like" behavior at all, I am aware 
that this is by design to guarantee decent speed.
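
The counting described above can be reproduced outside the compiler. The following is a minimal sketch in Python, not FPC code; it only models the byte-level arithmetic, and the variable names are made up for illustration.

```python
# "ö2" written with an escape so the example does not depend on the
# encoding of the source file itself.
text = "\u00f62"
utf8 = text.encode("utf-8")   # bytes C3 B6 32

char_count = len(text)        # counts code points
byte_count = len(utf8)        # counts 8-bit code units
third_byte = chr(utf8[2])     # Pascal's 1-based s[3] is index 2 here

print(char_count, byte_count, third_byte)
```

The "ö" takes two bytes in UTF-8, which is why the "2" ends up in the third position.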

But happily we don't need to use UTF8Strings to handle Unicode, as we do 
have WideStrings, which suffer from this odd behavior only when we try 
to store very unusual characters (Unicode code points above $FFFF) using 
"surrogate pairs". I feel that I am very unlikely to ever need to do this.
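
To illustrate the surrogate pair case, here is a small Python sketch (again only a model, not FPC code) that counts the 16-bit units a UTF-16 string needs. The helper name utf16_units and the sample character U+1D11E are chosen for illustration only.

```python
def utf16_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

bmp_text = "\u00f62"   # "ö2", both code points are below $FFFF
clef = "\U0001D11E"    # a single code point above $FFFF

print(utf16_units(bmp_text))  # one unit per character
print(utf16_units(clef))      # one character, but a surrogate pair
```

As long as all characters stay at or below $FFFF, the number of 16-bit units equals the number of characters.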

So I did some tests with WideStrings and found strange things with them, 
too. While some of them are Lazarus issues, one quite obviously is 
introduced by the compiler.

When I want to simply assign the constant text "ö2" to a WideString, I 
would expect to just write s := 'ö2'; . But I found that this does not 
work: it creates a WideString of length 3 that contains the three 8-bit 
code units of the UTF-8 encoded string "ö2", each zero-extended to 16 
bits and stored in one WideChar element. For me this is very surprising 
and incompatible with the same code (s := 'ö2'; ) in a Turbo-Delphi program.
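
The observed content can be modelled as follows. This is a Python sketch of the effect described above, under the assumption that each UTF-8 byte is simply widened to one 16-bit unit; it does not call or reproduce any FPC routine.

```python
utf8 = "\u00f62".encode("utf-8")           # bytes C3 B6 32

# Each byte becomes one 16-bit unit with the upper byte set to zero.
widened = [int(b) for b in utf8]
as_hex = ["%04X" % unit for unit in widened]

# What one would expect instead: one unit per character.
expected = ["%04X" % ord(c) for c in "\u00f62"]

print(as_hex)
print(expected)
```

The widened form has three units, the expected form has two, which matches the length of 3 reported above.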

Obviously - unlike Turbo-Delphi, which uses ANSIString here - a 
constant string gets UTF8String as its intermediate type. This might be 
a useful definition, but if it is done this way, why does an assignment 
WideString := UTF8String not implicitly call UTF8Decode as a type 
conversion? In my example it calls fpc_ansistr_to_widestr instead, 
just as if the UTF8String were an ANSIString.
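
The difference between the two conversion paths can be sketched like this. The Python functions below are hypothetical stand-ins: decode_utf8 models what a UTF-8 aware conversion such as UTF8Decode would be expected to yield, and widen_bytes models the byte-by-byte result I observed; neither is the actual RTL code.

```python
def widen_bytes(raw):
    """Treat every byte as its own character (the observed result)."""
    return "".join(chr(b) for b in raw)

def decode_utf8(raw):
    """Interpret the bytes as UTF-8 (the expected result)."""
    return raw.decode("utf-8")

raw = "\u00f62".encode("utf-8")

observed = widen_bytes(raw)
wanted = decode_utf8(raw)
print(len(observed), len(wanted))
```

Only the UTF-8 aware path gives back the original two characters; the other one yields three characters starting with "Ã".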

Is there some compiler setting to change this?

-Michael
 


