[fpc-devel] new string - question on usage

Thu Oct 13 12:07:45 CEST 2011

Paul Ishenin wrote:

>> What's CP_NONE? Value and purpose?
> 
> RawByteString codepage. Its value is $FFFF and its purpose is to indicate
> that the string has no codepage assigned. I think the compiler no longer
> produces strings with codepage $FFFF, although it used to. So we can
> probably remove these codepage checks from the RTL now.

Thanks :-)

>> It turned out that the result is only correct when at least one of the
>> strings is a UnicodeString. Otherwise Pos seems to end up in a
>> RawByteString compare, with the encoding ignored.
> 
> That's because when one argument is a UnicodeString, a different Pos()
> overload is used. In that case the second, RawByteString argument is
> converted to UnicodeString, taking its encoding into account.

Pos accepts only strings of the same type, with AnsiStrings (any 
codepage) being passed as RawByteStrings. When one argument is a 
UnicodeString, the other argument is converted to Unicode as well. This 
again is a source of trouble, because
   pos(string(s1251), s866)
will return the index in the *Unicode* string, into which s866 is 
implicitly converted :-(

The following test also tends to fail:
   i := pos(string(s1251), sUtf8);
   rest := Copy(s866, i+Length(sUtf8), 10);
The first bug is the index, which is wrong when sUtf8 contains
multi-byte (MBCS) characters. The second bug is that the length of the
substring may differ between cp_866 and cp_UTF8.
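
To make the index problem concrete, here is a minimal sketch, assuming a
current FPC in which SetCodePage and CP_UTF8 are available. The program name,
variable names and sample text are made up for illustration. The strings are
written as raw bytes and only tagged with a codepage, so no conversion is
involved:

```pascal
program IndexMismatch;
{$mode objfpc}{$H+}

var
  s866, sUtf8: RawByteString;
  i866, iUtf8: SizeInt;
begin
  // The same text, two Cyrillic letters followed by '-x', as raw bytes.
  s866 := #$A4#$A0'-x';                // cp866: one byte per Cyrillic letter
  SetCodePage(s866, 866, False);       // tag only, do not convert
  sUtf8 := #$D0#$B4#$D0#$B0'-x';       // UTF-8: two bytes per Cyrillic letter
  SetCodePage(sUtf8, CP_UTF8, False);  // tag only, do not convert

  // Pos and Length count bytes, so the results depend on the encoding.
  i866 := Pos('-', s866);
  iUtf8 := Pos('-', sUtf8);
  WriteLn('cp866: index ', i866, ', length ', Length(s866));
  WriteLn('UTF-8: index ', iUtf8, ', length ', Length(sUtf8));

  // An index found in the UTF-8 string is not valid for the cp866 string.
  WriteLn('Copy(s866, iUtf8, 10) = "', Copy(s866, iUtf8, 10), '"');
  WriteLn('Copy(s866, i866, 10)  = "', Copy(s866, i866, 10), '"');
end.
```

The point is only that an index or length obtained from one encoding cannot
be applied to the same text in another encoding.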

Unless the new AnsiString support is improved considerably (in Delphi or
FPC), such string types are quite useless. At the least, it looks mandatory
that the RTL, other packages *and* the application all use strings of
the same encoding, so that no implicit conversions are necessary (except
between AnsiString and UnicodeString, as is already the case). Then the old
string header record could be used as well, with no need to store an encoding.

As a workaround I'd suggest that RawByteString Pos() converts the SubStr 
into the encoding of the *second* string, so that the comparison finds 
the correct index, applicable to the original string.
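
A rough sketch of that workaround, written as a user-level helper rather than
as a change to the RTL. PosInSourceCP is a hypothetical name, not an existing
routine, and the sketch assumes a current FPC with StringCodePage and
SetCodePage, plus a widestring manager (cwstring on Unix) that can convert
between the codepages involved:

```pascal
program PosConverted;
{$mode objfpc}{$H+}
{$ifdef unix}
uses
  cwstring;  // installs a widestring manager so SetCodePage can convert
{$endif}

// Hypothetical helper, not part of the RTL: re-encode SubStr into the
// codepage of Source, then do the usual byte-wise search.
function PosInSourceCP(const SubStr, Source: RawByteString): SizeInt;
var
  Tmp: RawByteString;
begin
  Tmp := SubStr;
  // Only the substring is converted; Source stays untouched, so the
  // result is an index into the caller's original bytes.
  if (Tmp <> '') and (Source <> '') and
     (StringCodePage(Source) <> CP_NONE) then
    SetCodePage(Tmp, StringCodePage(Source), True);
  Result := Pos(Tmp, Source);
end;

var
  Src, Sub: RawByteString;
begin
  Src := #$A4#$A0'-x';               // two Cyrillic letters + '-x' in cp866
  SetCodePage(Src, 866, False);      // tag only, do not convert
  Sub := #$D0#$B0;                   // the second letter, as UTF-8 bytes
  SetCodePage(Sub, CP_UTF8, False);  // tag only, do not convert

  WriteLn('plain Pos:     ', Pos(Sub, Src));
  WriteLn('converted Pos: ', PosInSourceCP(Sub, Src));
end.
```

Where the conversion to cp866 is available, the second call should find the
match that the plain byte-wise Pos misses. Characters that cannot be
represented in the codepage of Source are a remaining weak spot, because the
converted substring may then contain replacement characters.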

> The old Pos() works without codepage conversions. The test I gave, and
> other tests, show this.

Old Pos() and old AnsiString, as well as ShortString, assumed the native
encoding, so there was no need for codepage conversions. UTF-8
strings required special care, because no subroutine could detect the
encoding of a string parameter. The new AnsiString types *should* cure
that problem, but obviously they don't (yet) :-(

DoDi