[fpc-devel] new string - question on usage
Hans-Peter Diettrich
DrDiettrich1 at aol.com
Thu Oct 13 12:07:45 CEST 2011
Paul Ishenin wrote:
>> What's CP_NONE? Value and purpose?
>
> RawByteString codepage. Value $FFFF, and its purpose is to indicate that
> the string has no codepage assigned. I think at the moment the compiler
> does not produce strings of codepage $FFFF anymore, but before it did. So
> now we can probably clear these codepage checks out of the RTL.
Thanks :-)
>> It turned out that the result only is correct when at least one of the
>> strings is an UnicodeString. Otherwise Pos seems to end up in a
>> RawByteString compare, with the encoding ignored.
>
> That's because if one UnicodeString argument is present, a different
> Pos() is used. In this case the second RawByteString argument is
> converted to UnicodeString, taking its encoding into account.
Pos accepts only strings of the same type, with AnsiStrings (of any
codepage) being passed as RawByteStrings. When one argument is a
UnicodeString, the other argument is converted to Unicode as well. This
again is a source of trouble, because
  pos(string(s1251), s866)
will return the index in the *Unicode* string into which s866 is
implicitly converted, not an index into s866 itself :-(
The following test also tends to fail:
  i := pos(string(s1251), sUtf8);
  rest := Copy(s866, i+Length(sUtf8), 10);
The first bug is the index, which is wrong when sUtf8 contains MBCS
characters. The second bug is the possibly different Length of the
substr in cp_866 and cp_UTF8.
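
To make the two bugs concrete, here is a minimal sketch that models the byte arithmetic in Python rather than Pascal, so it does not depend on any particular RTL. The sample text and the helpers pos() and copy() are hypothetical stand-ins for 1-based Pos and Copy on byte strings, not the RTL routines themselves.

```python
# Hypothetical sample data: the same text held in two encodings.
text = "привет мир"
needle = "мир"

s866 = text.encode("cp866")     # 1 byte per character
s_utf8 = text.encode("utf-8")   # 2 bytes per Cyrillic character


def pos(sub: bytes, s: bytes) -> int:
    """1-based byte index of sub in s, 0 if absent or sub is empty."""
    return s.find(sub) + 1 if sub else 0


def copy(s: bytes, index: int, count: int) -> bytes:
    """1-based byte slice; assumes index >= 1."""
    return s[index - 1:index - 1 + count]


# Bug 1: the index depends on the encoding of the searched string.
i_866 = pos(needle.encode("cp866"), s866)
i_utf8 = pos(needle.encode("utf-8"), s_utf8)
print(i_866, i_utf8)        # 8 14

# Bug 2: the length of the substring depends on the encoding too.
len_866 = len(needle.encode("cp866"))
len_utf8 = len(needle.encode("utf-8"))
print(len_866, len_utf8)    # 3 6

# Applying the UTF-8 index and length to the CP866 string misses the match.
wrong = copy(s866, i_utf8, len_utf8)
right = copy(s866, i_866, len_866)
print(wrong)                              # b''
print(right == needle.encode("cp866"))    # True
```

An index or length is only meaningful for the string, and the encoding, it was computed from.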
Unless the new AnsiString support is improved considerably (in Delphi or
FPC), such string types are quite useless. At the very least it looks
mandatory that the RTL, other packages *and* the application all use
strings of the same encoding, so that no implicit conversions are
necessary (except between AnsiString and UnicodeString, as is). In that
case the old string header record could be kept as well, with no need to
store an encoding in it.
As a workaround I'd suggest that RawByteString Pos() converts the SubStr
into the encoding of the *second* string, so that the comparison finds
the correct index, applicable to the original string.
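
A minimal sketch of that workaround, again modelled in Python on byte strings. The function name pos_transcoding and its explicit code page parameters are hypothetical; in the RTL the encodings would come from the string headers instead.

```python
def pos_transcoding(sub: bytes, sub_cp: str, s: bytes, s_cp: str) -> int:
    """1-based byte index of sub in s, after converting sub to s's code page.

    Returns 0 if sub is empty, cannot be represented in s_cp, or is absent.
    """
    if not sub:
        return 0
    try:
        sub_in_s_cp = sub.decode(sub_cp).encode(s_cp)
    except UnicodeError:
        # A substring that cannot be expressed in s_cp cannot occur in s.
        return 0
    return s.find(sub_in_s_cp) + 1


# Hypothetical sample data: CP1251 substring, CP866 string to search.
s1251 = "мир".encode("cp1251")
s866 = "привет мир".encode("cp866")

raw = s866.find(s1251) + 1                               # encoding ignored
fixed = pos_transcoding(s1251, "cp1251", s866, "cp866")  # encoding honoured
print(raw, fixed)   # 0 8
```

Because only the substring is converted, the returned index refers to the bytes of the original second string and can be passed on to Copy unchanged.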
> Old Pos() works without codepage conversions. The test I gave and
> other tests show this.
Old Pos() and old AnsiString, as well as ShortString, assumed the native
encoding, so there was no need for codepage conversions. UTF-8
strings required special care, because no subroutine could detect the
encoding of a string parameter. The new AnsiString types *should* cure
that problem, but obviously they don't (yet) :-(
DoDi