[fpc-devel] Delphi new AnsiStrings are incredibly broken :-(
Hans-Peter Diettrich
DrDiettrich1 at aol.com
Fri Oct 14 14:23:00 CEST 2011
Apart from the implementation flaws already mentioned, I came across severe
problems with the new AnsiString *model* in general. Let's play around
with the Pos() function, which certainly is an indispensable part of any
string handling.
A general function
function Pos(SubStr: T1; Str: T2): integer;
should return the character index of SubStr in Str, i.e. Str[i] should
definitely be the beginning of SubStr within Str.
It should also be possible to find the end of SubStr within Str, e.g. in
order to return the remainder of the text.
With multiple coexisting string encodings we have to solve the following
problems:
A reasonable result, i.e. the index in the given string Str, counted in
its encoding T2, requires converting the search string SubStr into
exactly that encoding. This takes two conversions, from T1 into UTF-8
(or UTF-16) and then into T2. Clearly this can be avoided by using
strings of only one encoding, but what about string literals? When a
string literal has to be converted, it most probably ends up in UTF-8/16
encoding, which would cause the Unicode version of Pos() to be called,
giving a wrong result (an index counted in Unicode code units rather
than in bytes of T2). Even if we assume that string literals are
stored as native (CP_ACP) strings, or as Unicode, which actually depends
on compiler directives, a couple of overloaded Pos() functions would have
to be added if an unwanted conversion of *both* arguments into UTF-16 is
to be avoided.
The only possible solution, IMO, would be a
function Pos(SubStr: UnicodeString; Str: RawByteString): integer;
in the *hope* that this version takes precedence over the all-Unicode
version.
But once we have the beginning of the substring, how do we find its end?
Here Length(SubStr) is of little help, since it gives the number of
bytes (or code units) in encoding T1, which is useless with T2. So we need
a way to determine the length of a string in any (supported) encoding, like:
function EncodedLength(s: string; cp: TEncoding): integer;
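To make the idea concrete, here is a minimal sketch of such a helper for FPC 3.x. It is only an illustration under assumptions: it takes a numeric code page (TSystemCodePage) instead of a TEncoding, it always accepts a UnicodeString, and the program name and test string are made up. On Unix it relies on the cwstring unit for the code page conversion.

```pascal
program EncodedLengthSketch;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cwstring,{$endif} // widestring manager, needed for conversions on Unix
  SysUtils;

// Hypothetical helper (not part of the RTL): number of bytes that S
// occupies once it has been converted to the given code page.
function EncodedLength(const S: UnicodeString; CodePage: TSystemCodePage): Integer;
var
  Tmp: RawByteString;
begin
  Tmp := UTF8Encode(S);              // payload tagged as CP_UTF8
  SetCodePage(Tmp, CodePage, True);  // convert the payload to the target code page
  Result := Length(Tmp);             // Length() of an AnsiString counts bytes
end;

var
  S: UnicodeString;
begin
  S := 'caf';
  S := S + WideChar($00E9);          // append U+00E9 without a non-ASCII literal
  WriteLn('UTF-16 code units: ', Length(S));
  WriteLn('UTF-8 bytes:       ', EncodedLength(S, CP_UTF8));
  WriteLn('CP1252 bytes:      ', EncodedLength(S, 1252));
end.
```

The point of the sketch is that the three counts need not agree, so none of them can stand in for the others when computing an end index. Note that the helper itself performs a full conversion just to obtain a length.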
Or we add a function
function EndPos(SubStr: T1; Str: T2): integer;
returning the index of the char following SubStr in Str.
Or we combine both, into
function Pos2(SubStr: T1; Str: T2; out begIndex, endIndex: integer):
boolean;
with the result indicating whether SubStr was found in Str at all.
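A rough sketch of how such a Pos2 might look in FPC 3.x, assuming T1 = UnicodeString and T2 = RawByteString. The approach (convert the search string into the code page of Str, then search bytewise) is my assumption of a possible implementation, not existing RTL behaviour; the program name and test data are invented for the example.

```pascal
program Pos2Sketch;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cwstring,{$endif} // widestring manager, needed for conversions on Unix
  SysUtils;

// Hypothetical function: searches SubStr in Str, using the encoding of Str.
// BegIndex is the byte index of the match, EndIndex the index of the byte
// following the match; both are 0 when nothing was found.
function Pos2(const SubStr: UnicodeString; const Str: RawByteString;
  out BegIndex, EndIndex: Integer): Boolean;
var
  Needle: RawByteString;
begin
  Needle := UTF8Encode(SubStr);                   // tagged as CP_UTF8
  SetCodePage(Needle, StringCodePage(Str), True); // convert to the encoding of Str
  BegIndex := Pos(Needle, Str);                   // bytewise search, no further conversion
  Result := BegIndex > 0;
  if Result then
    EndIndex := BegIndex + Length(Needle)         // length in the encoding of Str
  else
    EndIndex := 0;
end;

var
  U, Search: UnicodeString;
  Hay: UTF8String;
  B, E: Integer;
begin
  U := 'na';
  U := U + WideChar($00EF);                       // U+00EF, two bytes in UTF-8
  U := U + 've text';
  Hay := UTF8Encode(U);
  Search := WideChar($00EF);
  Search := Search + 've';
  if Pos2(Search, Hay, B, E) then
    WriteLn('begin=', B, ' end=', E,
            ' Length(Search)=', Length(Search), ' matched bytes=', E - B)
  else
    WriteLn('not found');
end.
```

Two caveats for this sketch: a lossy conversion of SubStr (characters that do not exist in the code page of Str) can produce false matches, and a bytewise search can match in the middle of a multi-byte character in encodings that are not self-synchronizing.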
But even if we implement all that, and use it *everywhere* in our code,
the risk of any number of implicit encoding conversions remains :-(
Do you see any chance to reduce the number of possible conversions,
other than by using only one single encoding throughout RTL and
application code?
But what's the use of strings with a stored encoding, then? Except for
strict compatibility with a flawed Delphi model and implementation, which
may be dropped again in the next Delphi version?
DoDi