[fpc-devel] Delphi new AnsiStrings are incredibly broken :-(

Fri Oct 14 14:23:00 CEST 2011

Apart from the implementation flaws already mentioned, I came across 
severe problems with the new AnsiString *model* in general. Let's play 
around with the Pos() function, which certainly is an indispensable part 
of any string handling.

A general function
  function Pos(SubStr: T1; Str: T2): integer;
should return the character index of SubStr in Str, i.e. Str[i] must be 
the start of SubStr within Str.

It also should be possible to find the end of SubStr within Str, in 
order to e.g. return the remainder of the text.

With multiple coexisting string encodings we have to solve the following 
problems:

A reasonable result, i.e. the index in the given string of the given 
encoding T2, requires converting the search string SubStr into exactly 
that encoding. This takes two conversions, from T1 into UTF-8 (or 
UTF-16) and then into T2. Clearly this can be avoided by using strings 
of only one encoding, but what about string literals? When a string 
literal has to be converted, it most probably ends up in UTF-8/16 
encoding, which would cause the Unicode version of Pos() to be called, 
giving a wrong result. Even if we assume that string literals are stored 
as native (CP_ACP) strings, or as Unicode, which actually depends on 
compiler directives, a couple of overloaded Pos() functions would have 
to be added if an unwanted conversion of *both* arguments into UTF-16 is 
to be avoided.

The only possible solution would IMO be a
  function Pos(SubStr: UnicodeString; Str: RawByteString): integer;
in the *hope* that this version takes precedence over the all-Unicode 
version.

But when we have the start of the substring, how do we find its end?
Here Length(SubStr) is of little help, since it represents the number of 
bytes in encoding T1, useless with T2. So we need a feature to determine 
the length of a string in any (supported) encoding, like:
   function EncodedLength(s: string; cp: TEncoding): integer;

Or we add a function
  function EndPos(SubStr: T1; Str: T2): integer;
returning the index of the char following SubStr in Str.

Or we combine both, into
  function Pos2(SubStr: T1; Str: T2;
    out begIndex, endIndex: integer): boolean;
with the result indicating whether SubStr was found in Str.
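
To make the Pos2 idea concrete, here is a minimal sketch, assuming 
T1 = UnicodeString and T2 = UTF8String only. The name Pos2 and its 
signature are just the proposal from above, not an existing RTL function, 
and the variable names are made up for the example. It converts SubStr 
once into the encoding of Str and reports both indices as byte indices 
into Str. It does not address the overload precedence problem or other 
target encodings.

```pascal
program Pos2Sketch;
{$mode objfpc}{$H+}

{ Hypothetical helper, specialised to T1 = UnicodeString and
  T2 = UTF8String. SubStr is converted once into the encoding of Str,
  so both indices are byte indices into Str. }
function Pos2(const SubStr: UnicodeString; const Str: UTF8String;
  out begIndex, endIndex: Integer): Boolean;
var
  Needle: UTF8String;
begin
  Needle := UTF8Encode(SubStr);           // T1 -> T2, done exactly once
  begIndex := Pos(Needle, Str);           // byte index in Str, 0 if absent
  Result := begIndex > 0;
  if Result then
    endIndex := begIndex + Length(Needle) // first byte after the match
  else
    endIndex := 0;
end;

var
  SubStr, Haystack16: UnicodeString;
  Str: UTF8String;
  b, e: Integer;
begin
  // Build the strings from code points, so the demo does not depend on
  // the source file encoding or on compiler directives for literals.
  SetLength(SubStr, 2);
  SubStr[1] := WideChar($00E4);           // a-umlaut, 2 bytes in UTF-8
  SubStr[2] := WideChar($00F6);           // o-umlaut, 2 bytes in UTF-8
  Haystack16 := 'x' + SubStr + 'y';
  Str := UTF8Encode(Haystack16);

  WriteLn(Length(SubStr));                // 2 (UTF-16 code units)
  WriteLn(Length(Str));                   // 6 (UTF-8 bytes)
  if Pos2(SubStr, Str, b, e) then
    WriteLn(b, ' ', e, ' ', Copy(Str, e, Length(Str)));  // 2 6 y
end.
```

Note that begIndex + Length(SubStr) would give 4 here, which points into 
the middle of the second character; only the length of the *converted* 
substring yields a usable end index.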

But even if we implement all that, and use it *everywhere* in our code, 
the chance for any number of implicit encoding conversions remains :-(

Do you see any chance to reduce the number of possible conversions, 
other than by using only one single encoding throughout RTL and 
application code?

But what's the use of strings with a stored encoding, then? Except for 
strict compatibility with a flawed Delphi model and implementation, 
which may be dropped again in the next Delphi version?

DoDi