[fpc-devel] RawString
Michael Schnell
mschnell at lumino.de
Fri Jun 28 09:55:56 CEST 2013
Sorry for not being able to keep my mouth shut. (And in fact I don't
expect an answer to this contribution.)
I don't want to step on the toes of the compiler developers at all. My
argument always is "an fpc implementation with some leftover Embarcadero
misconceptions is a lot better than no fpc at all". Thus it is up to
them to decide what they do and what they don't.
Anyway, I'd like to summarize what can be learned from the long-winded
discussion (rather misplaced) in the thread "Performance of string
handling in trunk" (and from several similar discussions we had,
starting at least a year ago, when implementation of the "new String
Type" was initially discussed).
1) When doing an assignment, the compiler uses the static encoding types
of the "new type of" string variables to decide whether to generate code
that calls a conversion function or code that just copies the pointer to
the string record (and handles the reference counting). Supposedly the
library function uses the dynamic encoding type to decide which
conversion to do.
2) An assignment of a normal (strictly encoded) string variable to a
RawByteString is done by just copying the pointer to the string record
to the target RawByteString variable. This results in the dynamic
encoding of the RawByteString being set (correctly) according to the
encoding used.
3) It seems that in DXE (which version exactly?) an assignment from a
normal (strictly encoded) string to a RawByteString is possible and is
done by just copying the pointer to the target string variable. This is
a common assumption. I did not find any decent documentation on this. On
the contrary, the Delphi docs discourage the use of RawByteStrings by
normal users. I also did not hear from anybody who tested this.
4) If such an assignment is done in the way assumed, and the dynamic
encoding type of the RawByteString does not match the static encoding
type of the target, we will get a "hybrid" string with mismatched
static and dynamic encoding types. The behavior of such a beast is
unpredictable, and thus I consider the possibility of creating it this
easily a quirk.
5) In DXE the TStringList (and supposedly TStrings) class uses "String"
in its user interface. In Delphi, "String" is mapped to a type that is
strictly encoded as a (Windows-style) two-byte Unicode type. This
results in a huge performance hit when using it with a (normal) string
variable of another encoding type, as storing and retrieving results in
a double conversion.
6) Modifying the behavior in a way that avoids this quirk would be
rather easy, but even the slightest incompatibility does not seem
acceptable, although it would only hit when the normal Delphi user does
something that is very unlikely to be desired in Windows applications,
creates unpredictable behavior, and is explicitly discouraged in the DXE
docs. (I don't want to disagree with this.)
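
The quirk described in points 2 to 4 can be sketched with a toy model.
This is Python, not FPC code, and it does not claim to reproduce what
any real compiler does: StrRec, make and pointer_copy are hypothetical
names, the string record is reduced to "bytes plus dynamic code page",
and the static encoding is just a separate variable.

```python
from dataclasses import dataclass

# Windows code page numbers; mapping them to Python codecs is an
# assumption of this toy model.
CP_UTF8 = 65001
CP_LATIN1 = 28591
CODECS = {CP_UTF8: "utf-8", CP_LATIN1: "latin-1"}


@dataclass(frozen=True)
class StrRec:
    """Stand-in for the heap string record: bytes plus dynamic encoding."""
    payload: bytes
    codepage: int


def make(text: str, codepage: int) -> StrRec:
    """Build a record whose dynamic encoding matches its payload."""
    return StrRec(text.encode(CODECS[codepage]), codepage)


def pointer_copy(src: StrRec) -> StrRec:
    """Assignment without conversion: both variables share one record."""
    return src


latin_var = make("é", CP_LATIN1)     # typed variable, static = Latin-1
raw_var = pointer_copy(latin_var)    # point 2: typed -> RawByteString
utf8_static = CP_UTF8                # static encoding of the target
utf8_var = pointer_copy(raw_var)     # point 4: raw -> typed, unchecked

# The target variable now claims UTF-8 statically but holds Latin-1.
mismatched = utf8_var.codepage != utf8_static

# Code that trusts the static type would misread the payload.
try:
    utf8_var.payload.decode(CODECS[utf8_static])
    decodes_as_static = True
except UnicodeDecodeError:
    decodes_as_static = False

print(mismatched, decodes_as_static)
```

In this model the mismatch only arises in the second, unchecked copy;
the first copy into the raw variable is harmless because a raw variable
makes no static promise about the encoding.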
Thus, IMHO, a reasonably acceptable way to go would be to introduce yet
another String Type, which I'd like to call "RawString".
RawString should behave exactly like RawByteString, i.e. it is 1:1
bit-compatible except for the single case described below (which is
close to never used in Delphi programs). The handling of its dynamic
encoding type thus is identical (no difference regarding the conversion
library functions). It only features a different static encoding type at
compile time.
Now the difference from RawByteString is that, when assigning a
RawString to a normal String, the compiler creates code that compares
the dynamic encoding of the source with the static encoding of the
target (an empty target does not have a dynamic encoding). If they are
equal, it just does a pointer assignment; if not, it does the normal
thing and calls the conversion library function.
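
The proposed assignment rule can be sketched as follows. Again this is
a Python toy model, not FPC code: StrRec and assign_rawstring are
hypothetical names, and treating an empty source as None (a nil
pointer) is an assumption that the text above does not spell out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Windows code page numbers; mapping them to Python codecs is an
# assumption of this toy model.
CP_UTF8 = 65001
CP_LATIN1 = 28591
CODECS = {CP_UTF8: "utf-8", CP_LATIN1: "latin-1"}


@dataclass(frozen=True)
class StrRec:
    """Stand-in for the heap string record: bytes plus dynamic encoding."""
    payload: bytes
    codepage: int


def assign_rawstring(
    src: Optional[StrRec], target_cp: int
) -> Tuple[Optional[StrRec], bool]:
    """Assign a RawString to a typed string with static code page target_cp.

    Returns the record the target refers to afterwards, plus a flag
    telling whether a conversion was needed.
    """
    if src is None:
        # Assumption: an empty string is a nil pointer, nothing to check.
        return None, False
    if src.codepage == target_cp:
        # Encodings agree: plain pointer copy, the record is shared.
        return src, False
    # Encodings differ: convert into a new record in the target encoding.
    text = src.payload.decode(CODECS[src.codepage])
    return StrRec(text.encode(CODECS[target_cp]), target_cp), True


raw = StrRec("é".encode("latin-1"), CP_LATIN1)
same, converted_same = assign_rawstring(raw, CP_LATIN1)
other, converted_other = assign_rawstring(raw, CP_UTF8)
print(converted_same, converted_other, other.codepage)
```

The extra cost in the matching case is the single comparison in the
second branch, which is what the "some three additional assembler
instructions" estimate below refers to.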
The performance degradation is zero in nearly all decent legacy code (as
this action is never done there) and only some three additional
assembler instructions compared to the traditional use of
"RawByteString" (in the non-quirky case that the encoding is correct).
IMHO assigning a RawByteString to a normal String should result in a
compiler error (or in an exception in the case of mismatching encoding
types), unless some special quirk mode is set.
While all this is only a "theoretical" improvement (avoiding a
misbehavior in a discouraged case), it becomes valuable when accepting
RawString as a new and useful language feature. This of course means
that implementing RawString in fact only makes sense when the
maintainers of the libraries (such as RTL and LCL) are willing to make
use of it.
While in Delphi the use of any string encoding other than the two-byte
Unicode variant is not recommended, in other OSes it can be very
advantageous to have decent language and RTL support for different
encoding types.
Here especially TStrings descendants such as TStringList come into view.
If these use RawString in their user interface, a double conversion can
be avoided when storing and retrieving data with an encoding different
from the two-byte Unicode style.
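
The saving can be illustrated by counting conversions in a toy model.
This is Python, not FPC code; Utf16List and RawList are hypothetical
stand-ins for a string list with a strictly two-byte-Unicode interface
and one with a RawString interface, and to_codepage is a hypothetical
helper standing in for the conversion library function.

```python
from dataclasses import dataclass

# Windows code page numbers; mapping them to Python codecs is an
# assumption of this toy model.
CP_UTF8 = 65001
CP_UTF16 = 1200
CODECS = {CP_UTF8: "utf-8", CP_UTF16: "utf-16-le"}


@dataclass(frozen=True)
class StrRec:
    """Stand-in for the heap string record: bytes plus dynamic encoding."""
    payload: bytes
    codepage: int


def to_codepage(src: StrRec, target_cp: int, stats: dict) -> StrRec:
    """Share the record if encodings match, else convert and count it."""
    if src.codepage == target_cp:
        return src
    stats["conversions"] += 1
    text = src.payload.decode(CODECS[src.codepage])
    return StrRec(text.encode(CODECS[target_cp]), target_cp)


class Utf16List:
    """List whose interface is strictly two-byte Unicode."""

    def __init__(self) -> None:
        self.items = []
        self.stats = {"conversions": 0}

    def add(self, s: StrRec) -> None:
        self.items.append(to_codepage(s, CP_UTF16, self.stats))

    def get(self, index: int, target_cp: int) -> StrRec:
        return to_codepage(self.items[index], target_cp, self.stats)


class RawList(Utf16List):
    """List with a raw interface: records are stored as they arrive."""

    def add(self, s: StrRec) -> None:
        self.items.append(s)


utf8_text = StrRec("Grüße".encode("utf-8"), CP_UTF8)

strict = Utf16List()
strict.add(utf8_text)
back_strict = strict.get(0, CP_UTF8)

raw = RawList()
raw.add(utf8_text)
back_raw = raw.get(0, CP_UTF8)

print(strict.stats["conversions"], raw.stats["conversions"])
```

With UTF-8 data going in and coming out, the strict list converts on
both store and retrieve, while the raw list hands back the very same
record.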
Of course this has some serious consequences:
A) TStringList needs to be newly implemented in an encoding type
independent way.
B) For compatibility reasons, it needs to be possible to create
descendants of TStringList that offer an interface using normal
(strictly encoded) strings. I don't know if/how the compiler needs to be
enhanced to allow for this (are the different string variants
"compatible" in that way?).
BTW: the now "legacy" RawByteString is free to be used in the way its
name suggests: to hold an array of bytes. This is supported by never
converting its content. Ideally it should be allowed to concatenate it
with a variable of type Byte and to retrieve a byte by myRawString[i]
without a type override.
Now again it also makes sense to implement RawWordString,
RawDWordString, and RawQWordString, as well, in the obvious way :-)
-Michael