[fpc-devel] RawString

Michael Schnell mschnell at lumino.de
Fri Jun 28 09:55:56 CEST 2013


Sorry for not being able to keep my mouth shut. (And in fact I don't 
expect an answer to this contribution.)

I don't want to step on the toes of the compiler developers at all. My 
argument always is "an fpc implementation with some leftover Embarcadero 
misconceptions is a lot better than no fpc at all". Thus it is up to 
them to decide what they do and what they don't.

Anyway, I'd like to summarize what can be learned from the long-winded 
discussion (rather misplaced) in the thread "Performance of string 
handling in trunk" (and from several similar discussions we have had, 
starting at least a year ago, when implementation of the "new String 
Type" was initially discussed).


1) When doing an assignment, the compiler uses the static encoding types 
of the "new type" string variables to decide whether to generate code 
that calls a conversion function or to just copy the pointer to the 
string record (and handle the reference counting). Supposedly the 
library function uses the dynamic encoding type setting to decide which 
conversion to perform.

2) An assignment of a normal (strictly encoded) string variable to a 
RawByteString is done by just copying the pointer to the string record 
to the target RawByteString variable. This results in the dynamic 
encoding of the RawByteString being set (correctly) according to the 
encoding used.

3) It seems that in DXE (which version of it?) an assignment from a 
normal (strictly encoded) string to a RawByteString is possible and is 
done by just copying the pointer to the target string variable. This is 
a common assumption. I did not find any decent documentation on this. On 
the contrary, the Delphi docs discourage normal users from using 
RawByteStrings. I also did not hear from anybody who tested this.

4) If such an assignment is done in the way assumed, and the dynamic 
encoding type of the RawByteString does not match the static encoding 
type of the target, we will get a "hybrid" string with mismatched 
static and dynamic encoding types. The behavior of such a beast is 
unpredictable, and thus I consider the ease with which it can be 
created a quirk.

5) In DXE the TStringList (and supposedly TStrings) class uses "String" 
as its user interface. In Delphi, "String" is mapped to a type that is 
strictly encoded as a (Windows-style) two-byte Unicode type. This 
results in a huge performance hit when using it with a (normal) string 
variable with another encoding type, as storing and retrieving results 
in a dual conversion.

6) Modifying the behavior in a way that avoids this quirk would be 
rather easy, but even the slightest incompatibility does not seem 
acceptable, although here it would only hit when a normal Delphi user 
does something that is very unlikely to be desired in Windows 
applications, creates unpredictable behavior, and is explicitly 
discouraged in the DXE docs. (I don't want to disagree with this.)
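
To make points 1) to 4) concrete, here is a small model of the behavior 
as I understand it. It is written in Python only because that makes a 
short, testable sketch; it is not FPC or Delphi code. The names StrRec, 
new_typed, typed_to_raw and raw_to_typed_unchecked are invented for the 
illustration, and the unchecked pointer copy in the last function is the 
assumption from point 4), not documented behavior.

```python
from dataclasses import dataclass


@dataclass
class StrRec:
    """Hypothetical stand-in for the heap string record."""
    data: bytes      # payload bytes
    encoding: str    # dynamic encoding, stored together with the payload


def new_typed(text: str, static_encoding: str) -> StrRec:
    # A freshly filled typed variable: dynamic encoding == static encoding.
    return StrRec(text.encode(static_encoding), static_encoding)


def typed_to_raw(src: StrRec) -> StrRec:
    # Point 2: pointer copy; the raw variable inherits the dynamic encoding.
    return src


def raw_to_typed_unchecked(raw: StrRec, target_static: str) -> StrRec:
    # Point 4 (assumed behavior): pointer copy again, target_static ignored.
    return raw


latin = new_typed("Gr\u00fc\u00dfe", "latin-1")
raw = typed_to_raw(latin)
utf8_var = raw_to_typed_unchecked(raw, "utf-8")

# The variable is statically UTF-8, but its record still says latin-1.
mismatched = utf8_var.encoding != "utf-8"
```

In this model the bytes held by utf8_var are not valid UTF-8, so any 
code that trusts the static type of the variable misreads them.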


Thus, IMHO, a reasonably acceptable way to go would be to introduce yet 
another String Type I'd like to call "RawString".

RawString should behave exactly like RawByteString, i.e. it is 1:1 
bit-compatible except for the single case described below (which is 
close to never used in Delphi programs). The handling of its dynamic 
encoding type thus is identical (no difference regarding the conversion 
library functions). It only features a different static encoding type 
at compile time.

Now the difference vs RawByteString is that when assigning a RawString 
to a normal String the compiler creates code that compares the dynamic 
encoding of the source with the static encoding of the target (an empty 
target does not have a dynamic encoding). If they are equal it just does 
a pointer assignment; if they are unequal it does the normal stuff to 
call the conversion library function.
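
The proposed check can be sketched as follows. Again this is only a 
Python model under my own assumptions, not compiler output: StrRec and 
assign_rawstring are hypothetical names, and Python's decode/encode 
stands in for the conversion library function.

```python
from dataclasses import dataclass


@dataclass
class StrRec:
    """Hypothetical string record: payload plus dynamic encoding."""
    data: bytes
    encoding: str


def assign_rawstring(src: StrRec, target_static: str):
    """Model of 'typed := raw'. Returns (record for target, converted?)."""
    if src.encoding == target_static:
        # Encodings match: plain pointer copy, no conversion call.
        return src, False
    # Encodings differ: convert into a new record with the target encoding.
    text = src.data.decode(src.encoding)
    return StrRec(text.encode(target_static), target_static), True


utf8_src = StrRec("na\u00efve".encode("utf-8"), "utf-8")
latin_src = StrRec("na\u00efve".encode("latin-1"), "latin-1")

same, converted_a = assign_rawstring(utf8_src, "utf-8")
fixed, converted_b = assign_rawstring(latin_src, "utf-8")
```

Only the second assignment pays for a conversion; the first one is the 
cheap pointer copy, and in neither case can a mismatched string arise.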

The performance degradation is zero in nearly all decent legacy code (as 
this action is never performed there) and amounts to only about three 
additional assembler instructions compared with the traditional use of 
"RawByteString" (in the non-quirky case that the encoding is correct).

IMHO assigning a RawByteString to a normal String should result in a 
compiler error (or in an exception in the case of mismatching encoding 
types), unless some special quirk mode is set.



While all this is only a "theoretical" improvement (avoiding a 
misbehavior in a deprecated case), it becomes valuable when accepting 
RawString as a new and useful language feature. This of course means 
that implementing RawString only makes sense if the maintainers of the 
libraries (such as the RTL and LCL) are willing to make use of it.



While in Delphi the use of any String encoding other than the 2-byte 
Unicode variant is not recommended, in other OSes it can be very 
advantageous to have decent language and RTL support for different 
encoding types.

Here especially TStrings descendants such as TStringList come into 
view. If they use RawString in their user interface, a dual conversion 
can be avoided when storing and retrieving data with an encoding 
different from the 2-byte Unicode style.
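
The dual conversion can be counted in a small model. This is a Python 
sketch under my own assumptions, not the real TStringList: Utf16List 
stands for a list with a strictly UTF-16 interface, RawList for a 
hypothetical list that stores each string together with its dynamic 
encoding.

```python
class Utf16List:
    """Hypothetical list whose interface is strictly UTF-16."""

    def __init__(self):
        self.items = []
        self.conversions = 0

    def add(self, data: bytes, encoding: str) -> None:
        if encoding != "utf-16-le":
            # Caller's encoding differs: convert on the way in.
            data = data.decode(encoding).encode("utf-16-le")
            self.conversions += 1
        self.items.append(data)

    def get(self, index: int, encoding: str) -> bytes:
        data = self.items[index]
        if encoding != "utf-16-le":
            # Convert back on the way out.
            data = data.decode("utf-16-le").encode(encoding)
            self.conversions += 1
        return data


class RawList:
    """Hypothetical list that keeps each string's dynamic encoding."""

    def __init__(self):
        self.items = []          # (data, dynamic encoding) pairs
        self.conversions = 0

    def add(self, data: bytes, encoding: str) -> None:
        self.items.append((data, encoding))

    def get(self, index: int, encoding: str) -> bytes:
        data, stored = self.items[index]
        if stored != encoding:
            # Convert only when the caller wants a different encoding.
            data = data.decode(stored).encode(encoding)
            self.conversions += 1
        return data


payload = "Stra\u00dfe".encode("utf-8")
wide, raw = Utf16List(), RawList()
wide.add(payload, "utf-8")
raw.add(payload, "utf-8")
wide_back = wide.get(0, "utf-8")
raw_back = raw.get(0, "utf-8")
```

Storing and retrieving one UTF-8 string costs two conversions with the 
UTF-16 interface and none with the raw one.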



Of course this leads to some serious follow-up considerations:

A) StringList needs to be newly implemented in an encoding type 
independent way.

B) For compatibility reasons, it needs to be possible to create 
descendants of TStringList that offer an interface using normal 
(strictly encoded) strings. I don't know if/how the compiler needs to be 
enhanced to allow for this (are the different string variants 
"compatible" in that way?).


BTW: the now "legacy" RawByteString is then free to be used in the way 
its name suggests: to hold an array of bytes. This is supported by 
never converting its content. Ideally it should be allowed to 
concatenate it with a variable of type Byte and to retrieve a byte via 
myRawString[i] without a type override.

Now again it also makes sense to implement RawWordString, 
RawDWordString, and RawQWordString, as well, in the obvious way :-)

-Michael




