[fpc-pascal] Generic String Functions

Sven Barth pascaldragon at googlemail.com
Fri Feb 28 15:00:41 CET 2014


On 28.02.2014 14:16, Michael Schnell wrote:
> On 02/28/2014 12:53 PM, Sven Barth wrote:
>>
>> Problem: there is (currently) no string type that can handle ANSI, 
>> UTF-8 and UTF-16 at once. The first two are handled by AnsiString and 
>> the third by UnicodeString. And those two are not equal which would 
>> be important for overrides/overloads/inheritance. Without that your 
>> whole idea is useless.
>>
>
> Of course this is only relevant when "New Delphi" (i.e. "partly" 
> dynamically encoded) strings are introduced (I decline to use the 
> terms "AnsiString" and "UnicodeString" due to ambiguity, unless they 
> come with a clear definition close by).
Unless stated otherwise, AnsiString and UnicodeString are meant as 
implemented in FPC trunk.
> Here, the Delphi model does not provide a string encoding type (and 
> appropriate "compiler magic") that can be used for that purpose (i.e. 
> "fully dynamically encoded").
Basically it does. In theory the header record prepended to each 
string (which contains the reference count, among other things) could be 
used for 1-, 2-, 4- or multi-byte strings, as it carries an "ElementSize" 
field that is currently fixed to 1 for AnsiString (even with UTF-8) and 
to 2 for UnicodeString (both string types use the same record layout, 
though they are declared as different ones). There is also the 
StringElementSize function, which is overloaded for RawByteString and 
UnicodeString and already returns the value of ElementSize. So, purely 
in theory, the current AnsiString type would already be capable enough. 
The compiler might also already handle overloads correctly if we had a 
(for now hypothetical) "AnsiString(UTF16)", which would be equal to 
UnicodeString.

One of the problematic parts, which Marco already mentioned, is 
character access. A possible solution here would be to fix the character 
size according to the declared string type rather than the runtime 
type: 2 for AnsiString(UTF16), 4 for AnsiString(UTF32), 1 for any 
1-byte AnsiString encoding, and either 1 or 6 for UTF-8 (6 is the 
maximum number of bytes that UTF-8 might encode a character with, though 
the maximum currently used is 4). The compiler would then either need 
to insert appropriate conversions if the runtime type does not match 
the declared type (for whatever reason), or it would need to assume 
that the runtime type always matches the declared type. The former 
case might carry quite a performance penalty (which could be avoided if 
the compiler generated appropriate inline code for detecting the 
runtime encoding).
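
To make the ElementSize and character access points concrete, here is a 
minimal sketch, assuming an FPC version that already provides 
StringElementSize and StringCodePage in the System unit (as trunk does). 
The program name and variable names are made up for illustration.

```pascal
program strmeta;
{$mode objfpc}{$H+}

var
  a: AnsiString;
  u8: UTF8String;
  u16: UnicodeString;
begin
  a := 'abc';
  u16 := WideChar($00E4);  // one UTF-16 code unit (U+00E4)
  u8 := UTF8Encode(u16);   // the same character encoded as UTF-8

  // ElementSize as stored in the string header:
  // per the description above, 1 for the AnsiString family
  // and 2 for UnicodeString.
  WriteLn('AnsiString    element size: ', StringElementSize(a));
  WriteLn('UTF8String    element size: ', StringElementSize(u8));
  WriteLn('UnicodeString element size: ', StringElementSize(u16));

  // The runtime code page is kept in the same header.
  WriteLn('UTF8String code page: ', StringCodePage(u8));

  // Character access works on elements, not on characters:
  // U+00E4 is C3 A4 in UTF-8, so the UTF8String holds two
  // elements while the UnicodeString holds one.
  WriteLn('Length(u16) = ', Length(u16));
  WriteLn('Length(u8)  = ', Length(u8));
  WriteLn('Ord(u8[1])  = ', Ord(u8[1]));
end.
```

Indexing u8 yields single bytes of the encoded sequence, which is exactly 
why the element size used for s[i] would have to be tied to the declared 
type in the scheme sketched above.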
An open problem left would be RTTI, as there currently are tkUString (for 
UnicodeString) and tkLString (for AnsiString), of which the second 
contains a code page field while the first does not. And to keep Delphi 
code as compatible as possible, the compiler would then again need 
to handle the RTTI of an AnsiString(UTF16) differently...
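
The RTTI difference can be inspected with a small sketch like the 
following, assuming the TypInfo unit of a current FPC. ShowKind is a 
made-up helper. The program prints the kind names instead of assuming 
them, since FPC may report a different kind name for AnsiString (for 
example tkAString) than Delphi's tkLString.

```pascal
program strrtti;
{$mode objfpc}{$H+}

uses
  TypInfo;

// Print the RTTI type kind of a string type by name.
procedure ShowKind(const Name: string; Info: PTypeInfo);
begin
  WriteLn(Name, ': ',
    GetEnumName(TypeInfo(TTypeKind), Ord(Info^.Kind)));
end;

begin
  ShowKind('AnsiString', TypeInfo(AnsiString));
  ShowKind('UTF8String', TypeInfo(UTF8String));
  ShowKind('UnicodeString', TypeInfo(UnicodeString));

  // Only the AnsiString family carries a code page in its
  // type data; there is no such field for UnicodeString.
  WriteLn('UTF8String code page: ',
    GetTypeData(TypeInfo(UTF8String))^.CodePage);
end.
```

A hypothetical AnsiString(UTF16) would show up here with the AnsiString 
kind plus a code page, whereas Delphi code would expect the 
UnicodeString kind for the same data.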

Regards,
Sven


