[fpc-pascal] Generic String Functions
Sven Barth
pascaldragon at googlemail.com
Fri Feb 28 15:00:41 CET 2014
On 28.02.2014 14:16, Michael Schnell wrote:
> On 02/28/2014 12:53 PM, Sven Barth wrote:
>>
>> Problem: there is (currently) no string type that can handle ANSI,
>> UTF-8 and UTF-16 at once. The first two are handled by AnsiString and
>> the third by UnicodeString. And those two are not equal which would
>> be important for overrides/overloads/inheritance. Without that your
>> whole idea is useless.
>>
>
> Of course this is only relevant when "New Delphi" (i.e. "partly"
> dynamically encoded) strings are introduced (I decline to use the
> terms "AnsiString" and "UnicodeString" due to ambiguity, unless they
> come with a clear definition close by).
Unless stated otherwise, AnsiString and UnicodeString mean the types as
implemented in FPC trunk.
> Here, the Delphi model does not provide a String encoding type (and
> appropriate "compiler magic") that can be used for that purpose (i.e.
> "fully dynamically encoded").
Basically it does. In theory the additional record prepended to each
string (which contains the reference count, among other things) could be
used for 1-, 2-, 4- or multi-byte strings, as it carries an "ElementSize"
field which is currently fixed to 1 for AnsiString (even with UTF-8) and
to 2 for UnicodeString (both strings use the same record layout, though
they are declared as different records). There is also the
StringElementSize function, which is overloaded for RawByteString and
UnicodeString and which already returns the value of ElementSize. So,
purely in theory, the current AnsiString type would already be capable
enough. The compiler might also already handle overloads correctly if we
had a (for now hypothetical) "AnsiString(UTF16)" (which would be equal
to UnicodeString). One of the problematic parts, which Marco already
mentioned, is character access. A possible solution here would be to
fix the character size according to the declared string type rather than
the runtime type: 2 for AnsiString(UTF16), 4 for AnsiString(UTF32), 1 for
any 1-byte AnsiString encoding and either 1 or 6 for UTF-8 (6 is the
maximum number of bytes that UTF-8 might encode a character with, though
the maximum currently used is 4). The compiler would then either need to
insert appropriate conversions if the runtime type does not match the
declared type (for whatever reason), or it would need to assume that the
runtime type always matches the declared type. The former case might
carry quite a performance penalty (which could be avoided if the
compiler generated appropriate inline code for detecting the runtime
encoding).
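To make the point about the string header concrete, here is a minimal
sketch that prints what the RTL reports for both string types and then
does by hand the kind of "check the runtime encoding, convert if it does
not match" step described above. It assumes a compiler with code page
aware strings (FPC trunk at the time of writing), where
StringElementSize, StringCodePage and SetCodePage are available in the
System unit. AsUtf8 is a hypothetical helper written only for this
illustration; it is not part of the RTL and not what the compiler
generates.

```pascal
program HeaderDemo;

{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif} // code page conversion support on Unix
  SysUtils;

// Hypothetical helper (not part of the RTL): returns S re-encoded as
// UTF-8 if the code page stored in its header is something else.
function AsUtf8(const S: RawByteString): RawByteString;
begin
  Result := S;
  if (Result <> '') and (StringCodePage(Result) <> CP_UTF8) then
    SetCodePage(Result, CP_UTF8, True); // True = convert the data too
end;

var
  A: AnsiString;
  U: UnicodeString;
  R: RawByteString;
begin
  A := 'abc';
  U := 'abc';

  // Both values are read from the record prepended to the string data.
  WriteLn('AnsiString:    element size = ', StringElementSize(A),
          ', code page = ', StringCodePage(A));
  WriteLn('UnicodeString: element size = ', StringElementSize(U),
          ', code page = ', StringCodePage(U));

  // Manual version of "convert if the runtime type does not match".
  R := AsUtf8(A);
  WriteLn('After AsUtf8:  element size = ', StringElementSize(R),
          ', code page = ', StringCodePage(R));
end.
```

Going by the description above, the element sizes should come out as 1
for the AnsiString and 2 for the UnicodeString; the code page values
depend on the source encoding and the system settings. The check in
AsUtf8 is the cheap part, the conversion behind SetCodePage is where the
performance penalty would come from.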
An open problem left would be RTTI, as there currently are tkUString (for
UnicodeString) and tkLString (for AnsiString), of which the second
contains a code page field while the first does not. And to keep Delphi
code as compatible as possible, the compiler would then again need to
handle the RTTI of an AnsiString(UTF16) differently...
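For anyone who wants to look at the RTTI side themselves, a small sketch
using the TypInfo unit follows. It simply prints the type kind the
compiler reports for a few string types. The part that reads the code
page rests on an assumption about FPC's TypInfo unit, namely that FPC
reports AnsiString types as tkAString and that TTypeData has a CodePage
field for that kind (Delphi uses tkLString here); if your compiler
version differs, that branch is simply skipped. ShowKind is a
hypothetical helper for this illustration only.

```pascal
program RttiDemo;

{$mode objfpc}{$H+}

uses
  TypInfo;

// Hypothetical helper: prints the type kind stored in the RTTI and,
// for kinds assumed to carry one, the code page field.
procedure ShowKind(const AName: string; Info: PTypeInfo);
begin
  Write(AName, ': ', GetEnumName(TypeInfo(TTypeKind), Ord(Info^.Kind)));
  // Assumption: only tkAString has a CodePage field in FPC's TTypeData.
  if Info^.Kind = tkAString then
    Write(', code page = ', GetTypeData(Info)^.CodePage);
  WriteLn;
end;

begin
  ShowKind('AnsiString', TypeInfo(AnsiString));
  ShowKind('UTF8String', TypeInfo(UTF8String));
  ShowKind('UnicodeString', TypeInfo(UnicodeString));
end.
```

The output shows the asymmetry mentioned above: the 1-byte string types
come with a code page in their RTTI, the UnicodeString kind does not, so
a hypothetical AnsiString(UTF16) would have to be mapped onto one of the
two.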
Regards,
Sven