[fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"

Wed Dec 3 05:02:13 CET 2014

Michael Schnell wrote:
> On 11/28/2014 09:15 PM, Hans-Peter Diettrich wrote:

>> Apart from that, every encoding-tolerant code will execute much slower 
>> than code without a need for checks and conversions everywhere.
> As I pointed out I don't agree at all.
>  - The check is only two ASM instructions
>  - It does not result in additional conversions.

It does, e.g. when searching or sorting a StringList that can contain
strings of different encodings. Choosing a single encoding for all
application strings (maybe CP_ACP, UTF-8 or UTF-16) eliminates such
conversions.
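
A minimal sketch of the "single encoding" approach, assuming FPC 3.x
code page aware strings: every string is normalized once on entry, so
later comparisons need no per-operation check. EnsureUTF8 is a
hypothetical helper, not an RTL routine; StringCodePage and SetCodePage
are the System unit routines. On Unix a widestring manager (e.g. unit
cwstring) may be needed for non-ASCII data.

```pascal
program NormalizeDemo;
{$mode objfpc}{$H+}

// Hypothetical helper (not part of the RTL): convert s to UTF-8 unless
// its dynamic code page already is CP_UTF8.
procedure EnsureUTF8(var s: RawByteString);
begin
  if StringCodePage(s) <> CP_UTF8 then
    SetCodePage(s, CP_UTF8, True);
end;

var
  items: array[0..1] of RawByteString;
  u: UTF8String;
  i: Integer;
begin
  items[0] := 'alpha';  // literal; its dynamic code page depends on compiler settings
  u := 'beta';
  items[1] := u;        // carries the dynamic code page of u
  // Normalize once, up front, instead of checking in every comparison.
  for i := Low(items) to High(items) do
    EnsureUTF8(items[i]);
  for i := Low(items) to High(items) do
    WriteLn(items[i], ' -> code page ', StringCodePage(items[i]));
end.
```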

> So the "Checking Overhead" is nothing but a rumor. (Remember, I don't 
> suggest dropping the standard "statically typed" paradigm, altogether, 
> as close loops of course work best in that way.)

The actual rumor is that the "Conversion Overhead" is unimportant, i.e.
how often a check leads to a conversion. When no check is required,
conversions consequently cannot occur at all.

>>> RawXxxString can be used for really "uncoded" data as done with 
>>> old-style strings in a lot of applications.
>>
>> Such a feature would be appreciated by many users, indeed :-)

> But why do you say "would be appreciated" ? Is it not possible to use 
> "RawByteString" in a way the name suggests, by never bringing it 
> together with any String variable of a different encoding brand and 
> hence avoid any conversion - be it intentional/documented/useful or not.

RawByteString cannot serve two different purposes :-(

In *Delphi* it is used as a polymorphic string, capable of *holding*
actual strings of any encoding. But when assigned to a variable of a
different encoding, a conversion may occur that converts the string into
the declared (static) encoding of the target variable.

In *FPC* it currently is used somewhat close to your idea, i.e. no
conversion occurs in an assignment between a RawByteString and some
other AnsiString, in either direction (*to* and *from*). We can only
*hope* that *all* AnsiString operations are based on the dynamic
encoding of every operand, with the corresponding checks and
conversions inserted everywhere. This actually is not true, because the
compiler relies on the static encoding of AnsiString variables, and
inserts checks and conversions only when that encoding is different.
Actually a single AnsiString type would be sufficient, because it
already can hold data of any encoding :-(
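
A small probe program can show what a given compiler really does in
such assignments. This is only a sketch, assuming FPC 3.x and the
System unit's StringCodePage; the printed values depend on compiler
version, platform and source file encoding, so none are claimed here.

```pascal
program RawDemo;
{$mode objfpc}{$H+}
var
  u: UTF8String;      // static encoding CP_UTF8
  r: RawByteString;   // static encoding CP_NONE
  a: AnsiString;      // static encoding CP_ACP
begin
  u := 'text';
  r := u;   // to RawByteString: the dynamic encoding of u is carried along
  a := r;   // from RawByteString: the case where Delphi and FPC may differ
  // Print the dynamic code page stored with each string.
  WriteLn('u: ', StringCodePage(u));
  WriteLn('r: ', StringCodePage(r));
  WriteLn('a: ', StringCodePage(a));
end.
```

If the value printed for a differs from the static encoding of
AnsiString, the variable holds data in an encoding the compiler does
not expect there.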

I understand the FPC attempt to allow *at the same time* for the new
(encoded) and the old (unencoded) AnsiString behaviour, where no
automatic conversions are allowed. But this would also require that
e.g. all string literals are stored in that (immutable) encoding *too*,
and that this encoding can *not* be changed at runtime, while
DefaultSystemCodePage *can* be changed.

When the result of converting a string of encoding CP_NONE is
undefined, which of course is correct for the *dynamic* encoding, this
could simply be changed into "conversions of CP_NONE strings do
nothing". Then CP_NONE would be the perfect encoding for old-style
AnsiStrings. The only remaining problem would be string expressions and
assignments whose operands have different dynamic encodings. In
these cases all operands would have to be converted into the CP_NONE
encoding, as specified in another DefaultNoneEncoding constant (not
variable!); the same encoding would apply in assignments *to* variables
of a different encoding. Then all type aliases for AnsiStrings must also
have unique names, which allow distinguishing e.g.
   type UTF8String = AnsiString;
from
   type NewUTF8String = type AnsiString(CP_UTF8);

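
The difference between the two kinds of declaration can be inspected
with a sketch like the following, assuming FPC 3.x. The type names
OldStyleUTF8String and NewUTF8String are hypothetical and chosen so
that they do not clash with the UTF8String already declared in the
System unit; the printed code pages depend on the compiler settings.

```pascal
program AliasDemo;
{$mode objfpc}{$H+}
type
  // Plain alias: identical to AnsiString, static encoding CP_ACP.
  OldStyleUTF8String = AnsiString;
  // Distinct type with the static encoding CP_UTF8.
  NewUTF8String = type AnsiString(CP_UTF8);
var
  a: OldStyleUTF8String;
  b: NewUTF8String;
begin
  a := 'abc';
  b := 'abc';
  // Dynamic code pages actually stored with the two strings.
  WriteLn('alias:    ', StringCodePage(a));
  WriteLn('new type: ', StringCodePage(b));
end.
```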
DoDi