[fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"

Wed Dec 3 10:42:37 CET 2014

On 12/03/2014 05:02 AM, Hans-Peter Diettrich wrote:
> Michael Schnell schrieb:
>>  - It does not result in additional conversions.
> It does, e.g. in searching or sorting of StringList, when it can contain
> strings of different encodings. The choice of a unique encoding for
> application strings (maybe CP_ACP, UTF-8 or UTF-16) eliminates such
> conversions.
If multiple encoding brands are involved, a system without DynamicString 
also needs to do conversions. So DynamicString does not impose 
*additional* conversions.
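
To illustrate the point, here is a minimal Python model (not FPC code; 
`Tagged`, `make` and `index_of` are hypothetical names invented for this 
sketch) of a search over a list whose entries each carry their own 
encoding. A comparison is counted as "converted" only when the two 
entries really differ in encoding, which is exactly the case in which a 
statically typed design would have to convert as well.

```python
from typing import NamedTuple

class Tagged(NamedTuple):
    data: bytes      # raw payload
    encoding: str    # stands in for the encoding brand

def make(text, encoding):
    """Build a tagged string from ordinary text."""
    return Tagged(text.encode(encoding), encoding)

def index_of(items, needle):
    """Return (index, converted): index is -1 if needle is absent,
    converted counts the comparisons that needed a conversion."""
    converted = 0
    for i, item in enumerate(items):
        if item.encoding == needle.encoding:
            # Same brand: compare the bytes directly, no conversion.
            if item.data == needle.data:
                return i, converted
        else:
            # Different brands: a conversion is needed in any design.
            converted += 1
            if item.data.decode(item.encoding) == needle.data.decode(needle.encoding):
                return i, converted
    return -1, converted
```

With a single encoding throughout the list, `converted` stays at 0; it 
only rises once an entry with a different brand is actually present.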

>> So the "Checking Overhead" is nothing but a rumor. (Remember, I don't 
>> suggest dropping the standard "statically typed" paradigm 
>> altogether, as tight loops of course work best in that way.)
> The rumor is the unimportant "Conversion Overhead", i.e. how often a
> check leads to a conversion. When no check is required, conversions
> consequently cannot occur at all.
Please re-read the text I wrote.
  - If DynamicString is not used in the user code, the compiler creates 
the same code as before. So no overhead.
  - If DynamicString is used (in user code or in a library interface), 
but only a single encoding brand is used everywhere statically 
encoded strings are in place ("a single program-wide string 
representation", as you suggested in your previous mail), the only runtime 
overhead is an additional check of the "EncodingType" variable, which 
the compiler inserts at the locations where DynamicString is used 
(i.e. not in any tight loops). Unless the user actively 
decides to create string variables with encoding brands other than the 
program-wide default, this check *always* finds at runtime that no 
conversion is necessary, and the code acts as if the String were not 
dynamic but already "correct". The overhead of the check is obviously at 
most some 5 ASM instructions and hence negligible compared with the 
function call made to enter the library function in question.
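
The boundary check described above could be modelled roughly as follows. 
This is a Python sketch of the idea only, not what the compiler would 
emit; `DynString`, `DEFAULT_ENCODING`, `as_default` and `library_length` 
are hypothetical names, and UTF-8 as the program-wide default is an 
assumption made for the example.

```python
from typing import NamedTuple

class DynString(NamedTuple):
    data: bytes
    encoding_type: str   # models the "EncodingType" field in the string header

DEFAULT_ENCODING = "utf-8"   # assumed program-wide default

def as_default(s):
    """Boundary check: hand back s in the program-wide encoding."""
    if s.encoding_type == DEFAULT_ENCODING:
        return s         # fast path: one comparison, no copy, no conversion
    # Slow path: only reached if a differently branded string shows up.
    text = s.data.decode(s.encoding_type)
    return DynString(text.encode(DEFAULT_ENCODING), DEFAULT_ENCODING)

def library_length(s):
    """Hypothetical library routine taking a DynamicString-style parameter."""
    s = as_default(s)    # the only work added at the interface
    return len(s.data.decode(DEFAULT_ENCODING))
```

As long as every string in the program already carries the default 
brand, `as_default` returns its argument unchanged on every call.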

> RawByteString cannot serve two different purposes :-( ....
As I pointed out as well: A variable's encoding brand can't be static and 
dynamic at the same time. This is the cause of the major misconception 
imposed by Delphi regarding RawByteString. And this is why I would leave 
RawByteString aside (as it is / as it is assumed to be / whatever) and 
for any improvement use a completely new Type name and a "CP_ANY" 
constant / value.

>
> In *Delphi* it is used as a polymorphic string, capable of *holding*
> actual strings of any encoding. But when assigned to a variable of a
> different encoding, a conversion may occur that converts the string into
> the declared (static) encoding of the target variable.
Seemingly rather close to what I suggest as "DynamicString". But (see 
http://wiki.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support 
) with a dynamic String, the encoding brand number of the dynamic type 
itself would never be allowed to be written into the EncodingType field 
in the string header.
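
A rough Python model of that rule, under my reading of the proposal: the 
header always records the real encoding of the content, while the 
dynamic brand exists only as the declared type of a variable. `CP_ANY` 
is represented by a plain marker string here, and `Str` and `assign` are 
hypothetical names for this sketch.

```python
CP_ANY = "any"   # hypothetical brand of the dynamic *type*, never of content

class Str:
    def __init__(self, data, encoding):
        # The header must describe the actual content, so CP_ANY is refused.
        if encoding == CP_ANY:
            raise ValueError("CP_ANY must not be stored in the string header")
        self.data = data
        self.encoding = encoding

def assign(value, declared):
    """Assign value to a variable whose declared brand is `declared`."""
    if declared == CP_ANY or declared == value.encoding:
        return value     # dynamic target, or brands already match: keep as is
    # Statically branded target with a different encoding: convert.
    text = value.data.decode(value.encoding)
    return Str(text.encode(declared), declared)
```

Assigning to a `CP_ANY` variable keeps both the bytes and the header 
untouched; only a statically branded target with a different encoding 
triggers a conversion.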

If that description of Delphi were accurate, why do the Delphi docs 
discourage making proper use of the dynamic feature of RawByteString?

Anyway. A "dynamic" String type only makes sense if it is used in as 
many library interfaces as possible (and in TStrings). This is not done 
in Delphi. There it is not nice, as in many cases it restricts how the 
user can make use of these libraries, but it is not as critical as with 
fpc, where you need to consider portability issues.

>
> In *FPC* it currently is used somewhat close to your idea, i.e. no
> conversion occurs in both an assignment to *and from* an RawByteString
> to some other AnsiString. 

As said, to avoid ambiguity, I vote for adding yet another string type 
name (e.g. "ByteString" denoted by CP_BYTE) that is *known* to disallow 
any conversion, and for leaving RawByteString as close as possible to 
the moving target that Delphi presents.

>
> I understand the FPC attempt, to allow *at the same time* for the new
> (encoded) and old (unencoded) AnsiString behaviour, where no automatic
> conversions are allowed. But this would require at the same time, that
> e.g. all string literals *also* are stored in that (immutable) encoding,
> and that this encoding can *not* be changed at runtime, while
> DefaultSystemCodePage *can* be changed.

I feel that this (simplified) attempt can't result in a decent paradigm. 
It is close to impossible to completely describe the behavior in an 
understandable way and it's prone to a lot of ambiguity.

That is why I tried to invent a concept that I suppose might work and 
will not break (much) existing code. It is intended to be "straight" 
from the ground up (it is not even necessary to assume that the content 
of a "String" is printable/readable, but it should easily work for that 
application). It would allow for flexible use of Strings with 
understandable and easy-to-use syntactic sugar, and would no longer 
impose restrictions on portability. IMHO it would not impose any 
noticeable performance degradation, either.

-Michael