[fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"

Fri Nov 28 13:41:33 CET 2014

On 11/27/2014 03:44 PM, Hans-Peter Diettrich wrote:
>> The "universal paradigm" would allow for extensions (e.g. UTF-32, 
>> multiple 16 Bit Code pages, an additional fully dynamic String type, 
>> n-byte "un-encoded" string types), as I described in the Wiki page.
>
> Even if feasible, such arbitrary string storage can dramatically 
> increase the number of implicit string conversions. 

Of course it can do harm in that respect, if the user is silly enough to 
*explicitly* declare variables with an encoding brand without thinking 
about what he is doing. But this is exactly the same when he just uses 
the stuff currently offered by Delphi and fpc. If you arbitrarily define 
code pages for your 8 bit ("ANSI") string variables, you will enforce 
many conversions.

Currently in Delphi, if you don't define special code pages, everything 
will be UTF-16. So no unnecessary conversions.

In fpc (and maybe Lazarus as well) I suppose the approach currently in 
the works is (when the default behavior is not changed by certain options):
  - when compiling for Windows, "String" is UTF-16, and the RTL and LCL 
ubiquitously use "String": So no unnecessary conversion
  - when compiling for Linux, "String" is UTF-8, and the RTL and LCL 
ubiquitously use "String": So no unnecessary conversion, either.

If this is done in the libraries (e.g. RTL and LCL) and in user code, 
this would allow for as few conversions as possible and thus the best 
performance. Here, you would need different library binaries, which might 
or might not be a problem.

But of course the portability is very questionable (including, but not 
limited to, the fact that the result of "pos" is different).
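
To make the "pos" point concrete, here is a small sketch. It is written 
in Python only because its encode() makes code units easy to count; it 
is not fpc code, and the helper name "pos" is just borrowed for 
illustration. It assumes that an index counts code units of the 
underlying encoding (bytes for UTF-8, 16 bit words for UTF-16).

```python
def pos(sub: str, text: str, encoding: str) -> int:
    """1-based index of sub in text, counted in code units (0 if absent)."""
    unit = 2 if encoding == "utf-16-le" else 1  # bytes per code unit
    hay = text.encode(encoding)
    needle = sub.encode(encoding)
    i = hay.find(needle)
    # For UTF-16, skip matches that start in the middle of a code unit.
    while i != -1 and i % unit != 0:
        i = hay.find(needle, i + 1)
    return 0 if i == -1 else i // unit + 1


text = "Gr\u00fc\u00dfe"  # "Grüße"
print(pos("\u00df", text, "utf-8"))      # 5: the "ü" before it takes two bytes
print(pos("\u00df", text, "utf-16-le"))  # 4: every character here is one word
```

The same search on the same text yields a different index, so any user 
code that stores or hard-codes such indexes is not portable between the 
two defaults.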

When (on top of this) the interfaces to libraries (including 
TStrings) are done with "DynamicString" (encoding brand "CP_ANY"), no 
additional conversions would be necessary: all other Strings use the 
same encoding brand (either UTF-16 or UTF-8, depending on the OS), and 
hence the dynamic encoding of all DynamicStrings used would always be 
exactly that brand. Hence, IMHO, this would not harm at all, as the 
overhead of the code the compiler needs to generate to just check the 
dynamic encoding brand and find that no conversion is necessary is 
extremely small.
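
As a rough model of that argument (again in Python, not fpc code): the 
names "Tagged" and "assign" are hypothetical, and a target brand of None 
stands in for the proposed "CP_ANY". The sketch assumes an assignment 
only transcodes when the static target brand differs from the dynamic 
brand of the source.

```python
from dataclasses import dataclass
from typing import Optional

conversions = 0  # number of real transcoding operations performed


@dataclass
class Tagged:
    """Payload bytes plus the encoding brand they are stored in."""
    data: bytes
    brand: str


def assign(src: Tagged, target_brand: Optional[str]) -> Tagged:
    """Assign src to a variable of target_brand (None models CP_ANY)."""
    global conversions
    if target_brand is None or target_brand == src.brand:
        return src  # just a brand comparison; the payload is not touched
    conversions += 1
    return Tagged(src.data.decode(src.brand).encode(target_brand), target_brand)


s = Tagged("caf\u00e9".encode("utf-8"), "utf-8")
d = assign(s, None)         # into a "DynamicString": no conversion
t = assign(d, "utf-8")      # back to the same static brand: no conversion
u = assign(d, "utf-16-le")  # a different static brand: one conversion
print(conversions)          # 1
```

As long as every statically typed string uses the platform default, only 
the cheap comparison branch is ever taken.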

But now the user has a choice !

  - If he does not do anything regarding the encoding brand of his 
strings, he will not notice the existence of the DynamicString Type at 
all. Not even Performance-wise. (But he might encounter portability issues.)
  - if he decides that he wants to use a dedicated encoding brand in all 
or parts of his code, he of course needs to know what he is doing. This 
can result
    - in improved portability (if decently done)
    - in improved performance (if decently done), e.g. by using one-byte 
strings for compactly storing the information and two-byte strings for 
e.g. search loops, or by using the best fitting encoding in the loops in 
the user code while allowing auto-conversion when accessing the 
libraries in case the underlying OS enforces a different encoding.
    - in a disastrous increase of auto-conversions and thus performance 
degradation (if not decently done).

> An *efficient* implementation would be based on a single program-wide 
> string representation, with different encodings being handled only in 
> an exchange with external data sources.
Yep. But it would result in severe user code portability issues (see 
above). IMHO using DynamicString at the correct locations would not be 
(noticeably) less efficient but a lot more versatile.
>
>
> <Cassandra>
> After all I have the impression that the known RawByteString flaws 
> will never be fixed in Delphi, in order to encourage the users to take 
> the step to UnicodeString. Now the question is whether these flaws are 
> fixed in FPC, or whether Lazarus will become the first project that 
> definitely requires a complete move to UnicodeString, for reliable 
> operation.
> For best support of non-UTF-16 platforms I'd suggest to fix the flaws...
> </Cassandra>
I also don't think we will ever see a fix for the poor implementation of 
RawByteString (avoiding the word flaw and the suggestion of a bad 
purpose), because it would break existing user code.
Regarding fpc, "correcting the flaws" while keeping the name 
RawByteString would result in incompatibility with Delphi and would 
break code that is ported from Delphi.

That is why fpc would need to define an additional type name (e.g. 
"DynamicString") and encoding brand number (e.g. "CP_ANY" = $FF00) for a 
decently usable type for intermediately holding String content (see 
Wiki -> 
http://wiki.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support 
).

RawXxxString can be used for really "uncoded" data, as is done with 
old-style strings in a lot of applications. Even if the "seriously 
flawed" auto-conversion might be implemented in fpc for RawByteString 
(for Delphi compatibility), the user can easily avoid it by not directly 
combining RAW strings and differently, statically encoded strings in 
one operation.

-Michael