[fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
Michael Schnell
mschnell at lumino.de
Fri Nov 28 13:41:33 CET 2014
On 11/27/2014 03:44 PM, Hans-Peter Diettrich wrote:
>> The "universal paradigm" would allow for extensions (e.g. UTF-32,
>> multiple 16 Bit Code pages, an additional fully dynamic String type,
>> n-byte "un-encoded" string types), as I described in the Wiki page.
>
> Even if feasible, such arbitrary string storage can dramatically
> increase the number of implicit string conversions.
Of course it can do harm in that respect if the user is silly enough to
*explicitly* declare variables with an encoding brand without thinking
about what he is doing. But this is exactly the same when he just uses
what Delphi and fpc currently offer: if you arbitrarily assign code
pages to your 8-bit ("ANSI") string variables, you will force many
conversions.
Currently in Delphi, if you don't define special code pages, everything
will be UTF-16. So no unnecessary conversions.
In fpc (and maybe Lazarus as well), I suppose the approach currently in
the works is (unless the default behavior is changed by certain options):
- when compiling for Windows, "String" is UTF-16, and the RTL and LCL
ubiquitously use "String": So no unnecessary conversion
- when compiling for Linux, "String" is UTF-8, and the RTL and LCL
ubiquitously use "String": So no unnecessary conversion, either.
If this is done in the libraries (e.g. RTL and LCL) and in user code,
it would allow for as few conversions as possible and thus the best
performance. Here, you would need different library binaries, which
might or might not be a problem.
But of course the portability is very questionable (including, but not
limited to, the fact that the result of "pos" is different).
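To make the "pos" point concrete: a position counted in code units
depends on the encoding. The following is a minimal sketch in Python
(not Pascal), using a hypothetical helper pos_in_code_units that mimics
a 1-based search counted in code units. It only illustrates the effect
and is not how the fpc RTL implements Pos.

```python
def pos_in_code_units(needle: str, haystack: str, encoding: str, unit_size: int) -> int:
    """Return the 1-based index (in code units) of needle in haystack, or 0 if absent."""
    def units(text):
        # Split the encoded bytes into code units of unit_size bytes each.
        raw = text.encode(encoding)
        return [raw[i:i + unit_size] for i in range(0, len(raw), unit_size)]

    h, n = units(haystack), units(needle)
    for i in range(len(h) - len(n) + 1):
        if h[i:i + len(n)] == n:
            return i + 1
    return 0

text = "Grüße"
pos_utf8 = pos_in_code_units("ß", text, "utf-8", 1)       # 8-bit code units
pos_utf16 = pos_in_code_units("ß", text, "utf-16-le", 2)  # 16-bit code units
print(pos_utf8, pos_utf16)  # prints "5 4"
```

The same search thus yields 5 when the string is held as UTF-8 and 4
when it is held as UTF-16, because "ü" takes two code units in UTF-8
but only one in UTF-16.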
If (on top of this) the interfaces to libraries (including TStrings)
used "DynamicString" (encoding brand "CP_ANY"), no additional
conversions would be necessary, because all other Strings use the same
encoding brand (either UTF-16 or UTF-8, depending on the OS), and hence
the dynamic encoding of every DynamicString in use would always be
exactly that brand. Hence, IMHO, this would do no harm at all, as the
overhead of the code the compiler needs to generate just to check the
dynamic encoding brand and find that no conversion is necessary is
extremely small.
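The check described here can be sketched as follows. This is a rough
model in Python (not Pascal), and the names TaggedString and assign are
hypothetical. It is not the proposed compiler implementation, just an
illustration that a matching brand costs one comparison and no
re-encoding.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class TaggedString:
    """Hypothetical stand-in for a string that carries its encoding brand at run time."""
    data: bytes
    brand: str  # e.g. "utf-8" or "utf-16-le"

def assign(value: TaggedString, target_brand: str) -> Tuple[TaggedString, bool]:
    """Model assigning a dynamically branded string to a statically branded variable.

    Returns the resulting string and whether a real conversion took place.
    """
    if value.brand == target_brand:
        # Brands match: only the comparison above was paid for.
        return value, False
    # Brands differ: decode and re-encode, which is the expensive path.
    text = value.data.decode(value.brand)
    return TaggedString(text.encode(target_brand), target_brand), True

s = TaggedString("Grüße".encode("utf-8"), "utf-8")
same, converted_same = assign(s, "utf-8")        # no conversion
other, converted_other = assign(s, "utf-16-le")  # one conversion
```

As long as every string in the program carries the platform's default
brand, only the first branch is ever taken.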
But now the user has a choice!
- If he does not do anything regarding the encoding brand of his
strings, he will not notice the existence of the DynamicString Type at
all. Not even Performance-wise. (But he might encounter portability issues.)
- If he decides that he wants to use a dedicated encoding brand in all
or parts of his code, he of course needs to know what he is doing. This
can result
  - in improved portability (if decently done)
  - in improved performance (if decently done), e.g. by using one-byte
    strings for compactly storing the information and two-byte strings
    for e.g. search loops, or by using the best-fitting encoding in the
    loops in the user code while allowing auto-conversion when accessing
    the libraries in case the underlying OS enforces a different encoding
  - in a disastrous increase of auto-conversions and thus performance
    degradation (if not decently done).
> An *efficient* implementation would be based on a single program-wide
> string representation, with different encodings being handled only in
> an exchange with external data sources.
Yep. But it would result in severe user code portability issues (see
above). IMHO using DynamicString at the correct locations would not be
(noticeably) less efficient but a lot more versatile.
>
>
> <Cassandra>
> After all I have the impression that the known RawByteString flaws
> will never be fixed in Delphi, in order to encourage the users to take
> the step to UnicodeString. Now the question is whether these flaws are
> fixed in FPC, or whether Lazarus will become the first project that
> definitely requires a complete move to UnicodeString, for reliable
> operation.
> For best support of non-UTF-16 platforms I'd suggest to fix the flaws...
> </Cassandra>
I also don't think we will ever see a fix for the poor implementation of
RawByteString (avoiding the word flaw and the suggestion of a bad
purpose), because it would break existing user code.
Regarding fpc, "correcting the flaws" while keeping the name
RawByteString would result in incompatibility with Delphi and would
break code that is ported from Delphi.
That is why fpc would need to define an additional type name (e.g.
"DynamicString") and encoding brand number (e.g. "CP_ANY" = $FF00) for a
decently usable type for temporarily holding string content (see
Wiki ->
http://wiki.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support
).
RawXxxString can be used for really "un-encoded" data, as done with
old-style strings in a lot of applications. Even if the "seriously
flawed" auto-conversion is implemented in fpc for RawByteString (for
Delphi compatibility), the user can easily avoid it by not directly
combining RAW strings and strings with a different static encoding in
one operation.
-Michael