[fpc-devel] Unicode RTL
Daniël Mantione
daniel.mantione at freepascal.org
Wed Nov 16 17:25:29 CET 2005
On Wed, 16 Nov 2005, Tomas Hajny wrote:
> You're right that strings are used everywhere, but I don't think that this
> really means that you need to add special support for widestrings
> everywhere. In many places you can pass a DBCS/MBCS string to it today
> (e.g. encoded using UTF-8) and it wouldn't cause any harm. From my point
> of view, you need some kind of special support mainly for sort operations
> (which includes your TList) and then for visual classes (length of text
> for controls, etc.). In addition, you certainly need to have proper
> routines for I/O. However, e.g. your particular example in the forum
> discussion is IMHO conceptually wrong. Turning a string around just cannot
> be performed this way (this is unsupported by design for DBCS/MBCS texts;
> not even mentioning the fact that the example is "somewhat" artificial).
> People who want to perform such an operation need to analyse and design
> the implementation properly, probably by translating the ansistring to a
> widestring first in this case. How this translation is performed is
> another question and it depends on the programmer's decision. It could be
> that the string already _is_ a UCS-2 string (and "translation" to widestring
> means that you just copy it byte by byte), it could be UTF-8 and it could
> be even a simple string created in a particular codepage (SBCS). This is
> the programmer's decision (trade-off between the widest support and the best
> performance); the same way that he has to decide whether he'd use
> multi-platform APIs or native API of a particular platform, or whether
> he'd use/import XxxxW or XxxxA API function for his Win32 application.
>
> Maybe I'm still overlooking the real issues. Please, give me more concrete
> examples which cannot be resolved at the moment, we could discuss them
> (and then possibly come to a conclusion that separate RTL would be
> better/necessary).
*Sigh*, this is going to be a long e-mail about a subject I'm not much
interested in myself. Here we go.
There are a few models you can use:
Model 1: Be ignorant about multibyte character sets.
----------------------------------------------------
UTF-8 was designed to behave well with programs that assume US-ASCII,
therefore you get reasonable results.
If you assume nothing about the ordering of characters in the string, do
not try to break it into pieces, and do not modify the characters (e.g.
uppercasing), things work out in many situations.
The limitation of this model is that there are situations where the
ordering is important, strings need to be broken up into pieces, etcetera.
Reversing a string is an extreme example where a string needs to be broken
into pieces, but there are many more examples.
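To illustrate, a quick sketch of what goes wrong when a byte-ignorant
reverse meets UTF-8 (the 'ë' is written out as its two raw UTF-8 bytes so
the example does not depend on the source file encoding):

  var
    s, r: ansistring;
    i: integer;
  begin
    s := 'Dani'#$C3#$AB'l';        { 'Daniël' stored as UTF-8: 7 bytes }
    r := '';
    for i := length(s) downto 1 do
      r := r + s[i];               { reverses bytes, not characters }
    writeln(r);                    { the middle now holds #$AB#$C3: invalid UTF-8 }
  end.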
Obviously, if code could stay ignorant about the charset, people wouldn't
be asking about Unicode support.
You can also be partially ignorant about charsets. I.e., you leave pos,
insert etc. as they are and leave it up to the programmer not to do tricks
like reversing strings.
In the case you are ignorant, pos('ë','Daniël'); is a substring search of
a 2-byte string in a 7-byte string.
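Or, as a compilable sketch (again with the UTF-8 bytes written out by hand
to stay independent of the source file encoding):

  var
    s: ansistring;
  begin
    s := 'Dani'#$C3#$AB'l';        { 'Daniël' as UTF-8: 7 bytes, 'ë' = #$C3#$AB }
    writeln(length(s));            { 7: the byte count, not the 6 characters }
    writeln(pos(#$C3#$AB, s));     { 5: a plain byte-wise substring search }
  end.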
Model 2: Use an internal encoding
---------------------------------
The UCS-2 widestring stuff is an example of this. You could also use
UCS-4. The advantage is that you can do any operation on the string you
like without getting into trouble. The trouble is the large number of
conversions when talking to the external world.
In this case pos('ë','Daniël'); is a widechar search into a widestring.
You can also decide that your internal encoding is UTF-8, no problem. In
that case pos('ë','Daniël') is a widechar search into an ansistring.
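A quick sketch of the widestring case (assuming the widestring overload of
pos; the 'ë' is built via widechar($00EB) to avoid source encoding issues):

  var
    w, e: widestring;
  begin
    e := widechar($00EB);          { 'ë', U+00EB, a single UTF-16 code unit }
    w := 'Dani';
    w := w + e + 'l';              { 'Daniël' as 6 widechars }
    writeln(length(w));            { 6: one widechar per character }
    writeln(pos(e, w));            { 5: a widechar search, not a byte search }
  end.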
Whether UCS-2 or UTF-8 is preferable is a matter of taste. You can walk
through UCS-2 strings with for-loops. You cannot do this with UTF-8,
unless we implement [] in O(n) time (or you are ignorant, as in model 1).
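For example, the kind of loop that is harmless on UCS-2 but impossible on
UTF-8 without decoding (a sketch that only uppercases the ASCII range):

  var
    w, e: widestring;
    i: integer;
  begin
    e := widechar($00EB);          { 'ë' as one UTF-16 code unit }
    w := 'dani';
    w := w + e + 'l';
    for i := 1 to length(w) do     { w[i] is one character, O(1) access }
      if (w[i] >= 'a') and (w[i] <= 'z') then
        w[i] := widechar(ord(w[i]) - 32);
    { prints DANIëL; a similar loop over UTF-8 bytes would land
      in the middle of the two-byte 'ë' sequence }
    writeln(w);
  end.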
With UCS-2 you can reuse a lot of code by just changing the string type.
You must include everything twice (either by making a separate Unicode
RTL, or by doubling code in your units).
With UTF-8 you can reuse a lot of code that is ignorant about character
sets.
However, there is one big caveat: this only works up to one level.
Take the Tstringlist. We can make a Twidestringlist, or we can add methods
that do UCS-2/UTF-8 sorting and other operations.
The consequence is that all the code that uses the Tstringlist must make a
distinction between 8-bit and UTF-8. Any code that uses Tstringlist should
decide whether it is going to call the 8-bit or the UTF-8 methods. It is
even worse: if that code is to be reused, it needs to offer the choice to
the programmer as well; in other words, it needs to have both 8-bit and
UTF-8 methods itself. In short, you still need to duplicate an awful lot
of code.
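A sketch of what that duplication looks like in practice (TMyStringList,
TAddressBook and the Sort* names are all invented for illustration, and
the UTF-8 sort body is only a placeholder):

  {$mode objfpc}{$H+}
  uses Classes;

  type
    TMyStringList = class(TStringList)
      procedure SortAnsi;
      procedure SortUTF8;
    end;

    { every class that wraps the list has to offer the choice again }
    TAddressBook = class
      Names: TMyStringList;
      procedure SortNamesAnsi;
      procedure SortNamesUTF8;
    end;

  procedure TMyStringList.SortAnsi;
  begin
    Sort;                          { the existing byte-wise sort }
  end;

  procedure TMyStringList.SortUTF8;
  begin
    Sort;                          { placeholder: would need a UTF-8 aware comparison }
  end;

  procedure TAddressBook.SortNamesAnsi;
  begin
    Names.SortAnsi;
  end;

  procedure TAddressBook.SortNamesUTF8;
  begin
    Names.SortUTF8;
  end;

  begin
  end.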
What convinced me that two RTLs might be a better choice is that much of
the source code remains intact and does not need to be duplicated. New
code could take advantage immediately. The decision whether the code is
going to be used in an 8-bit environment (e.g. MS-DOS) and therefore be
8-bit, or in a Unicode environment (e.g. Windows NT) and therefore be 16
bits per character, is solved by a few ifdefs. There won't even be any
overhead in the MS-DOS executables (although the programmer can use
widestrings if he wishes).
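To make the idea concrete (a sketch only; UNICODERTL and TRTLString are
invented names for whatever define and alias the RTL would actually use):

  type
  {$ifdef UNICODERTL}
    TRTLString = widestring;       { Unicode environment: 16 bits per character }
  {$else}
    TRTLString = ansistring;       { 8-bit environment such as MS-DOS }
  {$endif}

  procedure Greet(const name: TRTLString);
  begin
    writeln('Hello, ', name);
  end;

  begin
    Greet('world');                { the same source compiles for either RTL }
  end.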
Daniël