[fpc-devel] Unicode RTL

Daniël Mantione daniel.mantione at freepascal.org
Wed Nov 16 17:25:29 CET 2005



On Wed, 16 Nov 2005, Tomas Hajny wrote:

> You're right that strings are used everywhere, but I don't think that this
> really means that you need to add special support for widestrings
> everywhere. In many places you can pass a DBCS/MBCS string to it today
> (e.g. encoded using UTF-8) and it wouldn't cause any harm. From my point
> of view, you need some kind of special support mainly for sort operations
> (which includes your TList) and then for visual classes (length of text
> for controls, etc.). In addition, you certainly need to have a proper
> routines for I/O. However, e.g. your particular example in the forum
> discussion is IMHO conceptually wrong. Turning a string around just cannot
> be performed this way (this is unsupported by design for DBCS/MBCS texts;
> not even mentioning the fact that the example is "somewhat" artificial).
> People who want to perform such an operation need to analyse and design
> the implementation properly, probably by translating the ansistring to a
> widestring first in this case. How this translation is performed is
> another question and it depends on programmer's decision. It could be that
> the string already _is_ an UCS2 string (and "translation" to widestring
> means that you just copy it byte by byte), it could be UTF-8 and it could
> be even a simple string created in particular codepage (SBCS). This is
> programmer's decision (trade-off between the widest support and the best
> performance); the same way that he has to decide whether he'd use
> multi-platform APIs or native API of a particular platform, or whether
> he'd use/import XxxxW or XxxxA API function for his Win32 application.
> 
> Maybe I'm still overlooking the real issues. Please, give me more concrete
> examples which cannot be resolved at the moment, we could discuss them
> (and then possibly come to a conclusion that separate RTL would be
> better/necessary).

*Sigh*, this is going to be a long e-mail about a subject that doesn't 
interest me much. Here we go.

There are a few models you can use:

Model 1: Be ignorant about multibyte character sets.
----------------------------------------------------

UTF-8 was designed to behave well with programs that assume US-ASCII, 
therefore you get reasonable results.

If you assume nothing about the ordering of characters in the string, do 
not try to break it into pieces, and do not modify it (e.g. uppercase it), 
things work out in many situations.

The limitation of this model is that there are situations where the 
ordering is important, strings need to be broken up into pieces, etc.
Reversing a string is an extreme example where strings need to be broken 
into pieces, but there are many more examples.
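
To make the reversal example concrete, here is a minimal sketch. It is 
written in Python purely for illustration (it is not FPC RTL code) and the 
variable names are made up. It treats the UTF-8 text as raw bytes, the way 
model 1 does, and reverses it byte by byte:

```python
# Model 1 view: the string is just a sequence of bytes.
name = "Dani\u00ebl".encode("utf-8")   # 'Daniël': 7 bytes, 'ë' occupies two (0xC3 0xAB)

# Naive reversal swaps the two bytes of 'ë', so the result is not valid UTF-8.
reversed_bytes = name[::-1]

try:
    reversed_bytes.decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

# Reversing after decoding keeps each character intact.
reversed_chars = name.decode("utf-8")[::-1]
```

The byte-wise result cannot be decoded any more, while reversing the 
decoded characters gives the expected text.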

Obviously, if it were enough for code to be ignorant about the charset, 
people wouldn't be asking about Unicode support.

You can also be partially ignorant about charsets. I.e., you leave pos, 
insert etc. as they are and leave it up to the programmer not to do tricks 
like reversing strings.

If you are ignorant, pos('ë','Daniël'); is a substring search for a 
2-byte string in a 7-byte string.
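
A small sketch of what that byte-level search looks like, again in Python 
only as an illustration; the names are hypothetical and the "+ 1" merely 
mimics the 1-based result of Pascal's pos:

```python
haystack = "Dani\u00ebl".encode("utf-8")   # 'Daniël' as 7 bytes
needle = "\u00eb".encode("utf-8")          # 'ë' as 2 bytes

# bytes.find is 0-based; add 1 to get a pos-style 1-based result.
e_byte_pos = haystack.find(needle) + 1

# Everything after the 'ë' is shifted by one byte compared to its
# character position: 'l' is the 6th character but the 7th byte.
l_byte_pos = haystack.find(b"l") + 1
l_char_pos = haystack.decode("utf-8").find("l") + 1
```

The search itself works, but the positions it returns are byte offsets, 
so they only match character positions as long as nothing non-ASCII comes 
before the match.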

Model 2: Use an internal encoding
---------------------------------

The UCS-2 widestring stuff is an example of this. You could also use 
UCS-4. The advantage is that you can do any operation you like on the 
string without getting into trouble. The drawback is that a lot of 
conversions are needed when talking to the external world.

In this case pos('ë','Daniël'); is a widechar search in a widestring.
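
As a rough sketch of this model (Python for illustration only; to_ucs2 is 
a hypothetical helper, not an RTL routine), the text is held as 16-bit 
code units, searched per unit, and converted when it leaves the program:

```python
def to_ucs2(text):
    """Encode text as a list of 16-bit code units, one per character."""
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("character outside the range UCS-2 can represent")
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

wide = to_ucs2("Dani\u00ebl")        # 'Daniël': 6 code units for 6 characters
wide_pos = wide.index(0x00EB) + 1    # 1-based position of 'ë', like Pascal's pos

# Talking to the external world needs a conversion, e.g. to UTF-8 for output.
external = "".join(chr(unit) for unit in wide).encode("utf-8")
```

Here the position is a true character position, at the price of the 
conversion step on the way out.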

You can also decide that your internal encoding is UTF-8, no problem. In 
that case pos('ë','Daniël') is a widechar search in an ansistring.

Whether UCS-2 or UTF-8 is preferable is a matter of taste. You can walk 
through UCS-2 strings with for-loops. You cannot do this with UTF-8, 
unless we implemented [] in O(n) time (or you are ignorant, model 1).
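
To show why [] becomes O(n) on UTF-8, here is a minimal sketch (Python for 
illustration; utf8_char_at is a made-up name and the code assumes the 
input is valid UTF-8). It has to scan from the start and count lead bytes 
to find the requested character:

```python
def utf8_char_at(data, index):
    """Return the index-th character (1-based, as in Pascal) of UTF-8 bytes."""
    count = 0
    start = None
    for i, byte in enumerate(data):
        if byte & 0xC0 != 0x80:          # not a continuation byte: a character starts here
            if start is not None:
                return data[start:i].decode("utf-8")   # next character begins, so we are done
            count += 1
            if count == index:
                start = i
    if start is not None:
        return data[start:].decode("utf-8")            # requested character was the last one
    raise IndexError("character index out of range")

raw = "Dani\u00ebl".encode("utf-8")
fifth = utf8_char_at(raw, 5)    # the two-byte 'ë'
sixth = utf8_char_at(raw, 6)    # 'l', which is byte 7
```

A for-loop that calls this for every index would take quadratic time, 
which is the price of keeping character semantics on a UTF-8 string.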

With UCS-2 you can reuse a lot of code by just changing the string type. 
You must then include everything twice (either by making a separate 
Unicode rtl, or by doubling the code in your units).

With UTF-8 you can reuse a lot of code that is ignorant about character 
sets.

However, there is one big caveat: this reuse only works up to one level.

Take the Tstringlist. We can make a Twidestringlist, or we can add methods 
that do UCS-2/UTF-8 sorting and other operations.

The consequence is that all the code that uses the Tstringlist must make 
a distinction between 8-bit and UTF-8. Any code that uses Tstringlist 
has to decide whether it is going to call the 8-bit or the UTF-8 methods. 
It is even worse: if that code is to be reused, it needs to offer the 
same choice to the programmer, in other words, it needs to have both 
8-bit and UTF-8 methods as well.
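
A hedged sketch of how that choice leaks upwards (Python for illustration; 
StringList, sort_8bit, sort_utf8 and sorted_names are all hypothetical 
names, and Latin-1 is just an assumed 8-bit codepage):

```python
class StringList:
    """Hypothetical stand-in for a Tstringlist that stores raw byte strings."""

    def __init__(self, items):
        self.items = list(items)

    def sort_8bit(self):
        # Assumes one byte per character (Latin-1); case-insensitive compare.
        self.items.sort(key=lambda raw: raw.decode("latin-1").casefold())

    def sort_utf8(self):
        # Decodes UTF-8 first so that the compare works on whole characters.
        self.items.sort(key=lambda raw: raw.decode("utf-8").casefold())


def sorted_names(raw_names, utf8):
    """Code built on the list must expose the same choice to its own callers."""
    names = StringList(raw_names)
    if utf8:
        names.sort_utf8()
    else:
        names.sort_8bit()
    return names.items


data = ["\u00e9a".encode("utf-8"), "\u00c9b".encode("utf-8")]   # 'éa', 'Éb'
as_utf8 = sorted_names(data, utf8=True)
as_8bit = sorted_names(data, utf8=False)
```

With this input the two methods give different orders, so sorted_names 
cannot hide the decision; it needs the utf8 parameter, and so would 
anything built on top of it.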

In other words, you still need to duplicate an awful lot of code.

What convinced me that two rtl's might be a better choice is that much of 
the source code remains intact and does not need to be duplicated. New 
code could take advantage immediately. The decision whether the code is 
going to be used in an 8-bit environment (e.g. MS-DOS) and will be 8-bit, 
or in a Unicode environment (e.g. Windows NT) and will be 16 bits per 
character, is solved by a few ifdefs. There won't even be any overhead in 
the MS-DOS executables (although the programmer can use widestrings if he 
so wishes).

Daniël

