[fpc-devel] Unicode RTL

Wed Nov 16 23:30:30 CET 2005

On Wed, 16 Nov 2005 17:25:29 +0100 (CET)
Daniël Mantione <daniel.mantione at freepascal.org> wrote:

> 
> 
> On Wed, 16 Nov 2005, Tomas Hajny wrote:
> 
> > You're right that strings are used everywhere, but I don't think that
> > this really means that you need to add special support for widestrings
> > everywhere. In many places you can pass a DBCS/MBCS string to it today
> > (e.g. encoded using UTF-8) and it wouldn't cause any harm. From my point
> > of view, you need some kind of special support mainly for sort
> > operations (which includes your TList) and then for visual classes
> > (length of text for controls, etc.). In addition, you certainly need to
> > have proper routines for I/O. However, e.g. your particular example in
> > the forum discussion is IMHO conceptually wrong. Turning a string around
> > just cannot be performed this way (this is unsupported by design for
> > DBCS/MBCS texts; not even mentioning the fact that the example is
> > "somewhat" artificial). People who want to perform such an operation
> > need to analyse and design the implementation properly, probably by
> > translating the ansistring to a widestring first in this case. How this
> > translation is performed is another question and it depends on the
> > programmer's decision. It could be that the string already _is_ a UCS-2
> > string (and "translation" to widestring means that you just copy it byte
> > by byte), it could be UTF-8 and it could even be a simple string created
> > in a particular codepage (SBCS). This is the programmer's decision (trade-off
> > between the widest support and the best performance); the same way that
> > he has to decide whether he'd use multi-platform APIs or native API of a
> > particular platform, or whether he'd use/import XxxxW or XxxxA API
> > function for his Win32 application.
> > 
> > Maybe I'm still overlooking the real issues. Please, give me more
> > concrete examples which cannot be resolved at the moment, we could
> > discuss them (and then possibly come to a conclusion that separate RTL
> > would be better/necessary).
> 
> *Sigh*, this is going to be a long e-mail about a subject that doesn't
> interest me much. Here we go.
> 
> There are a few models you can use:
> 
> Model 1: Be ignorant about multibyte character sets.
> ----------------------------------------------------
> 
> UTF-8 was designed to behave well with programs that assume US-ASCII,
> therefore you get reasonable results.
> 
> If you assume nothing about the ordering of characters in the string, do
> not try to break it into pieces, and do not modify it (e.g. uppercase it),
> things work out in many situations.
> 
> The limitation of this model is that there are situations where the
> ordering is important, strings need to be broken up into pieces, et cetera.
> Reversing a string is an extreme example where strings need to be broken
> into pieces, but there are many more examples.
> 
> Obviously, if code should be ignorant about the charset, people wouldn't 
> be asking about Unicode support.
> 
> You can also be partially ignorant about charsets. I.e., you leave pos, 
> insert etc as is and leave it up to the programmer not to do tricks like 
> reversing strings.
> 
> In the case you are ignorant, pos('ë','Daniël'); is a substring search
> for a 2-byte string in a 7-byte string.
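
A small sketch of the byte counts in that last example. Python is used here only as an illustration (it is not FPC code); its bytes type stands in for an ansistring holding UTF-8, and the variable names are made up. It suggests why the byte-wise search still works while byte-wise reversal does not:

```python
# Illustration only: Python bytes stand in for a Pascal ansistring holding UTF-8.
name = "Dani\u00ebl".encode("utf-8")   # 7 bytes: D a n i <2 bytes for ë> l
needle = "\u00eb".encode("utf-8")      # 2 bytes: 0xC3 0xAB

# A byte-wise substring search still finds the right place (0-based offset 4;
# Pascal's pos would report 5 because it counts from 1).
byte_offset = name.find(needle)

# Reversing byte by byte puts the continuation byte before its lead byte,
# so the result is no longer valid UTF-8.
reversed_bytes = name[::-1]
try:
    reversed_bytes.decode("utf-8")
    reversal_is_valid = True
except UnicodeDecodeError:
    reversal_is_valid = False

print(len(needle), len(name), byte_offset, reversal_is_valid)
```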

I don't understand why you connect UTF-8 with 'ignorant of MBCS'.
UCS-2 can be used just as ignorantly as UTF-8.
Even UCS-4 and UTF-32 will not solve all problems. Think about Arabic
right-to-left text.

You must extend old source code if you want to support all languages anyway.
Widestrings let you keep some old code and introduce some new problems.
The same is true for UTF-8.
That's a matter of choice and depends heavily on the old code.
More importantly, a widestring sometimes needs two widechars for one
character. So you have MBCS problems there too.
For Lazarus we decided to use UTF-8, because:
- UCS-4/UTF-32 has too much memory overhead. That means we must use a 1- or
  2-byte encoding, which implies that we have to implement MBCS functions
  anyway.
- UTF-8 works with ASCII without conversion.
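
Both points can be checked with a few lines. This is only a sketch in Python rather than Pascal; the helper name utf16_units is made up, and "widechar" is taken to mean one 16-bit UTF-16 code unit:

```python
# Illustration only: count 16-bit UTF-16 code units ("widechars") per character.
def utf16_units(text):
    return len(text.encode("utf-16-le")) // 2

bmp_char = "\u00eb"          # ë, inside the Basic Multilingual Plane
non_bmp_char = "\U0001D11E"  # musical G clef, outside the BMP

units_bmp = utf16_units(bmp_char)          # fits in one 16-bit unit
units_non_bmp = utf16_units(non_bmp_char)  # needs a surrogate pair

# Pure ASCII text has identical bytes in ASCII and UTF-8, so nothing to convert.
ascii_text = "WriteLn"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

print(units_bmp, units_non_bmp, same_bytes)
```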

> Model 2: Use an internal encoding
> ---------------------------------
> 
> The UCS-2 widestring stuff is an example of this. You could also use
> UCS-4. The advantage is that you can do any operation on the string you
> like without getting into trouble. The downside is a lot of conversions
> when talking to the external world.
> 
> In this case pos('ë','Daniël'); is a widechar search into a widestring.
> 
> You can also decide that your internal encoding is UTF-8, no problem. In 
> that case pos('ë','Daniël') is a widechar search into an ansistring.
> 
> The desirability of UCS-2 versus UTF-8 is a matter of taste. You can walk
> through UCS-2 strings with for-loops. You cannot do this with UTF-8,
> unless we would implement [] in O(n) time (or you are ignorant, model 1).
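
To show what an O(n) [] would involve, here is a rough sketch in Python (not FPC code). The function name utf8_char_at is hypothetical and the input is assumed to be valid UTF-8. Finding the n-th character means scanning from the start and counting lead bytes:

```python
def utf8_char_at(data, index):
    """Return the bytes of the index-th code point (0-based) in UTF-8 data."""
    count = -1
    start = None
    for pos, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; anything else starts a character.
        is_lead = (byte & 0xC0) != 0x80
        if is_lead:
            count += 1
            if count == index:
                start = pos                # wanted character begins here
            elif count == index + 1:
                return data[start:pos]     # next character begins: stop
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:]                    # wanted character was the last one


text = "Dani\u00ebl".encode("utf-8")
fifth = utf8_char_at(text, 4).decode("utf-8")   # the 2-byte character ë
sixth = utf8_char_at(text, 5).decode("utf-8")   # the final l
```

A for-loop that calls this for every index would be quadratic, which is why UTF-8 code usually walks the string once instead of indexing into it.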
> 
> With UCS-2 you can reuse a lot of code by just changing the string type.
> However, you must include everything twice (either by making a separate
> Unicode RTL, or by doubling code in your units).
> 
> With UTF-8 you can reuse a lot of code that is ignorant about character 
> sets.
> 
> However, there is one big caveat: this only works up to one level.
> 
> Take the Tstringlist. We can make a Twidestringlist, or we can add methods
> that do UCS-2/UTF-8 sorting and other operations.
> 
> The consequence is that all the code that uses the Tstringlist must
> distinguish between 8-bit and UTF-8. Any code that uses Tstringlist
> should decide whether it is going to call the 8-bit or UTF-8 methods. It
> is even worse: if that code is to be reused, it needs to offer that
> choice to the programmer as well. In other words, it needs to have both
> 8-bit and UTF-8 methods as well.
> 
> In other words, you still need to duplicate an awful lot of code.

That is the same for 8-bit and widestring.

> What convinced me two RTLs might be a better choice is that much of the
> source code remains intact and does not need to be duplicated. New code
> could take advantage immediately. The decision whether the code is going to
> be used in an 8-bit environment (e.g. MS-DOS) and will be 8-bit, or in a
> Unicode environment (e.g. Windows NT) and will be 16 bits per character, is
> solved by a few ifdefs. There won't even be any overhead on the MS-DOS
> executables (although the programmer can use widestrings if he wishes
> so).

Please: No two RTLs.

Mattias