[fpc-devel] String and UnicodeString and UTF8String

Thu Jan 13 18:57:00 CET 2011

Marco van de Voort wrote:
> In our previous episode, Hans-Peter Diettrich said:
>>>>> "non-native" strings, it can also be a performance win).
>>>> IMO a single encoding, i.e. UTF-8, can cover all cases.
>>> Well, for starters, it doesn't cover the existing Delphi/unicode codebase.
>> Because it's bound to UTF-16? That's not a problem, because WideString 
>> will continue to exist, and the corresponding conversions are still 
>> inserted by the compiler.
> 
> That is DIY compatibility, or, in other words, no compatibility.

I still don't understand the problem :-(

> Widestring will also grind the application to a halt due to being COM based
> on Windows.

How so?

>> When system encoding changes with the target platform, indexed access to 
>> such strings can lead to different results. Unless the compiler can read 
>> the coder's mind...
> 
> You don't have to. The Delphi model provides a stringtype for the system
> encoding, and then as such all strings from the system can be labeled. With
> other stringtypes, the necessary conversions can be edited.

Indexed string access produces different results for Ansi and UTF-8 system 
encoding. Such code is not portable, and neither is the data (ini files). 
Allowing UTF-8 as the system encoding will frustrate Windows users (I don't 
know whether Windows allows such a system encoding), and Linux users are 
frustrated when UTF-8 is disallowed.
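
To make the indexing problem concrete, here is a small sketch. It uses Python only as a neutral way to look at the raw bytes; the sample word and the variable names are my own assumptions, and cp1252 stands in for a typical Windows "Ansi" codepage. Note that Python indexes from 0, where Pascal strings index from 1.

```python
# The same text, stored as bytes under two possible "system" encodings.
text = "Grüße"

ansi = text.encode("cp1252")  # one byte per character here
utf8 = text.encode("utf-8")   # 'ü' and 'ß' take two bytes each

# Byte lengths differ, so Length(s) would differ as well.
print(len(ansi), len(utf8))

# The same index hits different data:
# in cp1252 the third byte is the complete character 'ü' (0xFC),
# in UTF-8 it is only the lead byte of 'ü' (0xC3).
print(hex(ansi[2]), hex(utf8[2]))

# Cutting the UTF-8 bytes at a fixed position can split a character.
try:
    utf8[:3].decode("utf-8")
except UnicodeDecodeError:
    print("truncated UTF-8 sequence")
```

Code that assumes "one index = one character" works on the Ansi bytes and silently breaks on the UTF-8 bytes, which is the portability problem described above.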

The only conclusion: using the OS encoding restricts the code to running on 
a single machine, or on similarly configured machines.

The group of users who accept this restriction will be happy with a 
single AnsiString type and no implicit conversions. Without implicit 
conversions such a string type can hold UTF-8 as well.

> Likewise, e.g. win32 console routines can be labeled with OEMString. (Since
> windows uses a different default encoding for the console)

This either implies OEM encoding as the system encoding of Win32 console 
applications, or the use of multiple codepages, as before. But AFAIK the 
Win32 console also implements a "W" interface, so it's up to the user to 
use whatever is more appropriate for their code.
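
A short sketch of why the console case needs either a labeled type or a conversion. Python is again used only to inspect bytes; cp437 and cp1252 are assumed here as examples of an OEM and an Ansi codepage, and the actual pair depends on the Windows locale.

```python
# The same character has different byte values in the two codepages.
ch = "é"

oem_bytes = ch.encode("cp437")    # example OEM (console) codepage
ansi_bytes = ch.encode("cp1252")  # example Ansi (GUI) codepage

print(oem_bytes, ansi_bytes)

# Writing Ansi bytes to a console that interprets them as OEM
# shows a different character: no error, just wrong output.
misread = ansi_bytes.decode("cp437")
print(misread == ch)

# An explicit conversion (what a labeled string type would trigger
# implicitly) restores the intended character.
converted = ansi_bytes.decode("cp1252").encode("cp437")
print(converted == oem_bytes)
```

With untyped byte strings nothing records which of the two codepages the data is in, so the conversion has to be done by hand at every console call.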

The RTL has to distinguish between system-wide "filesystem" and "GUI" 
encoding, in file handling (CreateFile...).

>>>> Why spend time in the design of multiple RTL/LCL versions, when 
>>>> a single version will be perfectly sufficient?
>>> Why spend 13 years being compatible when you can throw it away in a
>>> second?
>> It's sufficient to throw away what's no longer needed :-)
> 
> The previous message from Jeff shows that even shortstring is still in major
> production use. Nothing is unused and can be clipped without a long winded
> transition, or Delphi 2009 like painful breaks.

It's all about the well known dilemma:
- force (possibly many) implicit conversions, or
- supply multiple RTL/LCL versions, or
- break legacy user code by moving to a different (but again unique) 
string type.

> Moreover, these discussions are useless since you know as well as I do that
> no one stringtype will ever satisfy everybody. So IMHO it is time to take
> the consequences from the 500 posts on the unicode subject on this and
> other FPC/Lazarus lists and start thinking about solutions to
> manage that, instead of reiterating the "one type to rule them all" mantra
> ad infinitum.

The discussion is only about the pros and cons of the various possible 
solutions. I.e. it should reveal the critical cases and consequences, 
that have to be considered and handled in every implementation.

The implementation can choose any model. Different models can be 
implemented as well, so that the final decision about the new standard 
can be delayed, until the models can be tested in real world applications.

One model has already been implemented: UTF-8. It may need some 
additions/improvements, like a *hard* separation of AnsiString from 
UTF8String, and nothing has to be thrown away.

DoDi