[fpc-devel] Unicode and UTF8String

Mon Dec 1 16:27:24 CET 2008

Marco van de Voort wrote:
> In our previous episode, Martin Friebe said:
>   
>> Of course they are still there, to be used in the few parts of your 
>> code, where you specialize on whatever string type you deal with.
>> But otherwise, using  RTLString  IMHO will abandon this part of pascal 
>> syntax.
>>     
> It removes ASCII legacy. I don't see you complaining about the fact that
> char is not 8 bit anymore, and that that abandons that part of the pascal
> syntax.
>   
It does not abandon the syntax.  It only adds to its meaning (*adds*; 
any existing meaning is unaltered).

I can still do:  foo[1]  for *any* type of string. (well yes even 
RTLstring, but see below)
- If string happens to be an old ascii string, that still works as it 
always has
- If string happens to be any unicode => that is still the same syntax, 
but with a new meaning.
  The new meaning does not break anything, because it only applies to 
new types.
  It is usable too, because I know I am dealing with codepoints or sub 
code points, and I know how they look and how to identify them.
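
As a rough illustration of "same syntax, new meaning", the sketch below looks at the first element of one character in three encodings. It is written in Python only to show the raw values; the mapping to an ANSI string, a UTF8String and a widestring is an assumed analogy, and the variable names are made up for the example. Note that Python indexes from 0 where Pascal's foo[1] indexes from 1.

```python
# One code point, U+00E9 (e with acute accent), stored three ways.
text = "\u00e9"

latin1 = text.encode("latin-1")    # single-byte encoding: b'\xe9'
utf8 = text.encode("utf-8")        # two bytes: b'\xc3\xa9'
utf16 = text.encode("utf-16-le")   # one 16-bit code unit: b'\xe9\x00'

# "foo[1]" for each storage form (index 0 in Python).
first_latin1 = latin1[0]                            # 0xE9, the whole character
first_utf8 = utf8[0]                                # 0xC3, only a lead byte
first_utf16 = int.from_bytes(utf16[0:2], "little")  # 0x00E9, one code unit

print(hex(first_latin1), hex(first_utf8), hex(first_utf16))
```

In each case the first element is usable, but only because the encoding is known when the code is written.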

The introduction of RTLString is fine. I do say it is a good thing. 
RTLString does not interfere with the above. In fact even for RTLstring 
the syntax  foo[1]  does exist. It is just not useful. If I treat it as 
a utf8 sub code point, I can be wrong. If I treat it as ascii, I can be 
wrong. If I treat it as UTF16, I can be wrong.

My argument was not against RTLString. However it was my understanding 
that RTL functions will "enforce" RTLString. That they will only exist 
for RTLString, and they will *not* exist for other string types.
That I would call enforcing RTLString, because of penalties on using 
other string types.

I acknowledge that if the end result of calling the RTL function is an 
OS call, the conversion/penalty is always there. But not every RTL 
function ends up in an OS call.

>> I admit that the Problem started (and that has been discussed more than 
>> enough) starts with UTF8string (yes even with utf16 string). But in this 
>> case those functions became a new, but predictable meaning. I can do 
>> utf8string[1], and I can use the result. Only I have to be aware what it 
>> means.
>>     
>
> Yes. As widestring[1] also requires interpretation. That's unicode.
>   
See above: Yes, it requires interpretation. But it allows me to do so.

I cannot see how I can interpret RtlString[1]. If the result is bigger 
than 127, then I must know what type it is. If it is ANSI, it is a 
single byte char. If it is utf8, it is a sub-codepoint which will be 
part of a codepoint.
If it is widestring, well yes, here my assumption breaks that 
RtlString[1] returns a byte.... ouch
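
To make the ambiguity concrete, here is a small sketch (again in Python, purely for illustration) that takes one value above 127 and reads it both ways. The helper name utf8_role is hypothetical and not part of any RTL; it only classifies a byte according to the UTF-8 encoding rules.

```python
def utf8_role(b: int) -> str:
    """Classify a single byte value (0..255) under the UTF-8 rules."""
    if b <= 0x7F:
        return "ascii"
    if b <= 0xBF:
        return "continuation"
    if b in (0xC0, 0xC1) or b >= 0xF5:
        return "invalid"       # these values never occur in valid UTF-8
    if b <= 0xDF:
        return "lead-2"        # first byte of a 2-byte sequence
    if b <= 0xEF:
        return "lead-3"        # first byte of a 3-byte sequence
    return "lead-4"            # 0xF0..0xF4

first = 0xC3  # what indexing the first byte might hand back

# Read as a single-byte (Latin-1) string: a complete character, U+00C3.
as_latin1 = bytes([first]).decode("latin-1")

# Read as UTF-8: only the start of a 2-byte sequence, not a character yet.
as_utf8 = utf8_role(first)

print(repr(as_latin1), as_utf8)
```

The value is identical in both readings; only knowledge of the string type says which one applies. For a 16-bit string type the element is not a byte at all, so neither reading fits.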

>  
>   
>> I can *not* do rtlString[1], as at the time of code writing I can not be 
>> aware what it means.
>>     
>
> You don't have to. You carry it around as long as you can, and when you
> don't can, you assign it to your type of choice and bite the penalty.
>   
As I said in another mail: every programmer starts as a beginner. And 
for many of those this is the last thing to think about.

Best Regards
Martin