[fpc-devel] Unicode and UTF8String

Mon Dec 1 16:36:23 CET 2008

Martin Friebe wrote:
> Marco van de Voort wrote:
>> In our previous episode, Martin Friebe said:
>>  
>>> Of course they are still there, to be used in the few parts of your
>>> code, where you specialize on whatever string type you deal with.
>>> But otherwise, using  RTLString  IMHO will abandon this part of
>>> pascal syntax.
>>>     
>> It removes ASCII legacy. I don't see you complaining about the fact that
>> char is not 8 bit anymore, and that that abandons that part of the pascal
>> syntax.
>>   
> It does not abandon the syntax.  It only adds to its meaning (*adds*;
> any existing meaning is unaltered).
> 
> I can still do:  foo[1]  for *any* type of string. (well yes even
> RTLstring, but see below)
> - If string happens to be an old ascii string, that still works as it
> always has
> - If string happens to be any unicode => that is still the same syntax,
> but with a new meaning.
>  The new meaning does not break anything, because it only applies to new
> types.
>  It is usable too, because I know I am dealing with codepoints or sub
> code points. And I know how they look, and how to identify them.
> 
> The introduction of RTLString is fine. I do say it is a good thing.
> RTLString does not interfere with the above. In fact even for RTLstring
> the syntax  foo[1]  does exist. It is just not useful. If I treat it as
> a utf8 sub code point, I can be wrong. If I treat it as ascii, I can be
> wrong. If I treat it as UTF16, I can be wrong.
> 
> My argument was not against RTLString. However it was my understanding
> that RTL functions will "enforce" RTLString. That they will only exist
> for RTLString, and they will *not* exist for other string types.
> That I would call enforcing RTLString, because of penalties on using
> other string types.
> 
> I acknowledge that if the end result of calling the RTL function is an
> OS call, the conversion penalty is always there. But not every RTL
> function ends up in an OS call.
> 
>>> I admit that the problem (and that has been discussed more than
>>> enough) starts with UTF8string (yes, even with utf16 string). But
>>> in this case those functions gained a new, but predictable, meaning. I
>>> can do utf8string[1], and I can use the result. I only have to be
>>> aware of what it means.
>>>     
>>
>> Yes. As widestring[1] also requires interpretation. That's unicode.
>>   
> See above: Yes, it requires interpretation. But it allows me to do so.
> 
> I cannot see how I can interpret RtlString[1]. If the result is bigger
> than 127, then I must know what type it is. If it is ANSI, it is a
> single byte char. If it is utf8, it is a sub-codepoint which will be
> part of a codepoint.
> If it is widestring, well yes, here my assumption that
> RtlString[1] returns a byte breaks.... ouch
> 

I see this as a theoretical consideration. Please give a real world (!)
code example where this causes a problem.

If you assign the result of an rtl function to an rtlstring, this means
either you don't care about the type of rtlstring[1], or the knowledge
that its type is rtlchar is enough for you. If you assign it to an
ansistring, widestring, or whatever, you know what you get.
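
To illustrate what "you know what you get" means once the encoding is
fixed, here is a minimal sketch. It is written in Python rather than
Pascal, purely to show the code units, and is not FPC RTL code. The
helper name first_code_unit is hypothetical, and cp1252 is only an
assumed example of an ANSI code page. The same character gives a
different first element under each encoding, so the index is only
interpretable once the concrete string type is known:

```python
def first_code_unit(text: str, encoding: str) -> int:
    """Return the first code unit of `text` as stored in `encoding`.

    Hypothetical helper for illustration only: one byte for the 8-bit
    encodings, one 16-bit unit for UTF-16.
    """
    data = text.encode(encoding)
    if encoding == "utf-16-le":
        # A UTF-16 code unit is two bytes wide.
        return int.from_bytes(data[:2], "little")
    return data[0]


euro = "\u20ac"  # EURO SIGN

ansi_unit = first_code_unit(euro, "cp1252")      # a complete single-byte char
utf8_unit = first_code_unit(euro, "utf-8")       # lead byte of a 3-byte sequence
utf16_unit = first_code_unit(euro, "utf-16-le")  # a full 16-bit code unit

print(hex(ansi_unit), hex(utf8_unit), hex(utf16_unit))
```

For plain ASCII text all three values coincide, which is why the
difference only shows up for values above 127.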