[fpc-devel] Unicode and UTF8String
florian at freepascal.org
Mon Dec 1 16:36:23 CET 2008
Martin Friebe wrote:
> Marco van de Voort wrote:
>> In our previous episode, Martin Friebe said:
>>> Of course they are still there, to be used in the few parts of your
>>> code, where you specialize on whatever string type you deal with.
>>> But otherwise, using RTLString IMHO will abandon this part of
>>> pascal syntax.
>> It removes ASCII legacy. I don't see you complaining about the fact that
>> char is not 8 bit anymore, and that that abandons that part of the pascal
> It does not abandon the syntax. It only adds to its meaning (*adds*;
> any existing meaning is unaltered).
> I can still do: foo for *any* type of string. (well yes even
> RTLstring, but see below)
> - If string happens to be an old ascii string, that still works as it
> always has
> - If string happens to be any unicode => that is still the same syntax,
> but with a new meaning.
> The new meaning does not break anything, because it only applies to new
> It is usable too, because I know, I am dealing with codepoints, or sub
> code points. And I know how they look, and how to identify them
> The introduction of RTLString is fine. I do say it is a good thing.
> RTLString does not interfere with the above. In fact even for RTLstring
> the syntax foo does exist. It is just not useful. If I treat it as a
> utf8 sub code point, I can be wrong. If I treat it as ascii, I can be
> wrong. If I treat it as UTF16, I can be wrong.
> My argument was not against RTLString. However it was my understanding
> that RTL functions will "enforce" RTLString. That they will only exist
> for RTLString, and they will *not* exist for other string types.
> That I would call enforcing RTLString, because of penalties on using
> other string types.
> I acknowledge that if the end result of calling the RTL function is an
> OS call, the conversion penalty is always there. But not every RTL
> function ends up in an OS call.
>>> I admit that the problem (which has been discussed more than
>>> enough) starts with UTF8string (yes, even with utf16 string). But
>>> in this case those functions gained a new, but predictable, meaning. I
>>> can do utf8string, and I can use the result. I only have to be
>>> aware of what it means.
>> Yes. As widestring also requires interpretation. That's unicode.
> See above: Yes, it requires interpretation. But it allows me to do so.
> I cannot see how I can interpret RtlString. If the result is bigger
> than 128, then I must know what type it is. If it is ANSI, it is a
> single byte char. If it is utf8, it is a sub-codepoint which will be
> part of a codepoint.
> If it is widestring, well yes, here breaks my assumption that
> RtlString returns a byte.... ouch
I see this as a theoretical consideration. Please give a real world (!)
code example where this causes a problem.
If you assign the result of an rtl function to an rtlstring, this means
either you don't care about the type of rtlstring, or the knowledge that
its type is rtlchar is enough for you. If you assign it to an
ansistring, widestring, or whatever, you know what you get.
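
The ambiguity Martin describes in the quoted text can be sketched as
follows. This is only an illustration, not code from the RTL or from this
thread: it is written in Python rather than Pascal, the helper name
first_unit is hypothetical, and "ANSI" is assumed to mean the Windows-1252
code page. It takes the first element of the same text (the euro sign,
U+20AC) stored in three encodings.

```python
# Illustration only: the same text stored in three encodings.
# Assumptions: "ANSI" means Windows-1252; first_unit is a hypothetical helper.
TEXT = "\u20ac"  # euro sign, U+20AC


def first_unit(text: str, encoding: str, unit_size: int) -> int:
    """Return the first code unit of `text` in `encoding` as an integer."""
    raw = text.encode(encoding)
    return int.from_bytes(raw[:unit_size], "little")


ansi_unit = first_unit(TEXT, "cp1252", 1)      # a whole character in one byte
utf8_unit = first_unit(TEXT, "utf-8", 1)       # lead byte of a 3-byte sequence
utf16_unit = first_unit(TEXT, "utf-16-le", 2)  # one 16-bit code unit

print(hex(ansi_unit), hex(utf8_unit), hex(utf16_unit))
```

The three values differ, and the UTF-16 one does not fit in a byte, so a
value above 127 taken from a string of unknown encoding cannot be
interpreted without knowing the concrete string type.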