[fpc-devel] Unicode and UTF8String
Martin Friebe
fpc at mfriebe.de
Mon Dec 1 16:27:24 CET 2008
Marco van de Voort wrote:
> In our previous episode, Martin Friebe said:
>
>> Of course they are still there, to be used in the few parts of your
>> code, where you specialize on whatever string type you deal with.
>> But otherwise, using RTLString IMHO will abandon this part of pascal
>> syntax.
>>
> It removes ASCII legacy. I don't see you complaining about the fact that
> char is not 8 bit anymore, and that that abandons that part of the pascal
> syntax.
>
It does not abandon the syntax. It only adds to its meaning (*adds*;
any existing meaning is unaltered).
I can still do foo[1] for *any* type of string (well, yes, even
RTLString, but see below):
- If the string happens to be an old ASCII string, that still works as
it always has.
- If the string happens to be any Unicode type, that is still the same
syntax, but with a new meaning.
The new meaning does not break anything, because it only applies to
new types.
It is usable too, because I know I am dealing with codepoints or
sub-codepoints, and I know how they look and how to identify them.
The introduction of RTLString is fine. I do say it is a good thing.
RTLString does not interfere with the above. In fact, even for RTLString
the syntax foo[1] does exist. It is just not useful. If I treat it as a
UTF-8 sub-codepoint, I can be wrong. If I treat it as ASCII, I can be
wrong. If I treat it as UTF-16, I can be wrong.
My argument was not against RTLString. However, it was my understanding
that RTL functions will "enforce" RTLString: that they will only exist
for RTLString, and they will *not* exist for other string types.
That I would call enforcing RTLString, because of the penalties on using
other string types.
I acknowledge that if the end result of calling the RTL function is an
OS call, the conversion penalty is always there. But not every RTL
function ends up in an OS call.
>> I admit that the problem (and that has been discussed more than
>> enough) starts with UTF8String (yes, even with UTF-16 strings). But in
>> this case those functions gained a new, but predictable, meaning. I can
>> do utf8string[1], and I can use the result. Only I have to be aware
>> what it means.
>>
>
> Yes. As widestring[1] also requires interpretation. That's unicode.
>
See above: yes, it requires interpretation. But it allows me to do so.
I cannot see how I can interpret RTLString[1]. If the result is bigger
than 127, then I must know what type it is. If it is ANSI, it is a
single-byte char. If it is UTF-8, it is a sub-codepoint which will be
part of a codepoint.
If it is a WideString, well, yes, that breaks my assumption that
RTLString[1] returns a byte... ouch.
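To make the point concrete, here is a minimal sketch of how the "first
element" of the same text changes with the encoding. It is written in
Python rather than Pascal purely for illustration; the helper
first_code_unit and the choice of cp1252 as the "ANSI" code page are
assumptions of this sketch, not anything from the RTL. Note that Python
indexes from 0 where Pascal strings index from 1.

```python
# Hypothetical illustration only: the thread is about Free Pascal strings.
TEXT = "\u00e9"  # 'e' with acute accent, a single code point (U+00E9)


def first_code_unit(text, encoding):
    """Return (value, width_in_bytes) of the first code unit of `text`
    once it has been encoded with `encoding`."""
    # UTF-16 uses 2-byte code units; the other encodings used here use 1 byte.
    width = 2 if encoding.startswith("utf-16") else 1
    data = text.encode(encoding)
    return int.from_bytes(data[:width], "little"), width


ansi = first_code_unit(TEXT, "cp1252")      # assumed "ANSI" code page
utf8 = first_code_unit(TEXT, "utf-8")
utf16 = first_code_unit(TEXT, "utf-16-le")

for name, (value, width) in (("ANSI", ansi), ("UTF-8", utf8), ("UTF-16", utf16)):
    print(f"{name}: first unit = 0x{value:02X}, {width} byte(s)")
```

In the ANSI case the first unit (0xE9) is the whole character. In the
UTF-8 case the first unit (0xC3) is only the lead byte of a two-byte
sequence. In the UTF-16 case the unit is not a byte at all. Code that
does not know which of the three it holds cannot interpret the value.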
>
>
>> I can *not* do rtlString[1], as at the time of code writing I can not be
>> aware what it means.
>>
>
> You don't have to. You carry it around as long as you can, and when you
> can't, you assign it to your type of choice and bite the penalty.
>
As I said in another mail: every programmer starts as a beginner, and
for many of them this is the last thing they think about.
Best Regards
Martin