[fpc-devel] String handling in trunk (was utf8 in 2.6.0)

Mon Jan 7 17:56:25 CET 2013

Once upon a time, on 01/07/2013 05:05 PM to be precise, Tomas Hajny said:
> On Mon, January 7, 2013 14:19, Michael Schnell wrote:
>> On 01/07/2013 02:01 PM, Tomas Hajny wrote:
>>> (also just my understanding of what Jonas wrote)
>> I feel you are wrong. The string does not know about the code its
>> content is to be interpreted in (other than with Delphi XE).
> Sorry, your way of quoting makes it difficult for others to react.
>
> I freely admit that I may be wrong, but I don't understand what you meant
> with your comment and thus I don't understand in what way I am wrong
> in your view. The compiler obviously knows how the constant is used within
> the source code and thus it may proceed accordingly (i.e. either convert
> it to some 8-bit encoding at compile time if UTF-16 code constant appears
> in the source, or keep it in UTF-16 if assigned to a UnicodeString
> constant).
Yep, the compiler does know how the constant is used and how it is
defined (how else could it generate working code?), but I don't see how
it could do something with it if it is assigned to another type of
string (by type I mean `one-byte versus two-byte`). The compiler can't
know for sure what you mean; it can do at least these things:
  - Copy data without translating, so a one-char two-byte string becomes
a two-char one-byte string; a three-char one-byte string would become a
three-char two-byte string; and then there is a paradox: should a
three-char two-byte string become a six-char one-byte string? ==> this
is probably not how it is done
  - Translate the meanings of the characters of the string, but here the
compiler needs to know what encoding they are in and what encoding
the string is wanted in (which it doesn't, I believe; the $codepage
directive is only used for the encoding of the characters in the unit
itself) ==> I think this isn't a possibility either
  - Copy the data byte per byte, but then a one-byte string containing
an odd number of chars needs padding + there are issues with
endianness here ==> Not really an option, no?
  - Truncate every value of a two-byte string to convert it to a
one-byte string; the other way around would put each character of the
one-byte string as one in the two-byte string ==> Solves the first
paradox, but introduces loss of data

==> All the above options (except the translation, that is) ignore the
escape character(s) of the string, so you won't get the data you want.
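
To make the four options above a bit more tangible, here is a minimal
sketch in Python (not Free Pascal, and explicitly not a description of
what the compiler does). All function names are made up for this
illustration, a "two-byte string" is modelled as a list of 16-bit code
units, and little-endian layout is an assumption wherever byte order
matters.

```python
import struct

def reinterpret_wide_as_narrow(units):
    """Option 1: keep the raw bytes, only the element size changes.
    Assumes little-endian storage of the 16-bit units."""
    return struct.pack("<%dH" % len(units), *units)

def translate_wide_to_narrow(units, target_encoding):
    """Option 2: translate the characters. This only works because the
    caller states both encodings (UTF-16 in, target_encoding out)."""
    text = struct.pack("<%dH" % len(units), *units).decode("utf-16-le")
    return text.encode(target_encoding)

def pack_narrow_into_wide(data, byteorder):
    """Option 3: copy byte per byte into 16-bit units. Odd lengths need
    a padding byte, and the result depends on the byte order."""
    if len(data) % 2:
        data += b"\x00"
    prefix = "<" if byteorder == "little" else ">"
    return list(struct.unpack(prefix + "%dH" % (len(data) // 2), data))

def truncate_wide(units):
    """Option 4: keep only the low byte of each 16-bit unit (lossy)."""
    return bytes(u & 0xFF for u in units)

def widen_narrow(data):
    """Option 4, other direction: each byte becomes one 16-bit unit."""
    return list(data)

wide = [0x0041, 0x20AC]  # "A" followed by the euro sign, as UTF-16 units

print(reinterpret_wide_as_narrow(wide))          # 4 one-byte "chars"
print(translate_wide_to_narrow(wide, "cp1252"))  # 2 bytes in cp1252
print(translate_wide_to_narrow(wide, "utf-8"))   # 4 bytes in UTF-8
print(pack_narrow_into_wide(b"abc", "little"))   # padded, little-endian
print(pack_narrow_into_wide(b"abc", "big"))      # padded, big-endian
print(truncate_wide(wide))                       # euro sign is lost
print(widen_narrow(b"abc"))
```

Only the translating variant preserves the text, and it is also the only
one that has to be told which encodings are involved.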

IMO it (typecasting a one-byte string to a two-byte string) can't be
done without human intervention. Look at it this way:
typecasting a thread handle to an integer makes no sense either:
  - They are both related (a thread handle is definitely a number, even
if it is a pointer)
  - But putting one in the other makes no sense at all: what does
`comparing whether a thread id is less than zero` mean? On the other
hand, `comparing whether an integer is less than zero` has a distinct
meaning.
  - The sizes may be different (say an integer 16 bits long and a
thread handle 64 bits long), how do you put one in the other? Sum the
bytes together? Multiply them? Take the 16-bit CRC of the handle?

This is IMO the same with a one-byte char and a two-byte char:
 - They both represent letters/words/...
 - But they are not the same and cannot be typecast without extra
knowledge.

This last point is also valid for my example above: you could put all
thread ids you know of in a lookup-table and put the index in that
lookup-table in the 16-bit integer. Fixed. Same goes for our strings: if
you know one is UTF-8 and you want to convert it to UTF-16, it can be
done without error, but without this extra knowledge the conversion
can't give you decisive results.
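
A small Python sketch of that last claim, again only as an illustration
and not as anything FPC does: the helper name `to_utf16` is invented
here, and UTF-16 is assumed to be little-endian. The same bytes give
different UTF-16 results depending on which source encoding you assume,
and only the correct assumption round-trips.

```python
def to_utf16(raw, source_encoding):
    """Convert a one-byte string to UTF-16LE, given its encoding.
    Without source_encoding there is nothing sensible to do."""
    return raw.decode(source_encoding).encode("utf-16-le")

raw = "\u00e9".encode("utf-8")  # e-acute stored as two bytes in UTF-8

right = to_utf16(raw, "utf-8")    # one 16-bit unit: the intended letter
wrong = to_utf16(raw, "latin-1")  # two 16-bit units: two other letters

print(right)
print(wrong)

# With the correct knowledge the conversion is lossless in both directions.
assert right.decode("utf-16-le").encode("utf-8") == raw
```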

Just a few points I think bear some potential to contemplate over a cup
of $c0ffee ;-)

-- 
Ewald