[fpc-devel] String handling in trunk (was utf8 in 2.6.0)

Mon Jan 7 22:17:30 CET 2013

On Mon, Jan 7, 2013 at 6:05 PM, Mark Morgan Lloyd <
markMLl.fpc-devel at telemetry.co.uk> wrote:

> Tomas Hajny wrote:
>
>> On Mon, January 7, 2013 13:28, Ewald wrote:
>>
>>> Once upon a time, on 01/07/2013 12:39 PM to be precise, Michael Schnell
>>> said:
>>>
>>>> On 01/05/2013 12:28 PM, Jonas Maebe wrote:
>>>>
>>>>> Using whatever #xx#xx or #xx#xx#xx sequence represents the UTF-8
>>>>> encoding of that character.
>>>>>
>>>> Sorry, I can't follow. Does #xx not just define a numerical
>>>> representation of an 8 bit entity ?
>>>>
>>>> The interpretation in any code might be done later by any code that
>>>> digests the string.
>>>>
>>>> Am I wrong ?
>>>>
>>> I *think* Jonas is trying to say that if you want the character `Ǿ` in a
>>> string you would either type
>>> - 'Ǿ' or
>>> - #$C7#$BE if you want to keep the source free of encoding specific
>>> characters
>>>
>>  .
>>  .
>>
>> ...or
>> - #$01FE and then the whole string becomes a Unicode string which is
>> either kept that way (if it is assigned to a UnicodeString constant), or
>> it is converted to some 8-bit encoding at compile time (if it is assigned
>> to an 8-bit constant/variable like ansistring)
>>
>> (also just my understanding of what Jonas wrote)
>>
>
> That's how I read it as well. In which case, is #A3 16-bit Unicode
> (representing the UK £ Sterling) or malformed UTF-8 (should be #c2#a3)?
>

As I understand it, how #A3 is interpreted depends on the $codepage
directive of the source file. So if the programmer sets $codepage to match
the encoding used in the editor (be it UTF-8 or some other encoding), the
compiler will also 'understand' that string correctly.
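
To make the dependence on the declared encoding concrete, here is a small
sketch in Python (not Free Pascal). It only shows what the single byte A3
means under a few encodings; Python's codecs stand in for the compiler's
codepage handling, and the names RAW and is_valid_utf8 are made up for this
illustration. It says nothing about what the compiler actually does.

```python
# Illustration only: Python codecs stand in for a compiler's source-codepage
# handling. None of these names come from Free Pascal.
RAW = b"\xa3"  # the single byte written as #A3 in this thread


def is_valid_utf8(data):
    """Return True if data decodes as UTF-8 without error."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


as_latin1 = RAW.decode("latin-1")      # U+00A3, the pound sign
as_cp437 = RAW.decode("cp437")         # U+00FA, a different character
pound_utf8 = "\u00a3".encode("utf-8")  # two bytes: C2 A3

print(hex(ord(as_latin1)), hex(ord(as_cp437)), pound_utf8.hex())
print("lone A3 is valid UTF-8:", is_valid_utf8(RAW))
```

The same byte gives two different characters under Latin-1 and CP437, and on
its own it is not a complete UTF-8 sequence, which is why the declared
codepage has to match what the editor actually wrote.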

If the programmer never uses UnicodeString and always uses the codepage in
which the source code was written, everything will work fine: #A3 will stay
whatever it is in that specific encoding.

On the other hand, if a string containing #A3 needs to be converted to
UnicodeString, the compiler will either: a) convert it correctly to
UnicodeString if the encoding used is UTF-8, or b) call a system-specific
function to convert the string to an array of WideChars (in which case the
correctness of the program depends on support for that specific encoding on
the target system).
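
A rough sketch of that conversion step, again in Python rather than Free
Pascal: the bytes are decoded under an assumed codepage and re-encoded as
16-bit code units. The helper to_utf16_units is hypothetical and only mimics
the idea of a codepage-aware conversion; it is not the RTL routine or any
system API.

```python
import struct


def to_utf16_units(raw, codepage):
    """Decode raw bytes under the given codepage, return UTF-16 code units.

    Hypothetical helper for illustration; it is not a Free Pascal routine.
    """
    text = raw.decode(codepage)
    encoded = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(encoded) // 2), encoded))


# Bytes C2 A3 declared as UTF-8: one code unit, U+00A3.
utf8_ok = to_utf16_units(b"\xc2\xa3", "utf-8")

# Byte A3 declared as Latin-1: also one code unit, U+00A3.
latin1_ok = to_utf16_units(b"\xa3", "latin-1")

# Bytes C2 A3 wrongly declared as Latin-1: two code units, U+00C2 U+00A3.
mismatch = to_utf16_units(b"\xc2\xa3", "latin-1")

print([hex(u) for u in utf8_ok])
print([hex(u) for u in latin1_ok])
print([hex(u) for u in mismatch])
```

The last case shows what goes wrong when the declared codepage does not match
the bytes in the file: the conversion succeeds but produces the wrong text.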