[fpc-devel] assign constant text to widestring

Thu Oct 23 09:46:48 CEST 2008

>
> AFAIK the compiler reads the source as non-utf8 (latin or some 8 bit 
> encoding). This leads to other things too, like identifiers cannot 
> contain utf8.
This was discussed in the German Lazarus Forum. Here I got a funny 
result: when I right-click in the Lazarus code editor I see that the file 
encoding is set to UTF8. There is no option for UCS2 or UTF16. But when I 
look at the source file with a hex editor I see that it uses a 16-bit 
encoding and starts with a BOM $FFFE, which (AFAIK) means a UCS2-coded 
file.
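
For illustration, here is a minimal Python sketch of how a tool could identify the encoding from such a leading BOM. It assumes the hex editor shows the bytes FF FE in that order on disk; the function name detect_bom is made up for this example and is not part of Lazarus or FPC.

```python
import codecs

# Known byte-order marks. The UTF-32 marks come first because the
# UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(head: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None


# Hypothetical file start: bytes FF FE followed by 16-bit little-endian text.
sample = b"\xff\xfe" + "program".encode("utf-16-le")
print(detect_bom(sample))
```

A file without any BOM yields None here, in which case a reader has to fall back to some default 8-bit encoding.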
>
> The String within the quotes is a byte sequence to the compiler. And 
> the compiler does not know it to be utf8. From your description I take 
> it the compiler does translate those 3 "8bit chars" into some 16bit 
> chars (correctness of this translation based on the 8bit source 
> encoding is another question)
Should the compiler not take care of this automatically, thus creating 
the same result independently of the encoding the source file is saved in 
(of course taking the current code page into account if it's ANSI instead 
of one of the Unicode flavors: UTF8, UCS2, UTF16, UCS4)?
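
To make the effect described in the quoted text concrete, here is a small Python sketch. The literal is hypothetical (the actual string from my test is not shown here), and the byte-wise widening is only an assumption about what the compiler does; it is modelled as a Latin-1 decode.

```python
# Hypothetical literal: one character that takes three bytes in UTF-8.
text = "€"

utf8_bytes = text.encode("utf-8")  # the bytes E2 82 AC

# Assumed behaviour: every 8-bit char is widened to its own 16-bit char,
# which is the same as decoding the bytes as Latin-1.
widened = utf8_bytes.decode("latin-1")

# Encoding-aware behaviour: the bytes are decoded as UTF-8 first.
decoded = utf8_bytes.decode("utf-8")


def code_points(s):
    """Format each character of s as a U+XXXX code point."""
    return [f"U+{ord(c):04X}" for c in s]


print(code_points(widened))  # three unrelated characters
print(code_points(decoded))  # the single intended character
```

The first result has three characters where the source had one, which is the kind of garbling I would expect the compiler to avoid on its own.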
>
> Lazarus uses UTF8 for everything, it will save your string as utf8. If 
> your string was kept as ansistring, the compiler would treat it as 
> bytes, and pass it through, so any code wanting to see the utf8 would 
> be fine.
(Disregarding the encoding of the source file) I do know what happens and 
why the compiler produces the result I see, but I feel that it is 
highly undesirable that the compiler is not aware of the type UTF8String 
and thus can't do a decent conversion when assigning it from and to a 
WideString.
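
What I mean by a decent conversion can be sketched in Python as follows. The two helper names are invented for this example; in Free Pascal the corresponding manual step would presumably go through RTL functions such as UTF8Decode and UTF8Encode.

```python
import struct


def utf8_to_utf16_units(data: bytes):
    """Decode UTF-8 bytes and return the resulting UTF-16 code units."""
    raw = data.decode("utf-8").encode("utf-16-le")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


def utf16_units_to_utf8(units):
    """Turn a list of UTF-16 code units back into UTF-8 bytes."""
    raw = struct.pack(f"<{len(units)}H", *units)
    return raw.decode("utf-16-le").encode("utf-8")


# Hypothetical sample text, stored as UTF-8 bytes.
source = "ä€".encode("utf-8")
units = utf8_to_utf16_units(source)
print([hex(u) for u in units])

# The conversion is lossless in both directions.
print(utf16_units_to_utf8(units) == source)
```

Characters outside the Basic Multilingual Plane come out as two code units (a surrogate pair), so the number of 16-bit units is not always the number of characters.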
>
> You can try and tell Lazarus to save your file as latin1. As long as 
> all your strings fit into latin1, this may work; IF and only if the 
> compiler will translate the latin1 into correct Widechars.
I'll try this, but this does not solve the underlying problem. And of 
course it is highly desirable to use Lazarus in a decent Unicode mode.
>
> It will not work for anything not in utf8. AFAIK Lazarus currently 
> doesn't save in ucs2 (or any 16 bit encoding). 
It does in my test. But nobody seems to know how/why. Nonetheless, the 
compiler should be aware of the coding the source is stored in (it does 
have a correct BOM in my test) and act appropriately.
> But even if Lazarus did, since the compiler wants 8bit encoding, your 
> whole source would be broken.
In my test the compiler does compile the UCS2 coded file correctly 
(seemingly as designed: the text constants get an "internal" type 
UTF8String, thus they are somehow converted UCS2->UTF8). Funny: the lpr 
file is stored as ASCII (might of course be either of ANSI, Latin1 or UTF8).
>
> Not much help, I know. Maybe some one else does have more ideas / 
> knowledge.
In fact I don't need help, but I want to discuss what I think is a real 
problem, if not an error.

Inserting the string type conversions manually makes my test project 
work just fine. But this is a real PITA: WideString projects ported from 
Delphi don't work, and new users who try to use WideStrings (IMHO _the_ 
decent method to do Unicode programming) will be taken aback, as they of 
course expect that using WideStrings in their code would "just work".

-Michael