[fpc-devel] Understanding literal strings

Felipe Monteiro de Carvalho felipemonteiro.carvalho at gmail.com
Fri Dec 14 15:37:47 CET 2007


Hello,

I am doing some tests and trying to find out what exactly happens with
string constants. Suppose this source code:

var
  AString: ansistring;
begin
  AString := 'A acentuação é necessária no Português';
end;

The source is encoded in utf-8, but the compiler isn't told about that
in any way:
* There is no BOM marker
* no {$codepage } directive

I took a look at compiler/scanner.pas and I see that clearly if we have
{$codepage utf-8} the literal string will be stored as UTF-16 (using
the charset map in compiler/cp8859_1.pas) in the executable and
converted back at run-time if assigned to ansistring, using
Wide2AnsiMoveProc.

And what if that doesn't happen? Will it also be encoded in UTF-16? Or
will it just be left alone as a group of bytes?

Jonas once said that if there is nothing saying which encoding the
source is, it will be treated as ISO 8859-1 and encoded as UTF-16, and
then at run-time be converted when assigned to an ansistring to the
current locale.

But Paul has a Russian Windows and he made an LCL application with
utf-8 text and it ran OK, although I would suppose it shouldn't work,
because the following conversions would occur:

iso 8859-1 ---> utf-16 ---> locale (maybe iso 8859-5???)

And as LCL expects utf-8 it will only work if the first and last
locales on this double transformation are equal. Or if there is no
conversion taking place.
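
To make the suspected chain concrete, here is a small Python sketch that
only simulates the byte-level conversions. It is not what the compiler
or the RTL actually does. The function name double_conversion and the
choice of iso-8859-5 as the Russian locale are assumptions for
illustration, and a Python str stands in for the UTF-16 intermediate.

```python
# Simulation of: source bytes --(assumed source charset)--> UTF-16 --> locale.
# Hypothetical helper; names and the iso-8859-5 locale are assumptions.

text = "A acentuação é necessária no Português"

# The bytes as they sit in the UTF-8 encoded source file.
source_bytes = text.encode("utf-8")


def double_conversion(raw: bytes, assumed_source: str, locale_cp: str) -> bytes:
    """Decode raw bytes as assumed_source, then re-encode to locale_cp."""
    wide = raw.decode(assumed_source)  # stands in for the UTF-16 literal
    # Characters missing from the locale code page become '?'.
    return wide.encode(locale_cp, errors="replace")


# Case 1: first and last charset are equal, so the bytes survive unchanged.
same_locale = double_conversion(source_bytes, "iso-8859-1", "iso-8859-1")

# Case 2: a different final locale loses the non-ASCII bytes.
russian_locale = double_conversion(source_bytes, "iso-8859-1", "iso-8859-5")

# Case 3: no conversion at all, the literal stays a plain group of bytes.
untouched = source_bytes

print(same_locale == source_bytes)     # round trip preserved the UTF-8 bytes
print(russian_locale == source_bytes)  # bytes were altered
print(untouched.decode("utf-8") == text)
```

If the real behavior matched case 2, the LCL would receive damaged
bytes instead of valid utf-8, which is why the working application on
the Russian system is surprising.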

Also, what exactly defines how string literals should be handled? Is
that compiler specific, or are we following Delphi? How is the
behavior defined?

thanks,
-- 
Felipe Monteiro de Carvalho
