[fpc-devel] Understanding literal strings

Florian Klaempfl florian at freepascal.org
Fri Dec 14 15:58:25 CET 2007


Felipe Monteiro de Carvalho wrote:
> Hello,
> 
> I am doing some tests and trying to find out what exactly happens with
> string constants. Suppose this source code:
> 
> var
>   AString: ansistring;
> begin
>   AString := 'A acentuação é necessária no Português';
> end;
> 
> The source is encoded in utf-8, but the compiler isn't told about that
> in any way:
> * There is no BOM marker
> * no {$codepage } directive
> 
> I took a look at compiler/scanner.pas and I see that clearly if we have
> {$codepage utf-8} the literal string will be stored as UTF-16 (using
> the charset map in compiler/cp8859_1.pas) in the executable and
> converted back at run-time if assigned to ansistring, using
> Wide2AnsiMoveProc.
> 
> And what if that doesn't happen? Will it also be encoded in UTF-16? Or
> will it just be left alone as a group of bytes?
> 
> Jonas once said that if there is nothing saying which encoding the
> source is, it will be treated as ISO 8859-1 and encoded as UTF-16, and

This applies only if you assign the constant to a widestring, or if it
includes chars with ord > 255, such as #555, or when using utf-8. The
compiler uses the given code page only for conversions from/to
widestring, no more, no less. If no widestring and no char > 255 is
involved, nothing happens at compile time. The compiler can't know the
encoding of ansistrings on the target system, so it obviously shouldn't
mess with ansistring constants unless they are assigned to widestrings,
which are encoding independent.
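
To make the ansistring case concrete, here is a minimal sketch. It
assumes the file is saved as UTF-8 without a BOM and compiled with no
code page directive or code page command-line option; the program name
is made up for illustration. Under those assumptions the bytes of the
constant should end up in the ansistring untouched, so the dump would be
expected to show the two UTF-8 bytes of the character (C3 A9) rather
than a single converted byte.

```pascal
program literal_bytes;
{$mode objfpc}{$H+}
(* Assumption: this file is UTF-8 without BOM and has no code page
   directive, so the compiler has no encoding information at all. *)
var
  S: AnsiString;
  I: Integer;
begin
  (* Plain ansistring constant: no widestring is involved. *)
  S := 'é';
  (* Dump the raw bytes that were stored for the constant. *)
  for I := 1 to Length(S) do
    Write(HexStr(Ord(S[I]), 2), ' ');
  WriteLn;
end.
```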

The BOM/code page directive simply allows (and nothing else!) you to
create widestrings using simple string constants even if you want to use
Chinese or whatever characters. Just tell the compiler what encoding to
use to convert the string constant to widestring.
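
A small sketch of that widestring case, again assuming the file is
saved as UTF-8; the program name is hypothetical and the directive is
spelled `utf8` here as an assumption about the accepted spelling. With
the directive in place the compiler can decode the constant at compile
time, so the widestring should hold one UTF-16 code unit for the
character instead of two raw bytes.

```pascal
program wide_literal;
{$mode objfpc}{$H+}
{$codepage utf8}
(* Assumption: this file really is UTF-8, matching the directive. *)
var
  W: WideString;
  I: Integer;
begin
  (* Constant assigned to a widestring: converted using the code page. *)
  W := 'é';
  (* Dump the UTF-16 code units stored in the widestring. *)
  for I := 1 to Length(W) do
    Write(HexStr(Ord(W[I]), 4), ' ');
  WriteLn;
end.
```

If the directive is removed, the compiler has to fall back to its
default assumption about the source encoding, and the code units will
generally not be what you intended.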

> then at run-time be converted when assigned to an ansistring to the
> current locale.
> 
> But Paul has a Russian Windows and he made an LCL application with
> utf-8 text and it ran OK, although I would suppose it shouldn't work,

Why?

> because the following conversions would occur:
> 
> iso 8859-1 ---> utf-16 ---> locale (maybe iso 8859-5???)

When? It works correctly because:
- the source code uses the same locale for ansistrings as the host
system (UTF-8)
- the compiler doesn't mess with the encoding because no widestrings are
involved

> 
> And as LCL expects utf-8 it will only work if the first and last
> locales on this double transformation are equal. Or if there is no
> conversion taking place.
> 

Conversion is only done at compile time if string constants are assigned
to widestrings; anything else would mean that the compiler is guessing.


