[fpc-devel] UTF-8 string literals

Fri May 5 14:57:01 CEST 2017

On Fri, 5 May 2017 14:30:32 +0200 (CEST)
Michael Van Canneyt <michael at freepascal.org> wrote:

>[...]
> > AFAIK FPC stores UTF-8 string literals (-Fcutf8) as widestrings
> > instead of UTF8String. Please correct me if I'm wrong.
> >
> > This has several side effects:
> >
> > 1. When using a character outside the BMP, FPC stops with:
> > Error: UTF-8 code greater than 65535 found
> > For example:
> > const Eyes = '👀';
> >
> > 2. Assigning a UTF-8 literal to a UTF8String requires a
> > widestringmanager.
> > For example, non-ISO-8859-1 chars are mangled:
> > var u: UTF8String = 'äöüالعَرَبِيَّة';
> 
> I assume you mean UTF-16 literal ?

Huh? The codepage is UTF-8, the string type is UTF-8, and FPC stores
UCS-2. Why do you ask about UTF-16?
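
To make the UCS-2 versus UTF-16 distinction concrete, here is a minimal
sketch of the standard surrogate-pair arithmetic for U+1F440 (the '👀'
from point 1). The program and variable names are only illustrative and
are not taken from the compiler sources. A code point above 65535 needs
two 16-bit units in UTF-16, which a single UCS-2 unit cannot hold.

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

const
  CodePoint = $1F440; // U+1F440 EYES, outside the BMP (> $FFFF)

var
  V, HighSur, LowSur: Integer;

begin
  // UTF-16 encodes code points above $FFFF as two 16-bit units.
  V := CodePoint - $10000;            // 20-bit value to split
  HighSur := $D800 + (V shr 10);      // upper 10 bits
  LowSur  := $DC00 + (V and $3FF);    // lower 10 bits

  WriteLn('High surrogate: $', IntToHex(HighSur, 4)); // $D83D
  WriteLn('Low surrogate:  $', IntToHex(LowSur, 4));  // $DC40
end.
```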

> > 3. PChar on a string literal does not work as expected. You get the
> > bytes of a widestring instead.  
> 
> You should weigh the advantages you outline here against the disadvantages of
> no longer knowing how string literals will be encoded.

At the moment string literals are encoded in two different ways
depending on codepage, character values, literal format and probably
some more attributes I don't know. That often confuses users. IMO it
would be less confusing if a matching string type and codepage worked
without conversion.
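
A quick way to see which encoding actually ends up in a string is to
dump its bytes. The following is a minimal sketch, assuming the source
file is saved as UTF-8. The program name is made up, and cwstring is
pulled in on Unix only to install a widestring manager. What it prints
may depend on the FPC version and the options in effect, so I give no
expected output. Correct UTF-8 for 'äöü' would be the six bytes
C3 A4 C3 B6 C3 BC.

```pascal
program Utf8LiteralBytes;
{$mode objfpc}{$H+}
{$codepage utf8} // per-file equivalent of -Fcutf8

uses
  {$ifdef unix}cwstring,{$endif} // widestring manager on Unix
  SysUtils;

var
  u: UTF8String;
  i: Integer;

begin
  u := 'äöü';
  // Print every byte of the string in hex.
  for i := 1 to Length(u) do
    Write(IntToHex(Ord(u[i]), 2), ' ');
  WriteLn;
end.
```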

> It means e.g. the resource string tables will have entries that are UTF16 encoded
> or entries that are UTF8 encoded, depending on the unit they come from. 
> This is highly undesirable.

Ehm, the compiled-in resourcestring tables are AnsiString.
AFAIK you need the UTF-8 system codepage to use the full UTF-16
capabilities of the rsj files.

> By forcing everything UTF16 we ensure delphi compatibility (yes it does matter) 
> and we also ensure a uniform set of string tables.

It will be a glorious day when this is accomplished.
But some people can't wait that long.

Mattias