[fpc-devel] UTF-8 string literals

Sun May 7 10:27:58 CEST 2017

On 05.05.2017 at 13:53, Mattias Gaertner wrote:
> Hi,
> 
> AFAIK FPC stores UTF-8 string literals (-Fcutf8) 

-Fc only tells the compiler the code page (encoding) of the source file;
it says nothing about how string constants are to be encoded.
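
To illustrate the distinction with a small sketch (in Python rather than
Pascal, and only about byte layouts, not about what FPC actually
generates): the source code page decides how the bytes in the file are
read, and the storage encoding of the resulting constant is a separate
choice. The variable names are made up for the example.

```python
# Illustration only (Python, not Pascal): source encoding vs. storage encoding.
raw = b"\xc3\xa4"  # two bytes as they might sit in a source file

as_utf8 = raw.decode("utf-8")      # read with a UTF-8 source code page: one character
as_latin1 = raw.decode("latin-1")  # read with an ISO-8859-1 code page: two characters

print(len(as_utf8), len(as_latin1))  # 1 2

# How the decoded constant is stored afterwards is a separate decision.
stored_utf16 = as_utf8.encode("utf-16-le")
stored_utf8 = as_utf8.encode("utf-8")

print(stored_utf16.hex(" "))  # e4 00
print(stored_utf8.hex(" "))   # c3 a4
```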

> as widestrings
> instead of UTF8String. Please correct me if I'm wrong.
> 
> This has several side effects:
> 
> 1. When using a character outside BMP FPC stops with:
> Error: UTF-8 code greater than 65535 found
> For example:
> const Eyes = '👀';
> 
> 2. Assigning a UTF-8 literal to an UTF8String requires a
> widestringmanager.
> For example non ISO-8859-1 chars are mangled:
> var u: UTF8String = 'äöüالعَرَبِيَّة';
> 
> 3. PChar on a string literal does not work as expected. You get the
> bytes of a widestring instead.

Well, it depends on what you expect :)
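
As a rough sketch of what a byte-wise, zero-terminated view sees (Python,
not Pascal; the helper c_string is hypothetical and only mimics C-string
semantics, it is not how FPC implements PChar): UTF-16 data for plain
ASCII text has a zero byte after every character, so such a view ends
after the first one.

```python
# Illustration only (Python, not Pascal): the same text in two encodings.
text = "abc"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")


def c_string(data: bytes) -> bytes:
    """Hypothetical helper: return the bytes up to the first zero byte."""
    return data.split(b"\x00", 1)[0]


print(utf8_bytes.hex(" "))    # 61 62 63
print(utf16_bytes.hex(" "))   # 61 00 62 00 63 00
print(c_string(utf8_bytes))   # b'abc'
print(c_string(utf16_bytes))  # b'a'
```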

> 
> 
> What would happen if FPC would be extended to store UTF-8
> literals as UTF8String? 
> What are the disadvantages?

1. Backward compatibility. Due to its Windows origins and history, the
default Unicode encoding in FPC is UTF-16, and FPC also uses UTF-16
internally everywhere.

2. What would happen the other way around, then? When casting the string
constant to a PUnicodeChar (which a lot of Delphi code probably does)?

3. Personally, I still think UTF-16 is the "native" Unicode type: all
important APIs use UTF-16. For me, UTF-8 is a hack.
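
For reference on the UTF-16 side: a character outside the BMP, like the
one in the quoted example, has a code point above 65535 and therefore
does not fit into a single 16-bit code unit; UTF-16 represents it as a
surrogate pair. A minimal sketch of that arithmetic (Python, not Pascal;
it says nothing about how the FPC scanner handles such literals):

```python
# Illustration only (Python, not Pascal): surrogate pair for U+1F440.
eyes = "\U0001F440"
code_point = ord(eyes)

print(hex(code_point), code_point > 0xFFFF)  # 0x1f440 True

# Standard UTF-16 surrogate computation for code points above 0xFFFF.
offset = code_point - 0x10000
high = 0xD800 + (offset >> 10)
low = 0xDC00 + (offset & 0x3FF)

print(hex(high), hex(low))                # 0xd83d 0xdc40
print(eyes.encode("utf-16-le").hex(" "))  # 3d d8 40 dc
print(eyes.encode("utf-8").hex(" "))      # f0 9f 91 80
```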

What we could of course do is this: if a constant is assigned to a
string with explicit UTF-8 encoding, the compiler does the conversion
at run time. But that complicates things even more. It does not solve
the PChar problem either, but I think when somebody uses Unicode source
files and PChar, he is on his own :)

I think it would be nice if Michael (v. C.) prepared a section for the
docs, and we comment on it and help him improve it.