[fpc-devel] UTF-8 string literals
Florian Klaempfl
florian at freepascal.org
Sun May 7 10:27:58 CEST 2017
On 05.05.2017 at 13:53, Mattias Gaertner wrote:
> Hi,
>
> AFAIK FPC stores UTF-8 string literals (-Fcutf8)
-Fc only tells the compiler the code page of the source file; it
says nothing about how string constants are to be encoded.
> as widestrings
> instead of UTF8String. Please correct me if I'm wrong.
>
> This has several side effects:
>
> 1. When using a character outside BMP FPC stops with:
> Error: UTF-8 code greater than 65535 found
> For example:
> const Eyes = '👀';
>
> 2. Assigning a UTF-8 literal to an UTF8String requires a
> widestringmanager.
> For example non ISO-8859-1 chars are mangled:
> var u: UTF8String = 'äöüالعَرَبِيَّة';
>
> 3. PChar on a string literal does not work as expected. You get the
> bytes of a widestring instead.
Well, it depends on what you expect :)
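For illustration of point 1 (a sketch added here, not code from the
thread): a single UTF-16 code unit can hold values only up to 65535
($FFFF), so a code point outside the BMP such as U+1F440 has to be
split into a surrogate pair. The program below does that arithmetic by
hand. The program and variable names are made up for this example, and
it makes no claim about how the compiler itself treats such literals.

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}

const
  // U+1F440 EYES, the character from the quoted example
  EyesCodePoint = $1F440;

var
  Offset: Cardinal;
  HighSurrogate, LowSurrogate: Word;
begin
  // Code points above $FFFF do not fit into one 16-bit code unit.
  // Standard UTF-16 encoding: subtract $10000, then split the
  // remaining 20 bits into two 10-bit halves.
  Offset := EyesCodePoint - $10000;
  HighSurrogate := Word($D800 or (Offset shr 10));
  LowSurrogate := Word($DC00 or (Offset and $3FF));

  WriteLn('high surrogate: $', HexStr(HighSurrogate, 4)); // $D83D
  WriteLn('low surrogate:  $', HexStr(LowSurrogate, 4));  // $DC40
end.
```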
>
>
> What would happen if FPC would be extended to store UTF-8
> literals as UTF8String?
> What are the disadvantages?
1. Backward compatibility. Due to its Windows origins and history, the
default Unicode encoding in FPC is UTF-16; FPC also uses UTF-16
internally everywhere.
2. What would happen then the other way around, when casting the string
constant to a PUnicodeChar (which a lot of Delphi code probably does)?
3. Personally, I still think UTF-16 is the "native" Unicode type: all
important APIs use UTF-16; for me, UTF-8 is a hack.
What we could do, of course, is have the compiler do the conversion at
run time when a constant is assigned to a string with explicit UTF-8
encoding. But that complicates things even more. It does not solve the
PChar problem, but I think when somebody uses Unicode source files and
PChar, he is on his own :)
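As a rough sketch of doing such a conversion explicitly in user code
(this is not the compiler change discussed above): the program below
assumes the source file is saved as UTF-8 and compiled with FPC 3.x,
and calls UTF8Encode from the System unit on a UnicodeString. The
program and variable names are hypothetical.

```pascal
program ExplicitUtf8;
{$mode objfpc}{$H+}
{$codepage utf8}  // source file encoding, same role as -Fcutf8

var
  Wide: UnicodeString;
  Utf8: UTF8String;
  i: Integer;
begin
  // The literal ends up as UTF-16 in the UnicodeString variable.
  Wide := 'äöü';

  // Explicit conversion from UTF-16 to UTF-8.
  Utf8 := UTF8Encode(Wide);

  // Dump the resulting bytes. If the literal was read as UTF-8,
  // this should show two bytes per character: C3 A4 C3 B6 C3 BC.
  for i := 1 to Length(Utf8) do
    Write(HexStr(Ord(Utf8[i]), 2), ' ');
  WriteLn;
end.
```

Dumping the bytes rather than printing the string keeps the result
independent of the console code page.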
I think it would be nice if Michael (v. C.) prepared a section for the
docs, and we comment on it and help him improve it.