[fpc-devel] utf8 in 2.6.0

Sat Jan 5 14:25:33 CET 2013

Martin Schreiber wrote:

> but I fear we can not use that information for development with Free Pascal 
> because:
> "
> The string is represented internally as a Unicode string encoded as UTF-16. 
> Characters in the Basic Multilingual Plane (BMP) take 2 bytes, and characters 
> not in the BMP require 4 bytes.
> "
> and
> "
> A control string is a sequence of one or more control characters, each of 
> which consists of the # symbol followed by an unsigned integer constant from 
> 0 to 65,535 (decimal) or from $0 to $FFFF (hexadecimal) in UTF-16 encoding, 
> and denotes the character corresponding to a specified code value. Each 
> integer is represented internally by 2 bytes in the string. This is useful 
> for representing control characters and multibyte characters.
> "
> which seems to be different from Free Pascal.

Where do you see a difference? The strings are stored in UTF-16, which 
is the same in every implementation, regardless of (possibly) different, 
more verbose descriptions.
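
The 2-byte versus 4-byte rule in the quoted text is a property of UTF-16 
itself, not of one compiler. A minimal sketch, written in Python only 
because it is quick to run (the helper name utf16_units is made up for 
this illustration and is not part of Delphi or FPC):

```python
def utf16_units(ch: str) -> list:
    """Return the 16-bit UTF-16 code units that encode one character."""
    raw = ch.encode("utf-16-le")  # little-endian, no BOM
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

bmp = utf16_units("\u00e4")          # a-umlaut, inside the BMP: one unit, 2 bytes
astral = utf16_units("\U0001F600")   # outside the BMP: surrogate pair, 4 bytes

print([hex(u) for u in bmp])      # ['0xe4']
print([hex(u) for u in astral])   # ['0xd83d', '0xde00']
```

Any implementation that stores UTF-16 has to produce these same code 
units, however its documentation words it.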

The new AnsiStrings are safe against misinterpretation, because they 
contain their encoding (codepage). Every char in an AnsiString now can 
be converted to one and only one Unicode char, when needed. This is not 
true for single AnsiChars, which still have no codepage information 
stored with them (in both Delphi and FPC). I strongly discourage the use 
of Char variables in all flavours (Char, AnsiChar, WideChar), because 
none of them can hold every possible Unicode character. Only a 32-bit 
type such as UCS4Char can hold every possible character code, without 
the risk of codepage misinterpretation.
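
To illustrate why a single AnsiChar is ambiguous while a string with a 
codepage is not, here is a rough sketch in Python. The (codepage, bytes) 
tuple and the to_unicode helper are hypothetical stand-ins for a 
codepage-aware AnsiString, not the real runtime representation:

```python
raw = bytes([0xE4])  # one "AnsiChar": a byte with no codepage attached

# The same byte means different characters under different codepages.
as_cp1252 = raw.decode("cp1252")  # Western European
as_cp1251 = raw.decode("cp1251")  # Cyrillic
print(as_cp1252, as_cp1251)       # ä д

# A string that carries its codepage converts to exactly one Unicode string.
tagged = (1252, raw)  # hypothetical (codepage, bytes) pair

def to_unicode(tagged_string):
    codepage, data = tagged_string
    return data.decode("cp%d" % codepage)

print(to_unicode(tagged))  # ä
```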

The discussion mostly covers the compilation of string *literals*, like 
'äöü' or #123, for which every compiler tries to find the best 
interpretation and internal representation. FPC has a $codepage 
directive, which tells the compiler that *all* literals in this unit 
shall be treated as strings of that codepage. This is essential for 
files stored as Ansi, which have no information about the codepage of 
the contained single-byte characters. Files stored in UTF-8 encoding, 
with a UTF-8 BOM at their beginning, are safe against misinterpretation.
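
The difference between the two kinds of source file can be sketched as 
follows. This is Python, and decode_source is an invented helper that 
only mimics the decision described above; it is an assumption for 
illustration, not how the FPC scanner is actually written:

```python
import codecs

def decode_source(raw: bytes, declared_codepage: str) -> str:
    """Decode source bytes: a UTF-8 BOM wins, otherwise trust the declared codepage."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    return raw.decode(declared_codepage)

literal = "'\u00e4\u00f6\u00fc'"                        # the literal 'äöü'
ansi_file = literal.encode("cp1252")                    # Ansi file, no codepage info
utf8_file = codecs.BOM_UTF8 + literal.encode("utf-8")   # UTF-8 file with BOM

print(decode_source(ansi_file, "cp1252"))  # 'äöü'  (correct codepage declared)
print(decode_source(ansi_file, "cp1251"))  # 'дць'  (wrong codepage declared)
print(decode_source(utf8_file, "cp1251"))  # 'äöü'  (BOM makes the declaration irrelevant)
```

The Ansi file is only read correctly when the declared codepage matches 
the one it was saved in, which is what the $codepage directive is for.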

When the compiler translates the source code string literals, it can 
store them either as Unicode (UTF-16) or as AnsiString of the given 
$codepage, depending on the *use* of the literal (type of the string 
variable in an assignment). This will reduce the number of implicit 
string conversions at runtime.
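
As a toy model of that choice (Python again; emit_literal and its 
"unicode"/"ansi" targets are hypothetical names, and the default 
codepage cp1252 is just an assumption for the example):

```python
def emit_literal(text: str, target: str, codepage: str = "cp1252") -> bytes:
    """Return the bytes a hypothetical compiler would store for a literal.

    target is "unicode" for a UTF-16 string variable, or "ansi" for an
    AnsiString variable in the unit's codepage.
    """
    if target == "unicode":
        return text.encode("utf-16-le")
    if target == "ansi":
        return text.encode(codepage)
    raise ValueError("unknown target: " + target)

# The same source literal, stored in the form its destination needs.
print(emit_literal("\u00e4\u00f6\u00fc", "unicode").hex())  # e400f600fc00
print(emit_literal("\u00e4\u00f6\u00fc", "ansi").hex())     # e4f6fc
```

Because the stored form already matches the destination variable, no 
conversion is needed when the assignment executes.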

[Please correct me if I'm wrong]
DoDi