[fpc-devel] utf8 in 2.6.0
Hans-Peter Diettrich
DrDiettrich1 at aol.com
Sat Jan 5 14:25:33 CET 2013
Martin Schreiber wrote:
> but I fear we can not use that information for development with Free Pascal
> because:
> "
> The string is represented internally as a Unicode string encoded as UTF-16.
> Characters in the Basic Multilingual Plane (BMP) take 2 bytes, and characters
> not in the BMP require 4 bytes.
> "
> and
> "
> A control string is a sequence of one or more control characters, each of
> which consists of the # symbol followed by an unsigned integer constant from
> 0 to 65,535 (decimal) or from $0 to $FFFF (hexadecimal) in UTF-16 encoding,
> and denotes the character corresponding to a specified code value. Each
> integer is represented internally by 2 bytes in the string. This is useful
> for representing control characters and multibyte characters.
> "
> which seems to be different from Free Pascal.
Where do you see a difference? The strings are stored as UTF-16, which
is the same in every implementation, however verbosely each one happens
to describe it.
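The quoted size rule (2 bytes inside the BMP, 4 bytes outside it) is a property of UTF-16 itself, not of any one compiler. A minimal sketch in Python, used here only to illustrate the encoding; the helper name utf16_size is made up for this example and has nothing to do with FPC or Delphi:

```python
def utf16_size(text: str) -> int:
    """Return the number of bytes `text` occupies in UTF-16 (little endian, no BOM)."""
    return len(text.encode("utf-16-le"))


# Hypothetical sample characters, chosen only for illustration.
samples = [
    "A",           # U+0041, inside the BMP
    "\u00e4",      # U+00E4 (a with diaeresis), inside the BMP
    "\U0001d11e",  # U+1D11E (musical symbol G clef), outside the BMP
]

for ch in samples:
    print(f"U+{ord(ch):04X}: {utf16_size(ch)} bytes")
```

Any conforming UTF-16 encoder should give the same sizes, which is the point made above.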
The new AnsiStrings are safe against misinterpretation, because they
contain their encoding (codepage). Every char in an AnsiString now can
be converted to one and only one Unicode char, when needed. This is not
true for single AnsiChars, which still have no codepage information
stored with them (in both Delphi and FPC). I strongly discourage the use
of Char variables in all flavours (Char, AnsiChar, WideChar), because
these are incapable of holding all possible Unicode characters. Only
UnicodeChar or UCS4Char (if these exist) can hold all possible character
codes, without possible codepage misinterpretation.
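Both problems can be shown outside Pascal. The sketch below is Python, not FPC code, and the helper names decode_tagged and utf16_units are invented for the illustration. It assumes the codepages cp1252 and cp1251 as examples: one and the same byte means different characters depending on the codepage attached to it, and a character outside the BMP does not fit into a single 16-bit unit.

```python
def decode_tagged(raw: bytes, codepage: str) -> str:
    """Decode bytes using the codepage that travels with them."""
    return raw.decode(codepage)


def utf16_units(text: str) -> list[int]:
    """Split `text` into its 16-bit UTF-16 code units."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


raw = b"\xe4"  # a single "Ansi" byte without any codepage information

# With a codepage the byte maps to exactly one Unicode character.
as_western = decode_tagged(raw, "cp1252")   # U+00E4, Latin small a with diaeresis
as_cyrillic = decode_tagged(raw, "cp1251")  # U+0434, Cyrillic small de
print(f"cp1252: U+{ord(as_western):04X}, cp1251: U+{ord(as_cyrillic):04X}")

# A character outside the BMP needs two 16-bit units (a surrogate pair),
# so one 16-bit character variable cannot hold it.
clef_units = utf16_units("\U0001d11e")
print([f"{unit:04X}" for unit in clef_units])
```

A 32-bit character type has room for the full code point U+1D11E, which is why only such a type can hold every character.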
The discussion mostly covers the compilation of string *literals*, like
'äöü' or #123, for which every compiler tries to find the best
interpretation and internal representation. FPC has a $codepage
directive, which tells the compiler that *all* literals in this unit
shall be treated as strings of that codepage. This is essential for
files stored as Ansi, which have no information about the codepage of
the contained single-byte characters. Files stored with UTF-8 encoding,
and a UTF-8 BOM at their beginning, are safe against misinterpretation.
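The decision described here can be sketched as follows. This is Python and only a model of the idea, not the way FPC actually reads source files; decode_source and its fallback_codepage parameter are hypothetical names standing in for the compiler's reader and the $codepage setting.

```python
import codecs


def decode_source(raw: bytes, fallback_codepage: str) -> str:
    """Decode source bytes: a UTF-8 BOM wins, otherwise use the declared codepage."""
    if raw.startswith(codecs.BOM_UTF8):
        # The BOM identifies the encoding, so the fallback is not needed.
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    # No BOM: the bytes carry no encoding information of their own.
    return raw.decode(fallback_codepage)


literal = "s := '\u00e4\u00f6\u00fc';"  # the umlaut literal from the text

with_bom = codecs.BOM_UTF8 + literal.encode("utf-8")
ansi = literal.encode("cp1252")

print(decode_source(with_bom, "cp1251") == literal)  # BOM present, fallback ignored
print(decode_source(ansi, "cp1252") == literal)      # correct codepage declared
print(decode_source(ansi, "cp1251") == literal)      # wrong codepage declared
```

Only the last case goes wrong: the Ansi bytes are valid in both codepages, so nothing but the declared codepage can tell the reader which characters were meant.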
When the compiler translates the source code string literals, it can
store them either as Unicode (UTF-16) or as AnsiString of the given
$codepage, depending on the *use* of the literal (type of the string
variable in an assignment). This will reduce the number of implicit
string conversions at runtime.
[Please correct me if I'm wrong]
DoDi