[fpc-devel] utf8 in 2.6.0

Martin Schreiber mse00000 at gmail.com
Sat Jan 5 10:29:38 CET 2013


On Tuesday 01 January 2013 18:36:01 Martin Schreiber wrote:
>
> So #n, #nn, #nnn, #nnnn or #nnnnn always means a Unicode code point, which
> is converted at compile time to an 8-bit character sequence according to
> {$codepage} and stored in a cpstrnew with the code page of {$codepage} if
> assigned to a cpstrnew variable?
> And if the constant is assigned to a UnicodeString variable, the Unicode
> code points are converted to a UTF-16 16-bit character sequence at
> compile time, regardless of whether they contain code points > 255?
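
To make the question concrete, here is a small sketch in Python, used only as a neutral calculator. It says nothing about what the FPC compiler actually does. It shows how one and the same code point ends up as different stored sequences depending on the target encoding. The helper name encode_codepoint and the choice of cp1252 as the example code page are assumptions for illustration.

```python
def encode_codepoint(codepoint: int, encoding: str) -> bytes:
    """Encode a single Unicode code point in the given target encoding."""
    return chr(codepoint).encode(encoding)


# Code point 228 (U+00E4, a-umlaut): one byte in cp1252, two in UTF-8,
# one 16-bit unit in UTF-16.
a_cp1252 = encode_codepoint(228, "cp1252")
a_utf8 = encode_codepoint(228, "utf-8")
a_utf16 = encode_codepoint(228, "utf-16-le")

# Code point 8364 (U+20AC, euro sign) is above 255, yet cp1252 still
# stores it as a single byte.
euro_cp1252 = encode_codepoint(8364, "cp1252")
euro_utf8 = encode_codepoint(8364, "utf-8")
euro_utf16 = encode_codepoint(8364, "utf-16-le")

if __name__ == "__main__":
    results = {
        "228 cp1252": a_cp1252,
        "228 utf-8": a_utf8,
        "228 utf-16-le": a_utf16,
        "8364 cp1252": euro_cp1252,
        "8364 utf-8": euro_utf8,
        "8364 utf-16-le": euro_utf16,
    }
    for label, data in results.items():
        print(f"{label}: {data.hex(' ')}")
```

Whether FPC performs the equivalent conversion at compile time for #n constants is exactly what the question above asks.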

Hans-Peter Diettrich wrote:
>> That string contains codepoints > #255 and hence is a unicodestring
>> rather than a single byte string. No conversion at either compile or
>> run time happens, and the codepage directive has no influence.
> Does this really mean that, when the codes > #255 are removed, the
> remaining codes have a different meaning?

I'm confused. Are these stupid questions? Or should we no longer bring up the
topic of strings in FPC? But then how are we supposed to build serious
programs with FPC?

> Does somebody have a link to the Embarcadero documentation on this matter?
> I assume FPC trunk does exactly the same as Delphi XE3 with strings?
>
I found this:
http://docwiki.embarcadero.com/RADStudio/XE3/en/String_Types
http://docwiki.embarcadero.com/RADStudio/XE3/en/Internal_Data_Formats
http://docwiki.embarcadero.com/RADStudio/XE3/en/Fundamental_Syntactic_Elements

but I fear we cannot use that information for development with Free Pascal
because:
"
The string is represented internally as a Unicode string encoded as UTF-16. 
Characters in the Basic Multilingual Plane (BMP) take 2 bytes, and characters 
not in the BMP require 4 bytes.
"
and
"
A control string is a sequence of one or more control characters, each of 
which consists of the # symbol followed by an unsigned integer constant from 
0 to 65,535 (decimal) or from $0 to $FFFF (hexadecimal) in UTF-16 encoding, 
and denotes the character corresponding to a specified code value. Each 
integer is represented internally by 2 bytes in the string. This is useful 
for representing control characters and multibyte characters.
"
both of which seem to differ from what Free Pascal does.
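
As a sketch of what the two quoted Delphi passages imply (Python again, purely as a calculator, not FPC or Delphi output): a BMP character is one 16-bit unit, so one #n is enough, while a character outside the BMP needs a surrogate pair and therefore two #n values. The function names utf16_units and as_control_string are hypothetical helpers made up for this illustration.

```python
from typing import List


def utf16_units(codepoint: int) -> List[int]:
    """Return the UTF-16 code units (one or two 16-bit values) for a code point."""
    if not 0 <= codepoint <= 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if codepoint < 0x10000:
        # Basic Multilingual Plane: a single 16-bit unit (2 bytes).
        return [codepoint]
    # Outside the BMP: split the 20-bit offset into a surrogate pair (4 bytes).
    offset = codepoint - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def as_control_string(codepoint: int) -> str:
    """Spell the code units in the '#$XXXX' notation of the quoted passage."""
    return "".join(f"#${unit:04X}" for unit in utf16_units(codepoint))


if __name__ == "__main__":
    for cp in (0x20AC, 0x1F600):
        byte_count = 2 * len(utf16_units(cp))
        print(f"U+{cp:04X}: {as_control_string(cp)} ({byte_count} bytes)")
```

Under the quoted Delphi rule, each #n is limited to 0..65535 and always occupies 2 bytes, which is where it appears to differ from the code-page-dependent handling discussed for Free Pascal.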

Martin


