[fpc-devel] Re: new 27 page document describing Unicode support in D2009
Luiz Americo Pereira Camara
pascalive at bol.com.br
Fri Nov 21 23:16:27 CET 2008
Graeme Geldenhuys wrote:
> On Fri, Nov 21, 2008 at 11:08 PM, Graeme Geldenhuys
> <graemeg.lists at gmail.com> wrote:
>
>> I thought you guys might find this interesting. It's a new 27 page
>> document describing Unicode support in D2009.
>>
>> http://dn.codegear.com/article/38980
>>
>
> Seeing that I don't own D2009 and only read about its Unicode support
> I found some of the information interesting - and it was things we
> argued about in this mailing list.
>
> For example:
>
> 1...
> Length() returns the bytes for UTF8String
> but Length() returns the elements (what we know as characters) for
> String or UTF16 strings.
>
No. Length for String returns the number of code units (the number of
WideChars, in the UnicodeString case). When there are surrogate pairs,
the number of code points (characters) differs from the number of code
units. See the excerpt:
"
A way to create a string with surrogate pairs is to use the
ConvertFromUtf32 function that returns a string with the surrogate pair
(two WideChar) in the proper circumstances, like the following:
var
  str1: string;
begin
  str1 := 'Surr. ' + ConvertFromUtf32($1D11E);
Now if you ask for the string length, you'll get 8, which is the number
of WideChar, but not the number of logical Unicode code points in the
string. If you print the string you get the proper effect (well, at
least Windows will generally show one square block as placeholder of
the surrogate pair, rather than two).
"
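To make the difference concrete, here is a minimal sketch (not taken from
the document) that counts code points by treating each well-formed
surrogate pair as a single character. CountCodePoints is a hypothetical
helper name, and the surrogate pair for U+1D11E is written out by hand
($D834, $DD1E) so the sketch does not depend on ConvertFromUtf32. It
assumes a compiler that has the UnicodeString type, such as D2009.

```pascal
program SurrogateDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

// Hypothetical helper: counts code points in a UTF-16 string.
// A high surrogate ($D800..$DBFF) immediately followed by a low
// surrogate ($DC00..$DFFF) is counted as one code point; every other
// code unit (including an unpaired surrogate) is counted on its own.
function CountCodePoints(const S: UnicodeString): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if (Ord(S[I]) >= $D800) and (Ord(S[I]) <= $DBFF) and
       (I < Length(S)) and
       (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
      Inc(I, 2)   // skip both halves of the pair
    else
      Inc(I);     // ordinary BMP code unit
    Inc(Result);
  end;
end;

var
  Str1: UnicodeString;
begin
  // U+1D11E encoded by hand as the UTF-16 pair $D834 $DD1E
  Str1 := 'Surr. ' + WideChar($D834) + WideChar($DD1E);
  WriteLn('Code units (Length): ', Length(Str1));
  WriteLn('Code points:         ', CountCodePoints(Str1));
end.
```

For this string Length reports 8 code units, as in the excerpt, while the
string holds only 7 code points, which is what the helper is meant to
count.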
> Length() also returns bytes for AnsiString.
>
> --------------------
> var
> str8: Utf8String;
> str16: string;
> begin
> str8 := 'Cantù';
> Memo1.Lines.Add ('UTF-8');
> Memo1.Lines.Add('Length: ' + IntToStr (Length (str8)));
> Memo1.Lines.Add('5: ' + IntToStr (Ord (str8[5])));
> Memo1.Lines.Add('6: ' + IntToStr (Ord (str8[6])));
> str16 := str8;
> Memo1.Lines.Add ('UTF-16');
> Memo1.Lines.Add('Length: ' + IntToStr (Length (str16)));
> Memo1.Lines.Add('5: ' + IntToStr (Ord (str16[5])));
> As you might expect, the str8 string has a length of 6 (meaning 6
> bytes), while the str16 string has a length of 5 (meaning 10 bytes,
> though). Notice that Length invariably returns the number of string
> elements, which in case of variable-length representations don't match
> the number of Unicode code points represented by the string. This is
> the output of the program:
> UTF-8
> Length: 6
> 5: 195
> 6: 185
> UTF-16
> Length: 5
> 5: 249
>
> --------------------
>
> 2... TStrings can now take an encoding parameter to specify how it
> should load or save files.
>
> -----------------------------
> STREAMING TSTRINGS
> The LoadFromFile and SaveToFile methods of the TStrings class can be
> called with an encoding. If you write a string list to a text file
> without providing a specific encoding, the class will use
> TEncoding.Default, which in turn uses the internal DefaultEncoding,
> determined on first use from the current Windows code page. In other
> words, if you save a file you'll get the same ANSI file as before.
> Of course, you can also easily force the file to a different format,
> for example the UTF-16 format:
>
> Memo1.Lines.SaveToFile('test.txt', TEncoding.Unicode);
> -----------------------------
>
>
> Anyway, there are many more interesting facts in this document. It is
> well worth reading to get a better understanding of Unicode.
>
>
> Regards,
> - Graeme -
>
>
> _______________________________________________
> fpGUI - a cross-platform Free Pascal GUI toolkit
> http://opensoft.homeip.net/fpgui/