[fpc-pascal] Re: Widestrings length and character iteration

Wed May 9 00:23:46 CEST 2007

Daniël Mantione wrote:
 >
 > Op Mon, 7 May 2007, schreef Christos Chryssochoidis:
 >
 >> Daniël Mantione wrote:
 >>> Not possible, a widestring is UCS-2/UTF-16.
 >>  I defined a widestring with 7 characters (code points), and the length()
 >> function returned the value 15. Of the 7 code points of that widestring only
 >> one of them was greater than $07FF (the maximum code point which can be
 >> encoded in 2 bytes under UTF-8). When I changed that character with another
 >> one with code not greater than $07FF, length() returned value 14... I also
 >> printed the byte values of one of the widestring's widechars, and the values
 >> printed indicated UTF-8 encoding.
 >
 > Yes, the program output is utf-8 on OS-X, because this is the native
 > encoding for OS-X. However, widestrings are not utf-8. Can you show your
 > code?
 >
 > Daniël

OK, I figured out what happened. The source file was saved in UTF-8
encoding, but I hadn't put the compiler directive {$CODEPAGE UTF8} in
it. After adding this directive almost everything worked fine: length()
returned the right number of Unicode characters, and subscripting the
widestring returned the right character. But since widechar and
widestring are encoded, as you said, in UTF-16, while my Mac OS X
console uses UTF-8, I had to wrap the individual widechars or the whole
widestring in utf8encode() before writing them with write() for the
output to be displayed correctly.
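
For reference, here is a minimal sketch of the setup described above. It is
not my original program: the program name, the variable names and the sample
string (six Greek letters plus the euro sign, i.e. 7 code points of which
only one is above $07FF) are made up for illustration. The cwstring unit is
an assumption as well; it installs a widestring manager on Unix-like
systems and may not be needed with every compiler version.

```pascal
program WideLen;  // hypothetical name, for illustration only
{$mode objfpc}{$H+}
{$codepage utf8}  // tells the compiler that this source file is UTF-8

{$ifdef unix}
uses
  cwstring;       // assumption: widestring manager for Unix-like systems
{$endif}

var
  s, ch: WideString;
  i: Integer;
begin
  // 7 code points; only the euro sign is above $07FF
  s := 'αβγδεζ€';

  // With the directive this should report 7 UTF-16 code units.
  // Without it, the thread above reports 15 (the UTF-8 byte count).
  WriteLn(Length(s));

  // Subscripting yields one widechar at a time; convert it to UTF-8
  // before writing it to a UTF-8 console.
  for i := 1 to Length(s) do
  begin
    ch := s[i];
    WriteLn(UTF8Encode(ch));
  end;

  // The whole widestring can be converted in one call as well.
  WriteLn(UTF8Encode(s));
end.
```

Note that length() counts UTF-16 code units, so it equals the number of
code points only as long as every character is inside the Basic
Multilingual Plane, which is the case for this sample string.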

Thanks for your help,

Christos