[fpc-devel] simple UTF tests

Marco van de Voort marcov at stack.nl
Thu Jan 5 12:32:45 CET 2012


In our previous episode, Michael Schnell said:
> With Lazarus on Linux, I did some simple tests with UTF strings.
> 
> I found that the length of an "AnsiString(CP_UTF16)" is given in terms 
> of bytes and not of Words. Is this like it should ?

Yes. AFAIK that is not a sane combination, but it is Delphi compatible.
 
> I found that pchar(s8) with an UTF-8 string works as expected, giving a 
> pointer to the UTF-8 encoded byte array.
> Anyway: is it obvious, that the encoding of pchar is UTF-8 ? Is this 
> portable ?

pchar should give access to the raw data of the default string type (be it
still 8-bit as in FPC, or 16-bit as in Delphi).
 
> p16 = pchar(s16) with an UTF-16 gives a pointer to the first byte of the 
> word array, so (with ASCII text), the second byte is zero, thus a 
> C-String length 1. Is this like it should ?

Yes. This is not sane code (even if you only want e.g. the lower byte, it
is not endian safe), since s16 is currently not the default string type.
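
A minimal sketch of why the C-string length comes out as 1. This is not
the original test code: it uses UnicodeString as a stand-in for the UTF-16
string, the names are made up for illustration, and it assumes a
little-endian machine.

```pascal
program pchar16demo;
{$mode objfpc}{$H+}
var
  s16: UnicodeString;   { stand-in for the UTF-16 string (assumption) }
  p8: PAnsiChar;
begin
  s16 := 'abc';
  { View the 16-bit data through an 8-bit pointer, which is what
    pchar(s16) amounts to while pchar is 8-bit. }
  p8 := PAnsiChar(Pointer(s16));
  WriteLn('Length(s16) = ', Length(s16));   { 3 code units }
  { 'a' is stored as $61 $00 on little-endian, so the byte scan stops
    after one byte. On big-endian the first byte is already zero. }
  WriteLn('StrLen(p8)  = ', StrLen(p8));
end.
```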

> Of course re-assigning p16 to an UTF-16 string does not reproduce the 
> original string.
> What encoding is to be supposed for a pchar ?

pchars provide access to memory with the granularity of the default string
type, whatever that is (8-bit or 16-bit) and in whatever encoding it is
stored.

When a pchar is converted to something else, the default system encoding
for the corresponding default string type is probably used.

To force 8 or 16 bits, use pansichar or pwidechar respectively. This is
Delphi compatible.
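
A rough sketch of the explicit pointer types, so the element size does not
depend on what pchar currently maps to. The variable names are made up,
and UnicodeString is assumed here as the 16-bit string type.

```pascal
program explicitptr;
{$mode objfpc}{$H+}
var
  s8: UTF8String;
  s16: UnicodeString;   { assumed 16-bit string type }
  p8: PAnsiChar;
  p16: PWideChar;
begin
  s8 := 'abc';
  s16 := 'abc';
  p8 := PAnsiChar(s8);     { always 8-bit code units }
  p16 := PWideChar(s16);   { always 16-bit code units }
  WriteLn('SizeOf(p8^)  = ', SizeOf(p8^));    { 1 }
  WriteLn('SizeOf(p16^) = ', SizeOf(p16^));   { 2 }
  { Indexing a PWideChar steps in whole 16-bit units, so this is 'b'. }
  WriteLn('Ord(p16[1])  = ', Ord(p16[1]));    { 98 }
end.
```

Because each pointer type matches the element size of its string, no byte
of a 16-bit code unit is mistaken for a terminator.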

> The Debugger does not show UTF-16-Strings correctly (it shows the same 
> result as pchar() ). Is this just a Lazarus problem, or does FPC need to 
> provide additional support for this ?

No idea. Both are possible.


