[fpc-pascal] Re: I do not understand UTF8 in Windows

Tomas Hajny XHajT03 at mbox.vol.cz
Sat Feb 20 14:35:54 CET 2010


On Sat, February 20, 2010 01:15, JoshyFun wrote:
> Hello Tomas,
>
> Friday, February 19, 2010, 11:55:39 PM, you wrote:
>
> TH> No, this can't work that way, otherwise output of any accented
> TH> character in one of the Windows codepages would result in the same
> TH> error.
>
> Tested the "wrong" return of stdout:
>
> code page UTF8 - 65001 en Windows
> Length of string: 7
> camión -> Returned written: 6
>
> Source code:
> -------------------------------------
> uses classes,windows;
> var
>  s: ansistring;
>  OutputStream: TStream;
> Begin
>  Writeln('code page UTF8 - 65001 en Windows');
>  OutputStream := THandleStream.Create(GetStdHandle(STD_OUTPUT_HANDLE));
>  s:='cami'+#$C3+#$B3+'n'; //camión
>  writeln('Length of string: ',Length(s));
>  writeln(' -> Returned written: ',OutputStream.write(s[1],Length(s)));
>  OutputStream.free;
> End.

OK, this seems to be the problem. The underlying Win32 API (WriteFile) is
asked to write 7 bytes. However, those 7 bytes correspond to only 6
characters in UTF-8, and the Win32 API (apparently) returns the number of
written _characters_ rather than the number of written _bytes_. The
Windows implementation of do_write (an internal wrapper around the
platform-specific API for writing to a file) currently assumes that the
returned number is a byte count, just like the length passed in, and
returns it unchanged. That is fine for simple single-byte codepages, but
not for UTF-8. The System routine for file I/O then compares the number of
bytes requested with the number reported as written, and since they do not
match, the mismatch is interpreted as an I/O error.
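
To make the 7 versus 6 mismatch concrete, here is a minimal sketch that
counts both bytes and UTF-8 characters for the test string from the quoted
program. Utf8CharCount is a hypothetical helper written only for this
illustration (it is not an RTL routine), and it assumes the input is
well-formed UTF-8.

```pascal
program Utf8Count;
{$mode objfpc}{$H+}

{ Hypothetical helper: counts UTF-8 characters (code points) by skipping
  continuation bytes, which always have the bit pattern 10xxxxxx.
  Assumes the input is well-formed UTF-8. }
function Utf8CharCount(const s: AnsiString): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(s) do
    if (Ord(s[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  s: AnsiString;
begin
  s := 'cami' + #$C3 + #$B3 + 'n';            { same bytes as in the quoted test }
  WriteLn('bytes: ', Length(s));              { prints "bytes: 7" }
  WriteLn('characters: ', Utf8CharCount(s));  { prints "characters: 6" }
end.
```

The two-byte sequence $C3 $B3 encodes a single character, so any caller
that compares a byte count against a character count will see a
difference for this string.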

Please post a bug report about this. I guess that fixing it may require a
little more thought. One simple fix would be to change the Windows
implementation of do_write so that it only checks for an error value
returned by WriteFile and, if no error is indicated, returns the original
buffer length regardless of the count reported by WriteFile. However, the
number of _characters_ actually written may be useful in certain cases, so
I'm not sure whether it would be better to preserve it somehow and
possibly extend the implementations for other platforms to provide this
value too.
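
A rough sketch of the decision logic described above, kept independent of
the real RTL sources: WriteResult, apiOk, requested, reported and
charsWritten are all hypothetical names chosen for this illustration, and
returning -1 on failure is an assumption rather than what do_write
actually does.

```pascal
program DoWriteSketch;
{$mode objfpc}

{ Hypothetical helper, not the actual RTL code.
  apiOk        - whether WriteFile signalled success
  requested    - number of bytes that were passed to WriteFile
  reported     - count handed back by WriteFile (possibly characters)
  charsWritten - keeps the reported count so that it is not lost
  Returns the requested length on success and -1 on failure (assumed). }
function WriteResult(apiOk: Boolean; requested, reported: LongInt;
  out charsWritten: LongInt): LongInt;
begin
  charsWritten := reported;
  if apiOk then
    Result := requested
  else
    Result := -1;
end;

var
  kept: LongInt;
begin
  { Values from the quoted test: 7 bytes requested, 6 reported. }
  WriteLn(WriteResult(True, 7, 6, kept), ' ', kept);   { prints "7 6" }
end.
```

Passing the reported count back through a separate parameter is just one
way to preserve it; whether that is worth the change in the internal
interface is exactly the open question.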

Tomas




