[fpc-pascal] Console Encoding in Windows (Local VS. UTF8)
shiruba at galapagossoftware.com
Mon Jul 29 07:36:01 CEST 2013
2013/7/9 Michael Schnell <mschnell at lumino.de>
> On 07/09/2013 11:02 AM, Noah Silva wrote:
>> I convert it to UTF8 before displaying it....
> Not a good idea.
Well if the console is UTF8....
> The FPC developers are right now busy implementing the new Delphi Strings.
> This _could_ mean that the application programmer can use any encoding
> (such as multiple different ANSI byte-codes, UTF-8, UTF-16, ...), but in
> fact to be 100% Delphi compatible ("nothing less, nothing more"), it seems
> that only UTF-16 will gain full decent support (e.g. class inheritance, in
> TStringList, the Lazarus user API etc.)
Using UTF16 for internal string handling is a sensible option. That's
what the Windows API does and what e.g. SAP's ABAP does. On the other
hand, UTF8 is very common in files and when transferring string data via
things like HTTP/XML, so it has to be fully supported either way. OS X
uses UTF8 as the "local" encoding (so you never have to worry there, except
in Java), and apparently so does GTK2. Put simply, UTF8 can be used
everywhere that ANSI encodings were used, because it is ASCII compatible
when only ASCII is used. UTF16 and UTF32 can't easily be substituted
because they contain "padding" (zero) bytes for normal ASCII characters.
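To make the "padding bytes" point concrete, here is a small illustration
(in Python rather than Pascal, only because it makes the raw byte layouts
easy to inspect; the sample string is made up):

```python
# For pure-ASCII text, the UTF-8 bytes are identical to the ASCII bytes,
# so UTF-8 can be dropped into code that expects ANSI/ASCII data.
# UTF-16 and UTF-32 insert zero ("padding") bytes for each ASCII
# character, which would confuse byte-oriented ANSI code.

text = "Hello"

ascii_bytes = text.encode("ascii")
utf8_bytes  = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")   # little-endian, no BOM
utf32_bytes = text.encode("utf-32-le")

# UTF-8 of pure ASCII is byte-for-byte the same as ASCII.
assert utf8_bytes == ascii_bytes

# UTF-16/UTF-32 pad each ASCII character with zero bytes.
print(list(utf16_bytes))  # [72, 0, 101, 0, 108, 0, 108, 0, 111, 0]
print(len(utf8_bytes), len(utf16_bytes), len(utf32_bytes))  # 5 10 20
```

The same experiment in Free Pascal would compare a plain AnsiString
against the bytes of a UnicodeString.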
For things like WideString, this is fine. The reason UTF16/UTF32 are
popular for in-memory variables is that it's easy to achieve higher
performance. For example, with UTF32, there is no need to "decode" the
string to find out what character you are on, how many bytes a certain
character takes up, and so on. If you want the 4th character of a string,
you simply go to the 4th 4-byte array element and retrieve the value. UTF16
works the same way if you are dealing with the 99% of characters in use
that take only two bytes (but this leads to bugs because people usually
don't handle the remaining 1% properly). So you gain processing speed and
code simplicity by using UTF16 or UTF32. You lose out on memory if you are
dealing with ASCII data only - which is no big deal in most cases. UTF8
saves memory and is more ASCII compatible, but requires more
decoding/encoding. Since they represent the same character set, it doesn't
really matter in the end - there is a trade-off either way. If you are
doing mainly I/O, UTF8 is convenient, if you are doing heavy duty string
processing, UTF32 is convenient. Either way, not supporting one or the
other is simply not an option if you want to be able to write Unicode
compliant programs. One can be the "main" encoding used by internal
routines, and on most operating systems that is UTF16.
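The trade-off described above can be sketched in a few lines (again in
Python for brevity; the sample string is a made-up example mixing BMP and
non-BMP characters):

```python
# Indexing trade-off: fixed-width UTF-32 vs variable-width UTF-16/UTF-8.

s = "café𝄞"   # 'é' takes 2 bytes in UTF-8; '𝄞' (U+1D11E) needs a
               # UTF-16 surrogate pair and 4 bytes in UTF-8.

# UTF-32: every character is exactly one 4-byte unit, so the Nth
# character is just the Nth array element -- no decoding needed.
utf32 = s.encode("utf-32-le")
assert len(utf32) == 4 * len(s)
fourth = utf32[3 * 4:4 * 4].decode("utf-32-le")  # 4th char by arithmetic
assert fourth == "é"

# UTF-16: characters outside the BMP (the "1%") take two 16-bit code
# units, so code-unit index and character index no longer agree.
utf16 = s.encode("utf-16-le")
assert len(utf16) // 2 == 6   # 6 code units for only 5 characters

# UTF-8: most compact here, but characters are 1-4 bytes wide, so
# finding the Nth character means scanning from the start.
assert len(s.encode("utf-8")) == 9   # 1+1+1+2+4 bytes
```

Code that assumes one UTF-16 unit per character works on the first four
characters of this string and silently breaks on the fifth, which is
exactly the kind of bug mentioned above.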
If FPC used UTF8 for everything and automatically converted it, then calls
to the Win32 API would be slowed down by this, so it makes sense to use
UTF16 on Windows, but... then again if GTK2 requires UTF8 then you have the
same (but opposite) problem there. Lazarus also supports more platforms
than just Windows, so we have to think a little more broadly than Delphi does.
Another interesting point is that I have heard no end of complaints about
Delphi's Unicode strategy, so while we want to be compatible, perhaps we
should consider how to do that while avoiding some of the same pitfalls.
To address your specific points:
1. The Lazarus user API already supports UTF8, as far as I know.
2. TStringList could easily support both, but as long as the conversion
to/from other code pages (especially UTF8) is automatic, I wouldn't mind.
3. Not sure what class inheritance has to do with this...
p.s.: Unicode is an area that I know a lot about, so if anyone working on
the RTL needs help testing, let me know...
> fpc-pascal maillist - fpc-pascal at lists.freepascal.org