[fpc-pascal] Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

Tony Whyman tony.whyman at mccallumwhyman.com
Thu Sep 5 10:28:44 CEST 2019


Apologies: when I typed "FTP" below I meant "FPC" :( I'm currently 
drowning in acronym soup.

On 05/09/2019 09:24, Tony Whyman wrote:
>
> A few points:
>
> 1. IMHO: This is currently a Windows problem, where the console buffer 
> is UCS2. On Linux (and probably everywhere else) it is UTF8 - to be verified.
>
> 2. The following Microsoft blog post is interesting background on 
> where MS are going with this:
>
> https://devblogs.microsoft.com/commandline/windows-command-line-unicode-and-utf-8-output-text-buffer/
>
> 3. The current Windows API includes "SetConsoleCP" which should (I 
> haven't tested this) allow you to set transliteration to UTF-8 when 
> you call the Windows ReadConsoleInput API function. This seems to 
> imply that FTP can be a consistent UTF8 environment even when the 
> Windows Console buffer is UCS2.
>
> 4. Because console input is buffered, you probably cannot have a 
> situation where readln changes the console code page to fit the type 
> (unicode or ansistring) of the variable that you are reading into.
>
> 5. You could change FTP so that under Windows, the console is always 
> read using UCS2 with transliteration to ansistring happening when 
> required and depending on the type of the variable that you are 
> reading into. I think that is probably what you are asking for under 
> Windows:
>
> - The console code page is always UCS2.
>
> - Console input is read into unicodestrings in native mode.
>
> - Console input is read into ansistrings with transliteration from 
> UCS2 after the input buffer has been parsed.
>
> - Conversion to integers, floats, etc. occurs after transliteration to 
> ansistring in order to avoid too many changes to the RTL.
>
> - Under other OSs, Console input is UTF8 (or a supported ANSI code 
> page). Transliteration to unicodestrings occurs after parsing the 
> input buffer.
>
> 6. The question is: is it worth having a different approach to Windows 
> when Windows allows you to set the console input buffer to UTF8 and 
> hence have a common input environment for all OSs?
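
To make the ordering in point 5 concrete, here is a rough sketch of the idea in Python rather than Pascal. It is only a model of the data flow, not FPC RTL code: the buffer contents and the helper names (RAW, read_lines_unicode, read_lines_ansi, read_int) are made up for illustration, and UTF-8 is assumed as the target ansistring code page.

```python
# Hypothetical model of the scheme in point 5; this is not FPC RTL code.
# RAW stands in for a console input buffer holding UCS2/UTF-16LE data.
RAW = "héllo\r\n42\r\n".encode("utf-16-le")


def read_lines_unicode(buf):
    """'unicodestring' path: parse the buffer in its native UTF-16 form."""
    return buf.decode("utf-16-le").splitlines()


def read_lines_ansi(buf, codepage="utf-8"):
    """'ansistring' path: parse first, then transliterate each line.

    The code page is an assumption; any supported ANSI code page could
    be substituted here.
    """
    return [line.encode(codepage) for line in read_lines_unicode(buf)]


def read_int(buf, index):
    """Numeric conversion runs on the already transliterated bytes."""
    return int(read_lines_ansi(buf)[index])


if __name__ == "__main__":
    print(read_lines_unicode(RAW))
    print(read_lines_ansi(RAW))
    print(read_int(RAW, 1))
```

The point of the sketch is only that line splitting happens once, on the native buffer, and everything else (ansistring, integers) is derived from the parsed line afterwards.
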
>
> On 05/09/2019 08:00, LacaK wrote:
>> Is there consensus on (or demand for) such a solution, and would a patch 
>> in this direction be accepted?
>> If yes, we must agree on implementation details, and IMO someone 
>> must also check what the situation is in Delphi ... because I guess that 
>> if Delphi does not support this, FPC will not diverge either?
>> Question 1: should "SetTextCodePage(CP_UTF16)" and 
>> "SetTextCodePage(CP_UTF16BE)" be supported?
>> Question 2: is this supported in Delphi?
>> If the answer to both questions is YES, then I will file a bug report as 
>> a starting point.
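
On the CP_UTF16 versus CP_UTF16BE question, the practical difference is only the byte order of each 16-bit unit, which a text file normally announces with a byte order mark. A small Python sketch of that distinction follows. The function name detect_utf16 is hypothetical, and falling back to little-endian when no BOM is present is an assumption for the example, not something the RTL is known to do.

```python
import codecs


def detect_utf16(data):
    """Return (encoding, payload) for a UTF-16 byte string.

    A leading byte order mark selects little- or big-endian decoding.
    Without a BOM this sketch assumes little-endian, which is only a
    guess about what a sensible default would be.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le", data[len(codecs.BOM_UTF16_LE):]
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be", data[len(codecs.BOM_UTF16_BE):]
    return "utf-16-le", data


if __name__ == "__main__":
    sample = codecs.BOM_UTF16_BE + "Pascal\n".encode("utf-16-be")
    encoding, payload = detect_utf16(sample)
    print(encoding, repr(payload.decode(encoding)))
```
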