[fpc-pascal] Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file
XHajT03 at hajny.biz
Fri Sep 6 15:22:36 CEST 2019
On 2019-09-06 07:24, LacaK wrote:
> From user POV we have this situation:
> - on one side there is input text file encoded UTF-16 (either LE or BE)
> - on other side there is FPC, where RTL procedures like AssignFile,
> SetTextCodePage, Reset, Read(Ln), Write(Ln) are available.
> My original intention was simply use call to existing procedure
> SetTextCodePage with parameter CP_UTF16, which in my opinion will
> simply signal to RTL, that input/output text file is/should be encoded
> using UTF16.
Yes, I believe that extending SetTextCodePage to support UTF-16
makes sense (with certain caveats, e.g. it would have to be called
before Rewrite when creating a new file, otherwise the BOM would
not be added to the beginning of the file). The other
question is what needs to happen within the text file record - as
mentioned in my other post, I'd prefer adding a new field specifying the
codepoint size rather than having to check for specific codepage values
in all code branches which would need to be created for handling the
Moreover, the case of opening a file is somewhat trickier, because the
file may have the encoding specified within the file itself. Would we
add code for reading the first bytes every time Reset is called for a
text file not associated with another device (console) and set the
fields in the text file record (possibly overriding an explicit setting
from SetTextCodePage)? Personally, I'd do so, but others may have a
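
To make the idea of "reading the first bytes" on Reset more concrete, here
is a minimal sketch of BOM detection in user code. It is only an
illustration and not existing RTL code: the type TDetectedEncoding and the
function DetectBOM are hypothetical names. The UTF-32 patterns are tested
before the UTF-16 ones because the UTF-32LE BOM (FF FE 00 00) starts with
the UTF-16LE BOM (FF FE).

```pascal
program BomSketch;
{$mode objfpc}{$H+}

type
  // Hypothetical result type for this sketch; not part of the RTL.
  TDetectedEncoding = (deUnknown, deUTF8, deUTF16LE, deUTF16BE,
                       deUTF32LE, deUTF32BE);

// Hypothetical helper: inspects the first bytes of a file and reports
// which BOM (if any) they form and how many bytes the BOM occupies.
function DetectBOM(const Buf: array of Byte;
                   out BomLen: Integer): TDetectedEncoding;
var
  N: Integer;
begin
  Result := deUnknown;
  BomLen := 0;
  N := Length(Buf);
  // UTF-32 first: its LE BOM begins with the same two bytes as UTF-16LE.
  if (N >= 4) and (Buf[0] = $FF) and (Buf[1] = $FE) and
     (Buf[2] = $00) and (Buf[3] = $00) then
  begin
    Result := deUTF32LE;
    BomLen := 4;
  end
  else if (N >= 4) and (Buf[0] = $00) and (Buf[1] = $00) and
          (Buf[2] = $FE) and (Buf[3] = $FF) then
  begin
    Result := deUTF32BE;
    BomLen := 4;
  end
  else if (N >= 3) and (Buf[0] = $EF) and (Buf[1] = $BB) and
          (Buf[2] = $BF) then
  begin
    Result := deUTF8;
    BomLen := 3;
  end
  else if (N >= 2) and (Buf[0] = $FF) and (Buf[1] = $FE) then
  begin
    Result := deUTF16LE;
    BomLen := 2;
  end
  else if (N >= 2) and (Buf[0] = $FE) and (Buf[1] = $FF) then
  begin
    Result := deUTF16BE;
    BomLen := 2;
  end;
end;

const
  // UTF-16LE BOM followed by the letter 'A'
  Sample: array[0..3] of Byte = ($FF, $FE, $41, $00);
var
  Len: Integer;
begin
  WriteLn(DetectBOM(Sample, Len), ' ', Len);
end.
```

One known ambiguity of this approach: a UTF-16LE file whose first character
after the BOM is U+0000 would be taken for UTF-32LE. A file without any BOM
yields deUnknown, in which case an explicit setting from SetTextCodePage
would presumably still have to be honoured.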
> Then any subsequent call to ReadLn with any destination variable
> (ansistring, unicodestring, integer, etc.) will simply do something
> - read from file byte sequence, which will be interpreted as UTF-16 so
> we will have on input UnicodeString
Just a comment - if already adding this support, we should IMHO allow
UTF-32 as well.
> - this UnicodeString will be further transliterated to requested
> destination variable (as there are in FPC implicit conversions between
> UnicodeString and AnsiString this would be no problem)
> (for Write(Ln) same will happen only in reverse order: source variable
> -> UnicodeString -> Write to File)
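
Until something like this exists in the RTL, the pipeline described above
(bytes -> UnicodeString -> destination type) can be approximated in user
code. The following is a rough sketch under the assumption that the raw
bytes have already been read (e.g. with BlockRead) and that any BOM has
been skipped; DecodeUTF16 is a hypothetical helper, not an RTL function.
UTF8Encode is used for the second step because it does not depend on an
installed widestring manager.

```pascal
program Utf16Decode;
{$mode objfpc}{$H+}

// Hypothetical helper: turns raw UTF-16 bytes into a UnicodeString.
// Surrogate pairs need no special handling, since UnicodeString itself
// stores UTF-16 code units.
function DecodeUTF16(const Buf: array of Byte;
                     BigEndian: Boolean): UnicodeString;
var
  I, N: Integer;
begin
  Result := '';
  N := Length(Buf) div 2;  // number of 16-bit code units; an odd trailing byte is ignored
  SetLength(Result, N);
  for I := 0 to N - 1 do
    if BigEndian then
      Result[I + 1] := WideChar(Word((Buf[2 * I] shl 8) or Buf[2 * I + 1]))
    else
      Result[I + 1] := WideChar(Word(Buf[2 * I] or (Buf[2 * I + 1] shl 8)));
end;

const
  // 'Hi' followed by a line feed, encoded as UTF-16LE without a BOM
  Sample: array[0..5] of Byte = ($48, $00, $69, $00, $0A, $00);
var
  U: UnicodeString;
  A: UTF8String;
begin
  U := DecodeUTF16(Sample, False);  // step 1: bytes -> UnicodeString
  A := UTF8Encode(U);               // step 2: UnicodeString -> destination type
  WriteLn(Length(U), ' ', Length(A));
end.
```

Splitting the result into lines (on CR, LF or CR+LF) would then be done on
the UnicodeString, which is what ReadLn would have to do internally as well.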
> If SetTextCodePage(CP_UTF16) is not appropriate, then we must IMO
> introduce any new procedure which will give to user possibility signal
> that "I have UTF-16 encoded text file" or "I want that all writes to
> my text file should be encoded UTF-16".
> (but personally I do not see a reason to introduce a new procedure, as
> SetTextCodePage fits perfectly for me)
See above - a new procedure may not be needed, but I'd prefer a new text
file record field in the background for better efficiency and
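
As an illustration of what such a field could hold, here is a small sketch
that derives the size of one code unit (1, 2 or 4 bytes) from the code page
once, so that the read/write branches would only need to look at the size.
The function name CodeUnitSize and the local constants are assumptions made
for this example; the numeric values are the usual Windows code page
identifiers.

```pascal
program CodeUnitSizeSketch;
{$mode objfpc}{$H+}

const
  // Declared locally so the sketch does not depend on which CP_xxx
  // constants the RTL happens to export.
  cpUTF16LE = 1200;
  cpUTF16BE = 1201;
  cpUTF32LE = 12000;
  cpUTF32BE = 12001;

// Hypothetical helper: would be evaluated once (e.g. when the code page
// is set) and its result stored in the text file record.
function CodeUnitSize(CodePage: Word): Byte;
begin
  case CodePage of
    cpUTF16LE, cpUTF16BE: Result := 2;
    cpUTF32LE, cpUTF32BE: Result := 4;
  else
    Result := 1;  // single-byte and multi-byte 8-bit encodings, incl. UTF-8
  end;
end;

begin
  WriteLn(CodeUnitSize(1200), ' ', CodeUnitSize(12001), ' ', CodeUnitSize(65001));
end.
```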
> So firstly we need design/proposal, which is/will be accepted.
> (probably here is needed deeper knowledge of RTL internals so it is
> reason why also others core developers should step in)
Right. See my input above for my current thoughts. In the end, we should
preferably extend the FPC Unicode handling page in the Wiki; in the
meantime, a new page may be used for documenting the specification.
Before doing that, I'd still want to hear the opinion from Jonas, Marco
and Michael - I'll ask them.