[fpc-pascal] Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

Tomas Hajny XHajT03 at hajny.biz
Fri Sep 6 15:22:36 CEST 2019


On 2019-09-06 07:24, LacaK wrote:
> From user POV we have this situation:
> - on one side there is input text file encoded UTF-16 (either LE or BE)
> - on other side there is FPC, where RTL procedures like AssignFile,
> SetTextCodePage, Reset, Read(Ln), Write(Ln) are available.
> 
> My original intention was simply use call to existing procedure
> SetTextCodePage with parameter CP_UTF16, which in my opinion will
> simply signal to RTL, that input/output text file is/should be encoded
> using UTF16.

Yes, I believe that extending SetTextCodePage to support UTF-16 makes 
sense (with certain caveats, e.g. that it must be called before Rewrite 
when creating a new file, otherwise the BOM will not be added to the 
beginning of the file). The other question is what needs to happen 
within the text file record. As mentioned in my other post, I'd prefer 
adding a new field specifying the code unit size rather than having to 
check for specific codepage values in every code branch that would need 
to be created to handle the difference.

Moreover, the case of opening a file is somewhat trickier, because the 
file may have the encoding specified within the file itself. Would we 
add code for reading the first bytes every time Reset is called for a 
text file not associated with another device (console) and set the 
fields in the text file record (possibly overriding an explicit setting 
from SetTextCodePage)? Personally, I'd do so, but others may have a 
different opinion.
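
For illustration only, the check of the first bytes could look roughly 
like the sketch below. TBomKind, StartsWith and SniffBom are 
hypothetical names, not part of the RTL, and the sketch works on a byte 
buffer rather than on the text file record. It reports the BOM found, 
its length, and the code unit size it implies.

```pascal
program BomSniff;
{$mode objfpc}{$H+}

type
  // Hypothetical type, not part of the RTL.
  TBomKind = (bkNone, bkUtf8, bkUtf16LE, bkUtf16BE, bkUtf32LE, bkUtf32BE);

// Returns True if Buf begins with the byte sequence Sig.
function StartsWith(const Buf, Sig: array of Byte): Boolean;
var
  I: Integer;
begin
  Result := Length(Buf) >= Length(Sig);
  if Result then
    for I := 0 to High(Sig) do
      if Buf[I] <> Sig[I] then
        Exit(False);
end;

// Reports the BOM at the start of Buf, its length in bytes, and the
// code unit size (1, 2 or 4) that it implies. Without a BOM the result
// is bkNone with a unit size of 1, i.e. an explicit setting would
// still be needed.
function SniffBom(const Buf: array of Byte;
  out BomLen, UnitSize: Integer): TBomKind;
begin
  // UTF-32 LE is tested before UTF-16 LE: FF FE 00 00 begins with FF FE.
  if StartsWith(Buf, [$FF, $FE, $00, $00]) then
    begin Result := bkUtf32LE; BomLen := 4; UnitSize := 4; end
  else if StartsWith(Buf, [$00, $00, $FE, $FF]) then
    begin Result := bkUtf32BE; BomLen := 4; UnitSize := 4; end
  else if StartsWith(Buf, [$FF, $FE]) then
    begin Result := bkUtf16LE; BomLen := 2; UnitSize := 2; end
  else if StartsWith(Buf, [$FE, $FF]) then
    begin Result := bkUtf16BE; BomLen := 2; UnitSize := 2; end
  else if StartsWith(Buf, [$EF, $BB, $BF]) then
    begin Result := bkUtf8; BomLen := 3; UnitSize := 1; end
  else
    begin Result := bkNone; BomLen := 0; UnitSize := 1; end;
end;

var
  Kind: TBomKind;
  BomLen, UnitSize: Integer;
begin
  // First bytes of a UTF-16 LE file containing the letter "A".
  Kind := SniffBom([$FF, $FE, $41, $00], BomLen, UnitSize);
  WriteLn(Kind, ' ', BomLen, ' ', UnitSize);
end.
```

One known ambiguity of this ordering: a UTF-16 LE file whose first 
character after the BOM is U+0000 would be taken for UTF-32 LE.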


> Then any subsequent call to ReadLn with any destination variable
> (ansistring, unicodestring, integer, etc.) will simply do something
> like:
> - read from file byte sequence, which will be interpreted as UTF-16 so
> we will have on input UnicodeString

Just a comment - if already adding this support, we should IMHO allow 
UTF-32 as well.


> - this UnicodeString will be further transliterated to requested
> destination variable (as there are in FPC implicit conversions between
> UnicodeString and AnsiString this would be no problem)

Yes.
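
As a rough sketch of the two steps described above (bytes to 
UnicodeString, then UnicodeString to the destination type), assuming the 
BOM has already been skipped and the byte order is known: DecodeUtf16 is 
a hypothetical helper, not an RTL routine, and it copies code units 
as-is without validating surrogate pairs.

```pascal
program Utf16Decode;
{$mode objfpc}{$H+}

// Hypothetical helper: assembles UTF-16 code units from raw bytes.
// A trailing odd byte is ignored.
function DecodeUtf16(const Buf: array of Byte;
  BigEndian: Boolean): UnicodeString;
var
  I, N: Integer;
  W: Word;
begin
  Result := '';
  N := Length(Buf) div 2;
  SetLength(Result, N);
  for I := 0 to N - 1 do
  begin
    if BigEndian then
      W := (Word(Buf[2 * I]) shl 8) or Buf[2 * I + 1]
    else
      W := Buf[2 * I] or (Word(Buf[2 * I + 1]) shl 8);
    Result[I + 1] := WideChar(W);
  end;
end;

var
  U: UnicodeString;
begin
  // "Hi" encoded as UTF-16 LE.
  U := DecodeUtf16([$48, $00, $69, $00], False);
  WriteLn(Length(U));
  // Explicit conversion to UTF-8 for output; an implicit conversion to
  // AnsiString would depend on the installed widestring manager.
  WriteLn(UTF8Encode(U));
end.
```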


> (for Write(Ln) same will happen only in reverse order: source variable
> -> UnicodeString -> Write to File)
> 
> If SetTextCodePage(CP_UTF16) is not appropriate, then we must IMO
> introduce any new procedure which will give to user possibility signal
> that "I have UTF-16 encoded text file" or "I want that all writes to
> my text file should be encoded UTF-16".
> (but personally I do not see reason to introduce new procedure as
> SetTextCodePage for me perfectly fit)

See above - a new procedure may not be needed, but I'd prefer a new text 
file record field in the background for better efficiency and 
maintainability.


> So firstly we need design/proposal, which is/will be accepted.
> (probably here is needed deeper knowledge of RTL internals so it is
> reason why also others core developers should step in)

Right. See my input above for my current thoughts. In the end, we should 
preferably extend the FPC Unicode handling page in the Wiki; in the 
meantime, a new page may be used for documenting the specification. 
Before doing that, I'd still want to hear the opinion from Jonas, Marco 
and Michael - I'll ask them.

Tomas

