[fpc-pascal] Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

Mon Sep 16 00:20:22 CEST 2019

On 2019-09-06 15:22, Tomas Hajny wrote:
> On 2019-09-06 07:24, LacaK wrote:

Hi *,

As promised, I discussed the idea of adding support for UTF-16 encoded
text files (and preferably UTF-32 as well while at it) to the RTL with
other core team members. Overall, I didn't come across anybody opposing
this idea. The only (logical) requirement is taking care of the
performance implications of this change, i.e. avoiding a considerable
performance decrease in the processing of 8-bit encoded files. This is
actually one of the reasons for my suggestion to add codepoint size
information to the text file record and use that, instead of checking
individual values of the codepage variable to work out the codepoint
size every time the file is accessed (see below).

> Yes, I believe that extending SetTextCodePage with supporting UTF-16
> makes sense (with certain caveats like that calling it should be
> performed before Rewrite in case of new files creation, or otherwise
> the BOM mark will not be added to the beginning of the file). The
> other question is what needs to happen within the text file record -
> as mentioned in my other post, I'd prefer adding a new field
> specifying the codepoint size rather than having to check for specific
> codepage values in all code branches which would need to be created
> for handling the difference.
> 
> Moreover, the case of opening a file is somewhat trickier, because the
> file may have the encoding specified within the file itself. Would we
> add code for reading the first bytes every time Reset is called for a
> text file not associated with another device (console) and set the
> fields in the text file record (possibly overriding an explicit
> setting from SetTextCodePage)? Personally, I'd do so, but others may
> have a different opinion.
  .
  .

After the discussion with some people from the core team, I suggest the 
following:

1) A new attribute for the codepoint size will be added to the text file
record. All the text file I/O needs to be checked and, where necessary,
extended to use this attribute instead of the current implicit
assumption that the codepoint size is always 1 byte.

2) Support for UTF-16BE/LE and UTF-32BE/LE will be added to
SetTextCodePage; the new codepoint size attribute will be updated as
appropriate.

3) A new function 'DetectUtfBom (var T: text): boolean' will be added.
This function may be called after the call to 'Reset (T: text)' to check
for the existence of a BOM at the beginning of the text file. If one is
found (Result=true), SetTextCodePage is invoked automatically from
DetectUtfBom with the codepage value corresponding to the found BOM and
encoding variant. If no BOM is found (Result=false), nothing changes.

4) A new procedure 'SetUtfBom (var T: text; CodePage: word; BOM:
boolean)' will be added. This procedure may be called after the call to
Rewrite and allows writing a BOM to the respective text file.
SetTextCodePage with the respective value will be called from SetUtfBom.
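
To make the BOM check in point 3 more concrete, here is a rough sketch
of how the byte patterns could be mapped to a codepage and a codepoint
size. It is only an illustration and not the proposed RTL code: it works
on a plain byte buffer rather than on the text file record, the helper
name DetectBomInBuffer and the cpXxx constants are made up for this
example, and including the UTF-8 BOM is my assumption. The numeric
values are the usual Windows codepage identifiers.

```pascal
program BomSketch;

{$mode objfpc}{$H+}

const
  { Windows codepage identifiers, declared locally so that the sketch
    does not rely on any particular set of CP_* constants in the RTL. }
  cpUtf8    = 65001;
  cpUtf16LE = 1200;
  cpUtf16BE = 1201;
  cpUtf32LE = 12000;
  cpUtf32BE = 12001;

{ Hypothetical helper, not an RTL routine. Inspects the first Len bytes
  of Buf. On a match it returns True and reports the codepage, the BOM
  length in bytes and the code unit size in bytes (the "codepoint size"
  of the proposal). Otherwise it returns False and reports
  CodePage = 0, BomLen = 0, UnitSize = 1. }
function DetectBomInBuffer(const Buf: array of Byte; Len: Integer;
  out CodePage: Word; out BomLen, UnitSize: Integer): Boolean;
begin
  Result := True;
  { UTF-32 has to be tested before UTF-16, because the UTF-32LE BOM
    (FF FE 00 00) starts with the UTF-16LE BOM (FF FE). A UTF-16LE file
    whose first character is U+0000 would be misdetected here. }
  if (Len >= 4) and (Buf[0] = $FF) and (Buf[1] = $FE) and
     (Buf[2] = $00) and (Buf[3] = $00) then
  begin
    CodePage := cpUtf32LE; BomLen := 4; UnitSize := 4;
  end
  else if (Len >= 4) and (Buf[0] = $00) and (Buf[1] = $00) and
          (Buf[2] = $FE) and (Buf[3] = $FF) then
  begin
    CodePage := cpUtf32BE; BomLen := 4; UnitSize := 4;
  end
  else if (Len >= 3) and (Buf[0] = $EF) and (Buf[1] = $BB) and
          (Buf[2] = $BF) then
  begin
    CodePage := cpUtf8; BomLen := 3; UnitSize := 1;
  end
  else if (Len >= 2) and (Buf[0] = $FF) and (Buf[1] = $FE) then
  begin
    CodePage := cpUtf16LE; BomLen := 2; UnitSize := 2;
  end
  else if (Len >= 2) and (Buf[0] = $FE) and (Buf[1] = $FF) then
  begin
    CodePage := cpUtf16BE; BomLen := 2; UnitSize := 2;
  end
  else
  begin
    { No BOM: the caller keeps whatever codepage was set before. }
    CodePage := 0; BomLen := 0; UnitSize := 1;
    Result := False;
  end;
end;

const
  Utf16LeSample: array[0..3] of Byte = ($FF, $FE, $48, $00); { BOM + 'H' }
  PlainSample:   array[0..1] of Byte = ($48, $69);           { 'Hi', no BOM }

var
  CP: Word;
  BomLen, UnitSize: Integer;

begin
  if DetectBomInBuffer(Utf16LeSample, Length(Utf16LeSample),
                       CP, BomLen, UnitSize) then
    WriteLn('codepage ', CP, ', BOM ', BomLen, ' bytes, unit size ',
            UnitSize);
  if not DetectBomInBuffer(PlainSample, Length(PlainSample),
                           CP, BomLen, UnitSize) then
    WriteLn('no BOM found, current codepage stays as it is');
end.
```

In the real implementation the bytes would presumably come from the text
file buffer right after Reset, and a successful match would end in a
call to SetTextCodePage plus skipping BomLen bytes, but those details
depend on how points 1) and 2) turn out.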

Comments, anybody?

Tomas