[fpc-devel] fpdoc and unicode characters

Sergei Gorelkin sergei_gorelkin at mail.ru
Thu Aug 14 14:24:32 CEST 2008


Graeme Geldenhuys wrote:
> On Thu, Aug 14, 2008 at 1:14 PM, Marco van de Voort <marcov at stack.nl> wrote:
>>> How does this argument fit with XML which also uses UTF-8 as the de
>>> facto standard encoding. And seeing that fpdoc uses XML for the
>>> documentation files, can I use the actual Unicode characters in my
>>> fpdoc documentation, or must I still stick with the (what now seems to
>>> be outdated) escaped method?
>> Depends. Is & a steering character in all of XML, or only the xhtml like
>> standards?
> 
> I think only XHTML.
> 
XML too. In XML, you *must* escape the ampersand (U+0026) and the less-than
sign (U+003C). The greater-than sign (U+003E) must also be escaped when it
is preceded by the ']]' sequence. Additionally, in attribute values, quotes
(U+0022) must be escaped if they are used as the value delimiters (the other
option is to delimit values with apostrophes (U+0027)).
Here I mean the XML file, not the DOM tree. You may freely use the
mentioned characters as plain text while manipulating the DOM; the writer
will escape them on output.
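
The file-versus-DOM distinction above can be illustrated with a small sketch. It uses Python's standard xml.etree.ElementTree as a stand-in, not the FPC DOM units that fpdoc itself uses; the helper names serialize and is_well_formed are made up for this example.

```python
import xml.etree.ElementTree as ET


def serialize(text, title):
    """Build a one-element tree in memory and serialize it to an XML string."""
    elem = ET.Element("p", {"title": title})
    elem.text = text  # raw characters are fine at the tree level
    return ET.tostring(elem, encoding="unicode")


def is_well_formed(xml_source):
    """Return True if the given XML text parses, False otherwise."""
    try:
        ET.fromstring(xml_source)
    except ET.ParseError:
        return False
    return True


raw_text = "a < b && c ]]> d"
xml_out = serialize(raw_text, 'say "hi"')  # the writer escapes on output
print(xml_out)

# In the file itself the same characters must already be escaped.
print(is_well_formed("<p>a & b</p>"))      # bare ampersand in the file
print(is_well_formed("<p>a &amp; b</p>"))  # escaped ampersand in the file
```

Note that this particular writer escapes every greater-than sign, which is stricter than XML requires but harmless.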

> But what is fpdoc's xml files?  Pure XML, XHTML or some custom/hybrid
> format? The layout of fpdoc's files seem XML, but the documentation
> content seems some hybrid HTML - hence the confusion with what is
> allowed!
> 
XHTML is XML with a defined 'vocabulary' (DTD). The two formats have no
character-level differences.

> Anybody know the rules of strict XML files and Unicode?  Can I use
> Unicode characters as data in XML nodes? I would imagine I may because
> most well-formed XML files specify UTF-8 as the encoding type.
> 
> Also something I think has been resolved in recent versions, but in
> older 'makeskel' versions, it did not include the encoding in the
> generated .xml file.  So what are we supposed to treat such files
> encoding as? Default to W3C standards and assume UTF-8?  LCL and
> fpGUI's fpdoc documentation (mostly) has no encoding specified in the
> .xml files.  FPC's documentation specifies ISO8859-1 as the encoding
> type, though I found one file (dateutils.xml) in FPC docs that hasn't
> got an encoding (but my doc update is out of date).
> 
W3C demands that an XML file without an encoding label be treated as
UTF-8 (unless it has a UTF-16 BOM, in which case it should be treated
as UTF-16). Therefore UTF-8 labeling is optional.
In older times, makeskel used to write an 'ISO8859-1' label, which btw is
invalid (the IANA-recognized names are ISO-8859-1 and ISO_8859-1). Later,
when the parser became more compliant, the labeling was removed. The parser
has a workaround to understand the ISO8859-1 labeling.
The XML writer always produces UTF-8 encoding and writes no label.
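
As a rough illustration of the defaulting rule above, here is a simplified sketch in Python. It is not the FPC parser's code, and it skips most of the autodetection procedure from the XML specification; guess_xml_encoding and the alias table are hypothetical names that only mimic the ISO8859-1 workaround described here.

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_DECL_RE = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

# Hypothetical alias table: map the invalid legacy label to a registered name.
_ALIASES = {"ISO8859-1": "ISO-8859-1"}


def guess_xml_encoding(data):
    """Guess the encoding of raw XML bytes (simplified, assumption-laden)."""
    # A UTF-16 byte order mark wins over everything else.
    if data[:2] in (b"\xff\xfe", b"\xfe\xff"):
        return "UTF-16"
    # Skip an optional UTF-8 byte order mark before looking for a label.
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]
    match = _DECL_RE.match(data)
    if match is None:
        return "UTF-8"  # no label: the default applies
    label = match.group(1).decode("ascii")
    return _ALIASES.get(label.upper(), label)


print(guess_xml_encoding(b'<?xml version="1.0"?><a/>'))
print(guess_xml_encoding(b'<?xml version="1.0" encoding="ISO8859-1"?><a/>'))
```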

To summarize: Unicode can be used in fpdoc xml files. If the file has an
ISO8859-1 encoding label, it should be removed or replaced with a UTF-8
label. The output stages of fpdoc may or may not have problems with
Unicode - that requires additional research.
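
For files that carry the old label, a conversion could look roughly like the sketch below. It assumes the file content really is Latin-1, so the bytes are transcoded together with the label (changing the label alone would break any non-ASCII characters). The function name latin1_xml_to_utf8 is made up for this example and is not part of fpdoc or makeskel.

```python
import re

# Matches an ISO8859-1 / ISO-8859-1 / ISO_8859-1 label in the XML declaration.
_LABEL_RE = re.compile(
    r'(<\?xml[^>]*?encoding\s*=\s*["\'])ISO_?-?8859-1(["\'])',
    re.IGNORECASE,
)


def latin1_xml_to_utf8(data):
    """Transcode Latin-1 XML bytes to UTF-8 and relabel the declaration."""
    text = data.decode("latin-1")
    # Rewrite only the first declaration; the quote style is kept as found.
    text = _LABEL_RE.sub(r"\g<1>UTF-8\g<2>", text, count=1)
    return text.encode("utf-8")


source = '<?xml version="1.0" encoding="ISO8859-1"?><p>caf\xe9</p>'.encode("latin-1")
converted = latin1_xml_to_utf8(source)
print(converted.decode("utf-8"))
```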

Sergei



