[fpc-devel] XML Components

Michael Van Canneyt michael at freepascal.org
Fri Nov 2 14:08:58 CET 2012



On Fri, 2 Nov 2012, Andrew Brunner wrote:

>> As a consequence, the codepage in the XML must be checked and converted if need be.
>>
> The input data in the example attached is converted.

There is no attachment to your mail.

>
>
>> Imagine you have an XML file encoded in UTF-16, and we assume it's UTF-8. The resulting DOM tree would be unusable.
>>
>
> True.
>
>
>>> Any help or feedback is entirely welcome and needed.  This data is currently in at least one stream and is breaking my cloud desktop sync application.
>>
>> You'll have to write your own XML handling routines which work only with the codepage the XML is in. And be prepared that they will fail as soon as the encoding of the XML changes.
>>
>
> Right.  But converting the data to, say, UTF-8 should have worked.  I have explicitly set the encoding to UTF-8 in the header.

Without looking at the data and the errors you get, it's impossible to say anything useful.
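One thing that can be checked without the original data is whether the encoding named in the header agrees with what the bytes themselves say. The following is only a sketch in Python, not Free Pascal, and it does not reflect how the FPC XML units work internally; the function names `bom_encoding` and `declared_encoding` are made up for illustration.

```python
import re

# Byte-order marks, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]


def bom_encoding(data):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for mark, name in BOMS:
        if data.startswith(mark):
            return name
    return None


def declared_encoding(data):
    """Return the encoding named in the XML declaration, or None.

    Only works when the declaration itself is stored in an
    ASCII-compatible encoding (UTF-8, Latin-1, ...).
    """
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]  # skip a UTF-8 BOM before looking for "<?xml"
    m = re.match(rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']', data)
    return m.group(1).decode("ascii").lower() if m else None
```

If `bom_encoding` reports UTF-16 while the header text says UTF-8, the header is simply wrong about the bytes, and a parser that trusts it will fail.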

>
>
>>>
>>> I would really love an option to disable byte-for-byte checking of the XML during parsing.
>
> I think it would be a good solution and even prove faster in controlled environments.  Plus all data is stored as widestrings in the DOM.
>
> The first question I have is: if there were such an option, would the patch be accepted?

I don't see how you can fix the problem that way. If the input is UTF-8, and the result must be
converted to a widestring for the DOM, then a conversion MUST take place; there is no way to avoid it.
And a conversion means scanning the input byte for byte.
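To make the point concrete, here is a minimal sketch of a UTF-8 to UTF-16 conversion, written in Python purely for illustration (the function name `utf8_to_utf16_units` is hypothetical and this is not the code used by the FPC DOM). Every input byte has to be read to compute the code points, and once a byte has been read, checking it for validity costs only a comparison or two on top.

```python
def utf8_to_utf16_units(data):
    """Convert UTF-8 bytes to a list of UTF-16 code units, validating as we go."""
    units = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        # The lead byte determines how many continuation bytes follow.
        if b < 0x80:
            cp, extra = b, 0
        elif 0xC2 <= b <= 0xDF:
            cp, extra = b & 0x1F, 1
        elif 0xE0 <= b <= 0xEF:
            cp, extra = b & 0x0F, 2
        elif 0xF0 <= b <= 0xF4:
            cp, extra = b & 0x07, 3
        else:
            raise ValueError("invalid lead byte 0x%02X at offset %d" % (b, i))
        if i + extra >= n:
            raise ValueError("truncated sequence at offset %d" % i)
        for k in range(1, extra + 1):
            c = data[i + k]
            if c & 0xC0 != 0x80:
                raise ValueError("invalid continuation byte at offset %d" % (i + k))
            cp = (cp << 6) | (c & 0x3F)
        # Reject overlong forms, surrogates and out-of-range code points.
        if (extra == 2 and cp < 0x800) or (extra == 3 and cp < 0x10000):
            raise ValueError("overlong sequence at offset %d" % i)
        if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            raise ValueError("invalid code point at offset %d" % i)
        if cp < 0x10000:
            units.append(cp)
        else:
            # Code points above the BMP become a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
        i += extra + 1
    return units
```

Dropping the validity checks would remove a handful of comparisons per character, but the loop over the bytes stays.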

In any case, the input must be scanned byte for byte anyway, to detect all the tags.
That's what makes XML slow and unusable for large amounts of data.

> The next question is: what is the problem with the UTF-8 routine, that it left the offending byte sequence intact without converting the bytes in my sample data?

Without an error message, it is impossible to tell.
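A report that would be useful here is the offset of the first byte that is not valid UTF-8, together with the bytes around it. A small sketch of how to produce one, again in Python and with a made-up helper name (`first_utf8_error`); it assumes the data is available as a byte string and says nothing about what the FPC conversion routine does with the same input.

```python
def first_utf8_error(data, context=8):
    """Return details of the first invalid UTF-8 sequence, or None if data is valid."""
    try:
        data.decode("utf-8")  # strict decoding: raises on the first bad byte
    except UnicodeDecodeError as err:
        lo = max(0, err.start - context)
        hi = min(len(data), err.end + context)
        return {
            "offset": err.start,              # index of the offending byte
            "reason": err.reason,             # decoder's description of the problem
            "bytes": data[lo:hi].hex(" "),    # hex dump of the surrounding bytes
        }
    return None
```

For example, a Latin-1 encoded "é" (byte E9) inside otherwise ASCII markup is reported at the position of that byte, which is usually enough to tell whether the data was really converted before it reached the parser.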

Michael.


