[fpc-pascal] XML DOM and HTML

Sebastian Günther sguenther at gmx.de
Sat Jun 21 04:06:01 CEST 2008


Johannes Nohl wrote:
> Dear list,
> 
> I played around with the units dom and xmlread. I liked them very
> much. Now I thought I could parse websites with them. But they are
> slightly different as far as I know. In XML everything is within a node
> while in HTML there is more than one value in a node. E.g.:
> 
> possible XML:
> 
> <div>
>  asdf1
>  <span>qwer1</span>
>  <span>qwer2</span>
> </div>
> 
> HTML:
> <div>
>  asdf1
>  <span>qwer1</span>
>  asdf2
>  <span>qwer2</span>
>  asdf3
> </div>
> 
> Using the XML DOM I can access the value "asdf1" only. I think the
> second example is not valid XML, is it?
> 
> Has anybody used XML to parse HTML-files? Is there a unit?


Yes.
HTML is based on SGML, and XML is a subset of SGML, so you cannot simply 
parse an arbitrary HTML file with an XML parser.
Instead of the XML parser, you can try the HTML parser in 
packages/fpc-xml/sax_html.pp. It relies on more or less correct HTML 
code, but it should be able to parse most websites.
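
As a rough illustration, here is a minimal sketch of how the HTML reader 
might be used on the second example from the question. It assumes the 
ReadHTMLFile procedure from sax_html and the THTMLDocument class from 
dom_html; the program name, the Sample constant and the PrintTextNodes 
helper are made up for this example. The point is that each piece of 
text between the <span> elements should show up as its own text node, 
so walking all children of a node (rather than reading a single value) 
gives access to every one of them.

```pascal
program WalkHtml;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, DOM_HTML, SAX_HTML;

const
  { Hypothetical input: the HTML example from the question, wrapped in
    <html> and <body> so that it forms a complete document. }
  Sample = '<html><body><div>asdf1<span>qwer1</span>asdf2' +
           '<span>qwer2</span>asdf3</div></body></html>';

{ Illustrative helper (not part of any FPC unit): walks all children of
  Node and prints element names and non-empty text nodes. }
procedure PrintTextNodes(Node: TDOMNode; const Indent: String);
var
  Child: TDOMNode;
  Txt: String;
begin
  Child := Node.FirstChild;
  while Child <> nil do
  begin
    if Child.NodeType = TEXT_NODE then
    begin
      { Skip text nodes that contain only whitespace. }
      Txt := Trim(String(Child.NodeValue));
      if Txt <> '' then
        WriteLn(Indent, 'text: ', Txt);
    end
    else if Child.NodeType = ELEMENT_NODE then
    begin
      WriteLn(Indent, 'element: ', String(Child.NodeName));
      { Descend so that nested text is reported as well. }
      PrintTextNodes(Child, Indent + '  ');
    end;
    Child := Child.NextSibling;
  end;
end;

var
  Doc: THTMLDocument;
  Stream: TStream;
begin
  Doc := nil;
  Stream := TStringStream.Create(Sample);
  try
    { Build a DOM tree from the HTML input via the SAX-based reader. }
    ReadHTMLFile(Doc, Stream);
    PrintTextNodes(Doc, '');
  finally
    Doc.Free;
    Stream.Free;
  end;
end.
```

To read a real page, the stream can be replaced by a file name, since 
ReadHTMLFile also accepts one. The same FirstChild/NextSibling loop 
works on a document read with the XML parser.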


Regards,
Sebastian


