[fpc-pascal] XML DOM and HTML

Michael Van Canneyt michael at freepascal.org
Sun Jun 8 18:06:52 CEST 2008



On Sun, 8 Jun 2008, Johannes Nohl wrote:

> Dear list, dear Michael!
> 
> > There are multiple problems with HTML parsing: HTML is not a well-formed
> > XML document, because
> > - the tags are case insensitive (in XML they are case sensitive)
> > - Not all tags must be closed.
> > If the HTML is XHTML, then the DOM unit can be used to parse it.
> 
> But how do I retrieve more than the first part of the node's value?
> 
> If I read in:
>  <div>
>   asdf1
>   <span>qwer1</span>
>   asdf2
>   <img src="" />
>   asdf3
>  </div>
> 
> FindNode('div').NodeValue returns "asdf1", but not asdf2 and asdf3.
> Isn't the example above valid XHTML?

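In the above, the node value is not well defined for the div node: the div
has mixed content, so its text is split across several child text nodes,
separated by the span and img elements.
The return value is IMHO correct. You will have to 'glue' the various text
parts together yourself.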
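
A minimal sketch of such 'gluing', assuming the DOM and XMLRead units from
the FCL. The helper name DirectText and the sample string are made up for
illustration, and the whitespace from the original example has been removed
so that only the three text parts remain. It walks the direct children of
the div and concatenates the values of the text nodes it finds:

```pascal
program GlueText;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, XMLRead;

const
  // Same structure as the example in the question, without whitespace.
  Sample =
    '<div>' +
    'asdf1' +
    '<span>qwer1</span>' +
    'asdf2' +
    '<img src="" />' +
    'asdf3' +
    '</div>';

// Hypothetical helper: concatenates the values of all text nodes that are
// direct children of Node. Text inside child elements (the span) is skipped.
function DirectText(Node: TDOMNode): DOMString;
var
  Child: TDOMNode;
begin
  Result := '';
  Child := Node.FirstChild;
  while Assigned(Child) do
  begin
    if Child.NodeType = TEXT_NODE then
      Result := Result + Child.NodeValue;
    Child := Child.NextSibling;
  end;
end;

var
  Doc: TXMLDocument;
  Src: TStringStream;
begin
  Src := TStringStream.Create(Sample);
  try
    // Parse the sample from the stream into a DOM document.
    ReadXMLFile(Doc, Src);
    try
      WriteLn(DirectText(Doc.DocumentElement));
    finally
      Doc.Free;
    end;
  finally
    Src.Free;
  end;
end.
```

For the sample above this should print asdf1asdf2asdf3. If the text inside
nested elements such as the span is wanted as well, the same loop can be
made recursive; newer versions of the DOM unit may also offer a TextContent
property for that, but check your version before relying on it.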


Michael.


