[fpc-pascal] XML DOM and HTML
Lee Jenkins
lee at datatrakpos.com
Thu Jun 12 22:45:11 CEST 2008
Johannes Nohl wrote:
> Dear list, dear Michael!
>
>> There are multiple problems with HTML parsing: HTML is not a well-formed
>> XML document, because
>> - the tags are case insensitive (in XML they are case sensitive)
>> - Not all tags must be closed.
>> If the HTML is XHTML, then the DOM unit can be used to parse it.
>
> But how do I retrieve more than the first part of the node's value?
>
> If I read in:
> <div>
> asdf1
> <span>qwer1</span>
> asdf2
> <img src="" />
> asdf3
> </div>
>
> FindNode('div').NodeValue returns "asdf1", but not asdf2 and asdf3.
> Isn't the example above valid XHTML?
>
If I were going to parse web pages, I would probably opt for regular
expressions. I believe a regex unit is included with FPC, but I tend to use
this one since it's compatible with both FPC and Delphi:
http://regexpstudio.com/TRegExpr/TRegExpr.html
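
A minimal sketch of what I mean, assuming the TRegExpr class is available
from a unit named RegExpr (as it is in current FPC releases; adjust the
uses clause if you install the library from the link above). The helper
name CollectTextChunks and the pattern are my own illustration, not part
of any library. It collects every non-blank run of text found between a
'>' and the next '<', so for the <div> sample above it also picks up the
text inside the <span>:

```pascal
program ExtractText;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, RegExpr;

// Hypothetical helper: adds every non-blank run of text that sits
// between a '>' and the following '<' to Chunks.
procedure CollectTextChunks(const Html: string; Chunks: TStrings);
var
  Re: TRegExpr;
  Chunk: string;
begin
  Re := TRegExpr.Create;
  try
    // One or more characters that are not '<', enclosed by '>' and '<'.
    Re.Expression := '>([^<]+)<';
    if Re.Exec(Html) then
      repeat
        Chunk := Trim(Re.Match[1]);
        if Chunk <> '' then
          Chunks.Add(Chunk);
      until not Re.ExecNext;
  finally
    Re.Free;
  end;
end;

const
  // The <div> fragment from the question, with #10 as the line break.
  Sample = '<div>'#10'asdf1'#10'<span>qwer1</span>'#10'asdf2'#10 +
           '<img src="" />'#10'asdf3'#10'</div>';

var
  Chunks: TStringList;
  I: Integer;
begin
  Chunks := TStringList.Create;
  try
    CollectTextChunks(Sample, Chunks);
    for I := 0 to Chunks.Count - 1 do
      WriteLn(Chunks[I]);
  finally
    Chunks.Free;
  end;
end.
```

This makes no attempt to understand nesting, comments, or '<' characters
inside attribute values, so treat it as a starting point for simple pages
rather than a real parser.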
--
Warm Regards,
Lee