[fpc-pascal] XML DOM and HTML

Thu Jun 12 22:45:11 CEST 2008

Johannes Nohl wrote:
> Dear list, dear Michael!
> 
>> There are multiple problems with HTML parsing: HTML is not a well-formed
>> XML document, because
>> - the tags are case insensitive (in XML they are case sensitive)
>> - Not all tags must be closed.
>> If the HTML is XHTML, then the DOM unit can be used to parse it.
> 
> But how do I retrieve more than the first part of the node's value?
> 
> If I read in:
>  <div>
>   asdf1
>   <span>qwer1</span>
>   asdf2
>   <img src="" />
>   asdf3
>  </div>
> 
> FindNode('div').NodeValue returns "asdf1". But not asdf2 and asdf3.
> Isn't the example above valid XHTML?
> 
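
On the DOM side of the question: in the quoted <div>, the text is most
likely stored as several separate text-node children (one before the
<span>, one after it, one after the <img>), so reading a single node's
value yields only the first piece. Below is a minimal sketch of collecting
all of them. It assumes FPC's DOM and XMLRead units and a version in which
TDOMNode.TextContent is available. DirectText and ParseString are
hypothetical helper names, and the sample string is the quoted markup with
the whitespace removed.

```pascal
program MixedContent;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, XMLRead;

const
  // The quoted markup, without the line breaks and indentation.
  Source = '<div>asdf1<span>qwer1</span>asdf2<img src="" />asdf3</div>';

// Hypothetical helper: concatenate only the text nodes that are
// direct children of Node (text inside <span> is skipped).
function DirectText(Node: TDOMNode): DOMString;
var
  Child: TDOMNode;
begin
  Result := '';
  Child := Node.FirstChild;
  while Child <> nil do
  begin
    if Child.NodeType = TEXT_NODE then
      Result := Result + Child.NodeValue;
    Child := Child.NextSibling;
  end;
end;

// Hypothetical helper: parse an XML string into a document.
// The caller owns the returned document and must free it.
function ParseString(const Xml: string): TXMLDocument;
var
  Stream: TStringStream;
begin
  Stream := TStringStream.Create(Xml);
  try
    ReadXMLFile(Result, Stream);
  finally
    Stream.Free;
  end;
end;

var
  Doc: TXMLDocument;
begin
  Doc := ParseString(Source);
  try
    // Text of the direct children only.
    WriteLn(DirectText(Doc.DocumentElement));
    // Text of the element and all of its descendants.
    WriteLn(Doc.DocumentElement.TextContent);
  finally
    Doc.Free;
  end;
end.
```

If the text inside nested elements such as <span> is wanted as well,
TextContent is probably the shorter route. The loop is only needed when
the nested text must be left out.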

If I were going to parse web pages, I would probably opt to use regular
expressions. I believe a regex unit is included with FPC, but I tend to
use this one since it's compatible with both FPC and Delphi:

http://regexpstudio.com/TRegExpr/TRegExpr.html
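
A rough sketch of what that could look like for the <div> example quoted
above. It assumes the TRegExpr unit is named RegExpr and is on the unit
path. ExtractText is a hypothetical helper name, and the pattern is only
an illustration, not a general HTML parser.

```pascal
program ExtractTextDemo;
{$mode objfpc}{$H+}

uses
  SysUtils, RegExpr;

// Hypothetical helper: collect every run of text that follows a tag,
// separated by single spaces.
function ExtractText(const Html: string): string;
var
  Re: TRegExpr;
begin
  Result := '';
  Re := TRegExpr.Create;
  try
    // '>' closes a tag; the group captures everything up to the next '<'.
    Re.Expression := '>([^<]+)';
    if Re.Exec(Html) then
      repeat
        Result := Result + Re.Match[1] + ' ';
      until not Re.ExecNext;
  finally
    Re.Free;
  end;
  Result := Trim(Result);
end;

begin
  WriteLn(ExtractText(
    '<div>asdf1<span>qwer1</span>asdf2<img src="" />asdf3</div>'));
end.
```

A pattern this simple will trip over comments, scripts, and attribute
values that contain '>', so it is only suitable for pages whose markup
you already know.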

-- 

Warm Regards,

Lee