[fpc-devel] fcl-xml

Sergei Gorelkin sergei_gorelkin at mail.ru
Mon Mar 23 14:01:42 CET 2009


Marco van de Voort wrote:
> (maillist maintainer/jonas: I wrote a similar message from a non-subscribed
> email addr. It can be discarded, sorry)
> 
> I needed an HTML parser, and am not in a hurry, so I decided to check FPC's
> own first, in the hope that I can at least make some documentation in the
> wiki/examples during the experience.
> 
> The first project is simple, see program below, executed on FPC's html
> documentation.  I noticed that it failed like this:
> 
> An unhandled exception occurred at $004284EC :
> EDOMError : EDOMError in DOMDocument.CreateElement hr/0
>   $004284EC
>   $00411A86  THTMLTODOMCONVERTER__READERSTARTELEMENT,  line 500 of
>   src/sax_html.pp
>   $0042648A  TSAXREADER__DOSTARTELEMENT,  line 738 of src/sax.pp
>   $004110DC  THTMLREADER__ENTERNEWSCANNERCONTEXT,  line 391 of
>   src/sax_html.pp
>   $00410C80  THTMLREADER__PARSE,  line 358 of src/sax_html.pp
>   $0042612C  TSAXREADER__PARSESTREAM,  line 647 of src/sax.pp
>   $00411F3D  READHTMLFILE,  line 609 of src/sax_html.pp
>   $00411E91  READHTMLFILE,  line 593 of src/sax_html.pp
>   $004015DE  main,  line 21 of saxattempt.dpr
> 
> Some debugging suggests that it fails on <hr/>. The doctype of the doc in
> question is
> 
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
> "http://www.w3.org/TR/html4/loose.dtd">
> 
> Some questions for the more xmlable:
> 1. is this correct? I think <hr/> is more xml notation than html notation?

For HTML this is not correct, but that file might happen to be XHTML. In
general, FPC's XML parser is much more developed than its HTML parser,
which is why many FPC tools actually write XHTML.

> 2. can I somehow convince (override) DOM to accept it? (since modifying the
> generator (tex4ht) might prove to be difficult). It could be genera

I think it would be better to fix sax_html.pp so that it either handles
this condition gracefully (strips the '/') or raises an exception. If it
raises an exception, that exception could carry the location information
you need.
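
A minimal sketch of the "strip '/'" idea, assuming the reader ends up with
an element name such as 'hr/' when it meets <hr/>. StripTrailingSlash is a
hypothetical helper, not an existing sax_html.pp routine, and where exactly
it would be called inside the reader is left open:

```pascal
program striptagslash;

{$mode objfpc}{$H+}

// Hypothetical helper, not part of sax_html.pp: removes a trailing '/'
// left over from XML-style empty-element syntax such as <hr/>.
function StripTrailingSlash(const TagName: string): string;
begin
  Result := TagName;
  // Keep a lone '/' untouched so the result is never an empty name.
  if (Length(Result) > 1) and (Result[Length(Result)] = '/') then
    SetLength(Result, Length(Result) - 1);
end;

begin
  Writeln(StripTrailingSlash('hr/'));  // prints: hr
  Writeln(StripTrailingSlash('p'));    // prints: p
end.
```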

> 3. Is there a way to have line numbers in the exceptions? Modifying the
> source with writeln's to find out which tag exactly goes wrong is a bit
> ugly.
> 
The exceptions generated by the parser contain this information (for XML,
this is EXMLReadError.Line and EXMLReadError.LinePos). sax_html seems not
to generate exceptions at all :(
The exceptions raised from DOM methods (like CreateElement) do not have
location information because these methods are primarily intended for
building a DOM tree from code, when there is no source file.
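
For the XML case, reading the location from the exception could look
roughly like this. It is only a sketch: it assumes the file is well-formed
enough to be tried with the XML reader at all, and 'index.html' is a
placeholder file name.

```pascal
program xmlerrpos;

{$mode objfpc}{$H+}

uses
  SysUtils, DOM, XMLRead;

var
  Doc: TXMLDocument;
begin
  Doc := nil;
  try
    try
      // 'index.html' is a placeholder; substitute the file to be checked.
      ReadXMLFile(Doc, 'index.html');
      Writeln('Parsed OK, root element: ', Doc.DocumentElement.NodeName);
    except
      // Only parser errors are handled here; they carry the position.
      on E: EXMLReadError do
        Writeln('Parse error at line ', E.Line, ', column ', E.LinePos,
                ': ', E.Message);
    end;
  finally
    Doc.Free;  // safe when Doc is still nil
  end;
end.
```

Other failures, such as a missing file, are deliberately not caught, so
they still surface as ordinary exceptions.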

> Note that I'm already happy with pointers where to start. Anybody willing to
> share private examples or documentation would be great too.
> 
> program saxattempt;
> 
> {$mode delphi}
> 
> Uses Sax_HTML,sysutils,classes,dom_html;
> 
> var d:TSearchRec;
>     sx : THTMLDocument;
>     Htmls: TStringList;
> begin
>   htmls:=TStringList.create;
>   if findfirst('*.html',faanyfile,d)=0 then
>     begin
>       repeat
>         writeln(d.name);
>         sx:=THtmlDocument.create;
>         ReadHtmlFile(sx,d.name);
>         htmls.addobject(d.name,sx);
>       until findnext(d)<>0;
>       findclose(d);
>     end;
> end.
>
Regards,
Sergei



