[fpc-devel] fcl-xml
Sergei Gorelkin
sergei_gorelkin at mail.ru
Mon Mar 23 14:01:42 CET 2009
Marco van de Voort wrote:
> (maillist maintainer/jonas: I wrote a similar message from a non-subscribed
> email addr. It can be discarded, sorry)
>
> I needed an HTML parser, and am not in a hurry, so I decided to check FPC's
> own first, in the hope that I can at least add some documentation or
> examples to the wiki along the way.
>
> The first project is simple, see program below, executed on FPC's html
> documentation. I noticed that it failed like this:
>
> An unhandled exception occurred at $004284EC :
> EDOMError : EDOMError in DOMDocument.CreateElement hr/0
> $004284EC
> $00411A86 THTMLTODOMCONVERTER__READERSTARTELEMENT, line 500 of
> src/sax_html.pp
> $0042648A TSAXREADER__DOSTARTELEMENT, line 738 of src/sax.pp
> $004110DC THTMLREADER__ENTERNEWSCANNERCONTEXT, line 391 of
> src/sax_html.pp
> $00410C80 THTMLREADER__PARSE, line 358 of src/sax_html.pp
> $0042612C TSAXREADER__PARSESTREAM, line 647 of src/sax.pp
> $00411F3D READHTMLFILE, line 609 of src/sax_html.pp
> $00411E91 READHTMLFILE, line 593 of src/sax_html.pp
> $004015DE main, line 21 of saxattempt.dpr
>
> Some debugging suggests that it fails on <hr/>. The doctype of the document
> in question is
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
> "http://www.w3.org/TR/html4/loose.dtd">
>
> Some questions for the more xmlable:
> 1. is this correct? I think <hr/> is more xml notation than html notation?
For HTML this is not correct, but that file might happen to be XHTML. In
general, FPC's XML parser is much more developed than its HTML parser,
which is why many FPC tools actually write XHTML.
> 2. can I somehow convince (override) DOM to accept it? (since modifying the
> generator (tex4ht) might prove to be difficult). It could be genera
I think it would be better to fix sax_html.pp so that it either handles this
condition gracefully (strips the '/') or raises an exception. If it raises an
exception, that exception could contain the location information you need.
> 3. Is there a way to have line numbers in the exceptions? Modifying the
> source with writeln's to find out which tag exactly goes wrong is a bit
> ugly.
>
The exceptions generated by the parser contain this information (for XML,
this is EXMLReadError.Line and EXMLReadError.LinePos). sax_html seems not
to generate exceptions at all :(
The exceptions raised from DOM methods (like CreateElement) do not have
location information because these methods are primarily intended for
building a DOM tree from code, when there is no source file.
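For the XML case, reading that location information could look roughly like
the sketch below. It assumes the DOM and XMLRead units from fcl-xml; the
program name and taking the file name from the command line are illustrative
only, and this does not apply to sax_html.

```pascal
program xmlerrpos;  { hypothetical name, for illustration only }

{$mode objfpc}{$H+}

uses
  SysUtils, DOM, XMLRead;

var
  Doc: TXMLDocument;
begin
  Doc := nil;
  try
    try
      { The file name is taken from the first command-line argument. }
      ReadXMLFile(Doc, ParamStr(1));
      WriteLn('Parsed without errors');
    except
      on E: EXMLReadError do
        { Line and LinePos give the position at which the parser stopped. }
        WriteLn('Line ', E.Line, ', column ', E.LinePos, ': ', E.Message);
    end;
  finally
    Doc.Free;  { Free is safe on nil if parsing failed }
  end;
end.
```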
> Note that I'm already happy with pointers where to start. Anybody willing to
> share private examples or documentation would be great too.
>
> program saxattempt;
>
> {$mode delphi}
>
> Uses Sax_HTML,sysutils,classes,dom_html;
>
> var d:TSearchRec;
>     sx : THTMLDocument;
>     Htmls: TStringList;
> begin
>   htmls:=TStringList.create;
>   if findfirst('*.html',faanyfile,d)=0 then
>     begin
>       repeat
>         writeln(d.name);
>         sx:=THtmlDocument.create;
>         ReadHtmlFile(sx,d.name);
>         htmls.addobject(d.name,sx);
>       until findnext(d)<>0;
>       findclose(d);
>     end;
> end.
>
Regards,
Sergei