[fpc-pascal] Re: stripping HTML
Ralf Junker
ralfjunker at gmx.de
Sun Apr 17 15:46:15 CEST 2011
Unlike many other text-based formats, HTML is not meant to be handled
line by line: according to the specs, HTML is not line-based. Browsers
should display the following two HTML snippets identically (#13#10
stands for a CR LF line break):
<p#13#10>one#13#10two</p>
and
<p>one#13#10two</p#13#10>
With the HTML tags removed, both result in:
one two
As such, a line-based text/markup ratio does not make much sense IMHO,
especially since browsers collapse line breaks into ordinary white space
in most text elements, except within <pre> ... </pre>.
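To illustrate the point, here is a minimal sketch. It is written in
Python rather than Pascal purely for brevity, and it uses Python's
standard html.parser module, not DIHtmlParser. The names TextExtractor
and extract_text are made up for this example. It extracts the text from
the two snippets above and collapses white space roughly the way a
browser does outside <pre>:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects character data and ignores all tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; line breaks are still present here.
        self.chunks.append(data)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse every run of white space (including CR LF) into one space,
    # roughly what a browser does outside <pre>.
    return " ".join("".join(parser.chunks).split())


# The two snippets from above, with #13#10 written as \r\n.
snippet_a = "<p\r\n>one\r\ntwo</p>"
snippet_b = "<p>one\r\ntwo</p\r\n>"

print(extract_text(snippet_a))
print(extract_text(snippet_b))
```

Both calls print "one two", although the line breaks sit in different
places in the source. A per-line count would treat the two inputs
differently even though they carry the same content.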
That said, I believe that DIHtmlParser should cover most of your needs:
http://yunqa.de/delphi/doku.php/products/htmlparser/index
It meets most of your requirements:
* Not DOM based, very fast.
* Hand-crafted, linear-scan Unicode HTML parser.
* Handles SCRIPTs and STYLEs well.
* Simple "Extract Text" demo included, may be modified as needed.
Drawbacks:
* Like HTML, DIHtmlParser is not line-based. An option is available
to strip or preserve line breaks and white space.
* Pre-compiled units are available for Delphi only. Compiling with
FreePascal requires the source code.
Ralf
On 17.04.2011 14:08, Roland Schäfer wrote:
> I feel I have to justify myself: I always do extensive web and list
> archive searches before posting to a list (hence the infrequency of my
> posts). I had actually found that snippet over a week ago but
> immediately discarded it since it is obviously a toy solution. I have a
> much better solution already using the PCRE library on a text stream,
> sometimes re-reading portions of the stream by way of backtracking. The
> problems with any approach like that (esp. 6-liners like the one linked
> in your post, but also more elaborate but still makeshift regular
> expression magic) are:
>
> 1. They don't handle faulty HTML well enough.
>
> 2. They don't handle any multi-line constructs like comments or scripts.
> Depending on how naively you read the input (e.g., using
> TStringList.LoadFromFile), they even choke on simple tags with all sorts
> of line breaks in between, which are frequently found (and which are, to
> my knowledge, not even ill-formed). What do you do with this (for a start)?
>
> '<div class="al#13#10#13ert">'
>
> 3. They are potentially not the most efficient solution, which is an
> important factor if the stripping alone takes days.
>
> As a clarification: I am mining several ~500GB results of Heritrix
> crawls containing all versions of XML, HTML, inline CSS, inline scripts,
> etc. They need to be accurately stripped of HTML/XML (accurately means
> without losing too much real text). The text/markup ratio has to be
> calculated and stored on a per-line basis since I'm applying a machine
> learning algorithm afterwards which uses those ratios as one factor to
> separate coherent text from boilerplate (menus, navigation, copyright etc.).
>
> I had anticipated a reply along the lines of "read the documents into a
> DOM object and extract the text from that". That is also problematic
> since it is not fast enough given the size of the input (that is an
> assumption; I haven't benchmarked the FPC DOM implementation yet), and
> I don't see how I can calculate the text/markup ratio per line in a
> simple fashion when using a DOM implementation.
>
> I am *not* trying to clean or format simple or limited HTML on a string
> basis. For stuff like that, I wouldn't have asked. I actually wouldn't
> use Pascal for such tasks but rather sed or a Perl script at most.
>
> I would still highly appreciate further input.