[fpc-pascal] Re: stripping HTML

Sun Apr 17 15:46:15 CEST 2011

Unlike many other text-based formats, HTML is not meant to be handled
line by line. According to the specs, line breaks are just white space.
Browsers should display the following two HTML snippets identically:

  <p#13#10>one#13#10two</p>

and

  <p>one#13#10two</p#13#10>

With the HTML tags removed and white space collapsed, both result in:

  one two
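
To make this concrete, here is a minimal sketch of a linear scan that
drops the tags and collapses white space. StripTags is a hypothetical
helper written for this mail, not part of DIHtmlParser, and it is
deliberately naive: it assumes no comments, no SCRIPT or STYLE content,
no entities, and no '>' inside attribute values.

```pascal
program StripDemo;

{$mode objfpc}{$H+}

// Hypothetical helper: removes everything between '<' and '>' and
// collapses each run of white space (tab, LF, CR, space) into a
// single space. Leading and trailing white space is dropped.
function StripTags(const S: string): string;
var
  i: Integer;
  InTag, PendingSpace: Boolean;
begin
  Result := '';
  InTag := False;
  PendingSpace := False;
  for i := 1 to Length(S) do
    if InTag then
    begin
      // Inside a tag: skip everything, line breaks included.
      if S[i] = '>' then
        InTag := False;
    end
    else if S[i] = '<' then
      InTag := True
    else if S[i] in [#9, #10, #13, ' '] then
      PendingSpace := True
    else
    begin
      // Emit one space per white space run, but not at the start.
      if PendingSpace and (Result <> '') then
        Result := Result + ' ';
      PendingSpace := False;
      Result := Result + S[i];
    end;
end;

begin
  // The two snippets from above, with #13#10 as real line breaks.
  WriteLn(StripTags('<p'#13#10'>one'#13#10'two</p>'));
  WriteLn(StripTags('<p>one'#13#10'two</p'#13#10'>'));
end.
```

Both calls print "one two". The position of the line breaks in the
source has no influence on the extracted text.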

As such, a line-based text/markup ratio does not make much sense IMHO,
especially since browsers collapse line breaks into plain white space
in most text elements except within <pre> ... </pre>.
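
If you need per-line numbers anyway, the following sketch shows why
they depend on the source layout and not on the content. LineCounts is
a hypothetical helper (again not DIHtmlParser code) that carries the
"inside a tag" state across line breaks and counts text and markup
characters for each source line. It assumes LF or CR LF line ends and
the same naive notion of a tag: everything from '<' to '>'.

```pascal
program LineRatioDemo;

{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: returns one "text/markup" pair of character
// counts per source line, separated by spaces. CR and LF are not
// counted. The tag state survives line breaks, so tags may span lines.
function LineCounts(const S: string): string;
var
  i, TextChars, MarkupChars: Integer;
  InTag: Boolean;
  Report: string;

  procedure FlushLine;
  begin
    if Report <> '' then
      Report := Report + ' ';
    Report := Report + IntToStr(TextChars) + '/' + IntToStr(MarkupChars);
    TextChars := 0;
    MarkupChars := 0;
  end;

begin
  Report := '';
  InTag := False;
  TextChars := 0;
  MarkupChars := 0;
  for i := 1 to Length(S) do
    case S[i] of
      #13: ;            // ignored, the LF ends the line
      #10: FlushLine;
      '<':
        begin
          InTag := True;
          Inc(MarkupChars);
        end;
      '>':
        begin
          // A stray '>' outside a tag is counted as markup here.
          InTag := False;
          Inc(MarkupChars);
        end;
    else
      if InTag then
        Inc(MarkupChars)
      else
        Inc(TextChars);
    end;
  FlushLine;            // last line, which has no trailing line break
  Result := Report;
end;

begin
  WriteLn(LineCounts('<p'#13#10'>one'#13#10'two</p>'));
  WriteLn(LineCounts('<p>one'#13#10'two</p'#13#10'>'));
end.
```

The first snippet yields "0/2 3/1 3/4", the second "3/3 3/3 0/1",
although both render identically. Whatever consumes such ratios is
therefore measuring how the author wrapped the source, at least in
part.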

That said, I believe that DIHtmlParser should take care of most of your needs:

  http://yunqa.de/delphi/doku.php/products/htmlparser/index

DIHtmlParser meets most of your requirements:

  * Not DOM based, very fast.

  * Hand-crafted, linear-scan Unicode HTML parser.

  * Handles SCRIPTs and STYLEs well.

  * Simple "Extract Text" demo included, may be modified as needed.

Drawbacks:

  * Like HTML, DIHtmlParser is not line-based. An option is available
    to strip or preserve line breaks and white space.

  * Pre-compiled units available for Delphi only. The source code is
    required to compile with FreePascal.

Ralf

On 17.04.2011 14:08, Roland Schäfer wrote:

> I feel I have to justify myself: I always do extensive web and list
> archive searches before posting to a list (hence the infrequency of my
> posts). I had actually found that snippet over a week ago but
> immediately discarded it since it is obviously a toy solution. I have a
> much better solution already using the PCRE library on a text stream,
> sometimes re-reading portions of the stream by way of backtracking. The
> problems with any approach like that (esp. 6-liners like the one linked
> in your post, but also more elaborate but still makeshift regular
> expression magic) are:
> 
> 1. They don't handle faulty HTML well enough.
> 
> 2. They don't handle any multi-line constructs like comments or scripts.
> Depending on how naively you read the input (e.g., using
> TStringList.LoadFromFile), they even choke on simple tags with all sorts
> of line breaks in between, which are frequently found (and which are, to
> my knowledge, not even ill-formed). What do you do with this (for a start)?
> 
> '<div class="al#13#10#13ert">'
> 
> 3. They are potentially not the most efficient solution, which is an
> important factor if the stripping alone takes days.
> 
> As a clarification: I am mining several ~500GB results of Heritrix
> crawls containing all versions of XML, HTML, inline CSS, inline scripts,
> etc. They need to be accurately stripped of HTML/XML markup (accurately
> means without losing too much real text). The text/markup ratio has to be
> calculated and stored on a per-line basis since I'm applying a machine
> learning algorithm afterwards which uses those ratios as one factor to
> separate coherent text from boilerplate (menus, navigation, copyright etc.).
> 
> I had anticipated a reply along the lines of "read the documents into a
> DOM object and extract the text from that". That is also problematic
> since it is not fast enough given the size of the input (That is an
> assumption; I haven't benchmarked the FPC DOM implementation yet.), and
> I don't see how I can calculate the text/markup ratio per line in a
> simple fashion when using a DOM implementation.
> 
> I am *not* trying to clean or format simple or limited HTML on a string
> basis. For stuff like that, I wouldn't have asked. I actually wouldn't
> use Pascal for such tasks but rather sed or a Perl script at most.
> 
> I would still highly appreciate further input.