[fpc-pascal] Re: stripping HTML

Sun Apr 17 14:08:25 CEST 2011

On 4/17/2011 11:00 AM, leledumbo wrote:
> http://www.festra.com/eng/snip12.htm
> Simple googling gives a lot of results, try: html strip (pascal OR delphi)

Thank you for your reply.

I feel I have to justify myself: I always do extensive web and list
archive searches before posting to a list (hence the infrequency of my
posts). I had actually found that snippet over a week ago but
immediately discarded it since it is obviously a toy solution. I have a
much better solution already using the PCRE library on a text stream,
sometimes re-reading portions of the stream by way of backtracking. The
problems with any approach like that (esp. 6-liners like the one linked
in your post, but also more elaborate but still makeshift regular
expression magic) are:

1. They don't handle faulty HTML well enough.

2. They don't handle any multi-line constructs like comments or scripts.
Depending on how naively you read the input (e.g., using
TStringList.LoadFromFile), they even choke on simple tags with all sorts
of line breaks in between, which are frequently found (and which are, to
my knowledge, not even ill-formed). What do you do with this (for a start)?

'<div class="al'#13#10#13'ert">'

3. They are potentially not the most efficient solution, which is an
important factor if the stripping alone takes days.
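
As an illustration of point 2, a character-level scanner that keeps its
state across line breaks could look roughly like the sketch below. The
names StripMarkup and TScanState are made up for this example. It only
knows about tags and comments, it does not treat script or style content
specially, it gets confused by a '>' inside an attribute value, and the
character-by-character string concatenation is not tuned for speed, so
it is a sketch of the idea and not a finished stripper.

```pascal
program StripSketch;
{$mode objfpc}{$H+}

type
  { Hypothetical scanner states for this sketch }
  TScanState = (ssText, ssTag, ssComment);

{ Removes tags and comments from S. The state is tracked per character,
  so a CR or LF inside a tag or comment does not end the markup. }
function StripMarkup(const S: string): string;
var
  State: TScanState;
  I: Integer;
begin
  Result := '';
  State := ssText;
  I := 1;
  while I <= Length(S) do
  begin
    case State of
      ssText:
        if Copy(S, I, 4) = '<!--' then
        begin
          State := ssComment;
          Inc(I, 3);            { skip the rest of the comment opener }
        end
        else if S[I] = '<' then
          State := ssTag
        else
          Result := Result + S[I];
      ssTag:
        if S[I] = '>' then
          State := ssText;
      ssComment:
        if Copy(S, I, 3) = '-->' then
        begin
          State := ssText;
          Inc(I, 2);            { skip the rest of the comment closer }
        end;
    end;
    Inc(I);
  end;
end;

begin
  { A tag and a comment that both contain line breaks }
  WriteLn(StripMarkup('<div class="al'#13#10'ert">Hello</div><!-- a'#10'b -->!'));
end.
```

With this input the program prints Hello! because the line breaks sit
inside markup and are dropped together with it.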

As a clarification: I am mining several ~500GB results of Heritrix
crawls containing all versions of XML, HTML, inline CSS, inline scripts,
etc. They need to be accurately stripped of HTML/XML markup (accurately means
without losing too much real text). The text/markup ratio has to be
calculated and stored on a per-line basis since I'm applying a machine
learning algorithm afterwards which uses those ratios as one factor to
separate coherent text from boilerplate (menus, navigation, copyright etc.).
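
To show what a per-line text/markup count means when tags span several
lines, here is a minimal sketch. CountLine and InTag are hypothetical
names, whitespace outside tags is counted as text (an assumption), and
comments and scripts are ignored to keep it short. The point is only
that the "inside a tag" state has to survive from one line to the next.

```pascal
program RatioSketch;
{$mode objfpc}{$H+}

uses
  SysUtils;

var
  InTag: Boolean = False;  { carried over from one line to the next }

{ Counts text and markup characters in one line. InTag persists between
  calls, so a tag opened on an earlier line is still counted as markup. }
procedure CountLine(const Line: string; out TextChars, MarkupChars: Integer);
var
  I: Integer;
begin
  TextChars := 0;
  MarkupChars := 0;
  for I := 1 to Length(Line) do
    if InTag then
    begin
      Inc(MarkupChars);
      if Line[I] = '>' then
        InTag := False;
    end
    else if Line[I] = '<' then
    begin
      InTag := True;
      Inc(MarkupChars);
    end
    else
      Inc(TextChars);
end;

const
  { Sample input: the first tag is split across two lines }
  Lines: array[0..2] of string = (
    '<p class="a"',
    '>Some text</p>',
    '<br>');
var
  I, T, M: Integer;
begin
  for I := Low(Lines) to High(Lines) do
  begin
    CountLine(Lines[I], T, M);
    WriteLn(Format('line %d: text=%d markup=%d', [I + 1, T, M]));
  end;
end.
```

For the second sample line this gives 9 text characters and 5 markup
characters, since the leading '>' still belongs to the tag opened on the
line before. The ratio itself can then be derived from the two counts in
whatever form the later processing expects.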

I had anticipated a reply along the lines of "read the documents into a
DOM object and extract the text from that". That is also problematic
since I assume it is not fast enough given the size of the input (I
haven't benchmarked the FPC DOM implementation yet), and I don't see how
I can calculate the text/markup ratio per line in a simple fashion when
using a DOM implementation.

I am *not* trying to clean or format simple or limited HTML on a string
basis. For stuff like that, I wouldn't have asked. I actually wouldn't
use Pascal for such tasks but rather sed or a Perl script at most.

I would still highly appreciate further input.
Regards
Roland