[fpc-pascal] Re: stripping HTML

Roland Schäfer roland.schaefer at fu-berlin.de
Sun Apr 17 16:11:46 CEST 2011

Thanks a lot for your reply.

On 4/17/2011 3:46 PM, Ralf Junker wrote:
> HTML is not meant to be handled on a line-by-line basis as other
> text-based formats. According to the specs, HTML is not line-based.
> Browsers should display the following two HTML snippets identically:
> As such, a line-based text/markup ratio does not make much sense IMHO,
> especially since browsers do strip line breaks in most text elements
> except within <pre> ... </pre>.

This is sort of off-topic, so I'll make it short: Yes, that is a problem
we are aware of. However, experiments with even simple thresholds
("remove lines with less than 50% text") were reasonably successful, and
simple machine learning improves the results further. To avoid true paragraph detection
(which would be desirable but costly given the TB-sized input) we are
also experimenting with several line-based and non-line-based windows on
the input and cumulative html/text ratios for those windows. Also, this
is only stage one of the cleanup, and we run some more linguistically
informed and costly steps on the already much smaller amounts of data.
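
For illustration, the line-based threshold and the windowed ratio described
above could be sketched roughly as follows. This is not the actual pipeline:
it is a minimal Python sketch in which the function names, the regex-based
tag matching, and the window size are all assumptions, and tags that span a
line break are not handled.

```python
import re

# Assumption: a tag is anything between "<" and ">" on a single line.
# Real-world HTML (comments, scripts, multi-line tags) needs a real tokenizer.
TAG_RE = re.compile(r"<[^>]*>")


def text_ratio(line: str) -> float:
    """Fraction of non-whitespace characters in `line` that lie outside tags."""
    total = len("".join(line.split()))
    if total == 0:
        return 0.0  # empty or whitespace-only lines count as "no text"
    text = len("".join(TAG_RE.sub("", line).split()))
    return text / total


def filter_lines(lines, threshold=0.5):
    """Keep only lines whose text share is at least `threshold`."""
    return [ln for ln in lines if text_ratio(ln) >= threshold]


def windowed_ratios(lines, size=3):
    """Text ratio computed over each run of `size` consecutive lines."""
    return [
        text_ratio(" ".join(lines[i:i + size]))
        for i in range(max(len(lines) - size + 1, 0))
    ]


if __name__ == "__main__":
    sample = [
        '<div class="nav"><a href="/">Home</a></div>',
        "This is a plain sentence with <b>one</b> bold word.",
        "",
    ]
    for kept in filter_lines(sample):
        print(kept)
```

With the hypothetical 50% threshold, the navigation line and the empty line
are dropped and the sentence is kept. The windowed variant smooths over
single markup-heavy lines inside otherwise text-rich regions.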

Maybe I'll give paragraph detection based on <p>, <div> etc. another
try, but we actually decided against that a while ago because we lost
huge amounts of valuable input due to non-use or very creative use of
such elements in actual web pages.

> That said, I believe that DIHtmlParser should care for most of your needs:

Yes, that looks perfect. I wouldn't even have a problem with the license
or with paying for it, and I even still have D7. However, my program has
to run on our Debian 64-bit servers.

