[fpc-pascal] fast text processing

Wed Oct 31 08:41:43 CET 2007

> Or even better, give a clear problem description.

TASKS:

First, count the number of words in the document.
Second, count the number of unique words in the document.

INPUT:

The document uses an HTML-like format for storing articles. Here's 
the format:

<DOC> (contains an article)
  |- <DOCID> (contains article's ID)
  |- <TITLE> (contains article's title text)
  |- <TEXT> (contains article's content text)
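
As a rough illustration of that layout, here is a minimal sketch that splits 
a document into one record per article. It is written in Python for brevity 
and is not the TStrings program mentioned in this post. The names DOC, field 
and parse_articles are made up for the example, and it assumes the tags are 
upper-case, carry no attributes, and are never nested.

```python
import re

# Assumption: every article is wrapped in a literal <DOC>...</DOC> pair.
DOC = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)

def field(article, tag):
    """Return the text between <tag> and </tag>, or '' if the tag is absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", article, re.DOTALL)
    return match.group(1).strip() if match else ""

def parse_articles(document):
    """Split a document into one dict per <DOC> block."""
    return [
        {
            "docid": field(article, "DOCID"),
            "title": field(article, "TITLE"),
            "text": field(article, "TEXT"),
        }
        for article in DOC.findall(document)
    ]

# Hypothetical sample data, not taken from the real input file.
sample = (
    "<DOC>\n<DOCID>A-1</DOCID>\n<TITLE>First title</TITLE>\n"
    "<TEXT>\nSome body text.\n</TEXT>\n</DOC>\n"
)
print(parse_articles(sample))
# [{'docid': 'A-1', 'title': 'First title', 'text': 'Some body text.'}]
```

Keeping DOCID as its own field makes it easy to leave it out of the counting 
later on.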

CRITERIA:

The criteria for a "word" are:
- an alphabetic (a..z) character sequence separated by whitespace or 
hyphenation characters (space, tab, return, minus).
- a character sequence that contains a non-alphabetic character is NOT 
considered a word; ignore it.
- only text inside the <TITLE> and <TEXT> tags is counted; ignore anything 
inside <DOCID>.

Unique words are counted case-insensitively, so "word" and "Word" are 
considered the same word.
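
The word rules above could be sketched roughly like this (Python, only as a 
language-neutral reference, not the FPC code). The helper name words_in is 
made up for the example. One assumption goes beyond the written criteria: 
the example in this post counts "title." as a word, so the sketch drops 
punctuation at the edges of a token before applying the alphabetic test.

```python
import re
import string

# Separators from the criteria: whitespace (space, tab, return) and minus.
SEPARATORS = re.compile(r"[\s\-]+")

def words_in(text):
    """Yield the lower-cased words found in text."""
    for token in SEPARATORS.split(text):
        # Assumption: punctuation at the edges of a token is dropped first.
        token = token.strip(string.punctuation)
        # Only pure a..z / A..Z sequences count; anything else is ignored.
        if re.fullmatch(r"[A-Za-z]+", token):
            yield token.lower()

found = list(words_in("Word word well-known 123abc x2"))
print(found)                         # ['word', 'word', 'well', 'known']
print(len(found), len(set(found)))   # 4 3
```

Lower-casing each word before it goes into the set is what makes "word" and 
"Word" collapse into a single unique entry.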

EXAMPLE:

<DOC>
<DOCID>TEMPO-022904-111</DOCID>
<TITLE>This is the article title.</TITLE>
<TEXT>
This is The article-content 123abc.
</TEXT>
</DOC>

would give this result:

Number of words = 10
Number of unique words = 6

The 10 words are 5 from inside the TITLE tag and 5 from inside the TEXT 
tag. The unique words are: "this", "is", "the", "article", "title.", and 
"content". "123abc." is ignored since it contains numbers.

The document may contain more than one article. Counting is done across 
all articles found.
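
To double-check the expected numbers, here is a small reference sketch of 
the whole count in Python. It is not the TStrings program and says nothing 
about speed. The names FIELD and count_words are made up for the example. 
It assumes upper-case tags without attributes, and it assumes punctuation 
at the edges of a token is dropped before the alphabetic test, which is what 
makes "title." count while "123abc." is still ignored.

```python
import re
import string

# Only <TITLE> and <TEXT> bodies are counted; <DOCID> never matches.
FIELD = re.compile(r"<(TITLE|TEXT)>(.*?)</\1>", re.DOTALL)

def count_words(document):
    """Return (number of words, number of unique words) over all articles."""
    total = 0
    unique = set()
    for _tag, body in FIELD.findall(document):
        # Split on whitespace and minus, as in the criteria.
        for token in re.split(r"[\s\-]+", body):
            # Assumption: edge punctuation is stripped before testing.
            token = token.strip(string.punctuation)
            if re.fullmatch(r"[A-Za-z]+", token):
                total += 1
                unique.add(token.lower())  # case-insensitive uniqueness
    return total, len(unique)

EXAMPLE = """<DOC>
<DOCID>TEMPO-022904-111</DOCID>
<TITLE>This is the article title.</TITLE>
<TEXT>
This is The article-content 123abc.
</TEXT>
</DOC>
"""

print(count_words(EXAMPLE))            # (10, 6)
print(count_words(EXAMPLE + EXAMPLE))  # (20, 6)
```

The second call shows the multi-article case: the word total adds up over 
both articles, while the set of unique words is shared between them.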

My program (using TStrings) requires about 0.4s to process the given 
document. :( Tested on the same machine using FPC v.2.0.4.

-Bee-

has Bee.ography at:
http://beeography.wordpress.com