[fpc-pascal] fast text processing
Bee
bisma at brawijaya.ac.id
Wed Oct 31 08:41:43 CET 2007
> Or even better, give a clear problem description.
TASKS:
First, count the number of words in the document.
Second, count the number of unique words in the document.
INPUT:
The document uses an HTML-like format for storing articles. Here's
the format:
<DOC> (contains an article)
|- <DOCID> (contains article's ID)
|- <TITLE> (contains article's title text)
|- <TEXT> (contains article's content text)
CRITERIA:
The criteria for a "word" are:
- an alphabetic (a..z) character sequence separated by whitespace or
hyphenation characters (space, tab, return, minus).
- a character sequence that contains a non-alphabetic character is NOT
considered a word; ignore it.
- only text inside the <TITLE> and <TEXT> tags is counted; ignore
anything inside <DOCID>.
The unique-word criterion is case-insensitive, so "word" and "Word" are
considered the same word.
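The word criteria above could be sketched as a small tokenizer. This is
only an illustration in Python, not the TStrings program mentioned in this
post. The names SEPARATORS, EDGE_PUNCTUATION and extract_words are made up
for the sketch, and stripping punctuation from the edges of a token is an
assumption: the criteria do not state it.

```python
import re

# Separators named in the criteria: space, tab, return/newline, minus.
SEPARATORS = re.compile(r"[ \t\r\n-]+")

# Assumption (not stated in the criteria): punctuation at the edges of a
# token is dropped before the token is checked.
EDGE_PUNCTUATION = ".,;:!?\"'()"


def extract_words(text):
    """Return the lower-cased words found in text, in order."""
    words = []
    for token in SEPARATORS.split(text):
        token = token.strip(EDGE_PUNCTUATION)
        # Only plain a..z / A..Z sequences count; anything else is ignored.
        if re.fullmatch(r"[A-Za-z]+", token):
            words.append(token.lower())
    return words
```

Lower-casing each word as it is collected makes the unique count a matter
of putting the results into a set.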
EXAMPLE:
<DOC>
<DOCID>TEMPO-022904-111</DOCID>
<TITLE>This is the article title.</TITLE>
<TEXT>
This is The article-content 123abc.
</TEXT>
</DOC>
would give the result:
Number of words = 10
Number of unique words = 6
The 10 words are 5 words from inside the TITLE tag and 5 words from inside
the TEXT tag. The unique words are: "this", "is", "the", "article",
"title.", and "content". "123abc." is ignored since it contains numbers.
The document may contain more than one article. Counting is done across
all articles found.
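As a rough sketch of the whole task, counting over every article in a
document might look like the Python below. It is an illustration only, not
the TStrings program this post refers to. REGION and count_words are
hypothetical names. It assumes the tags are written in upper case exactly
as shown, and that punctuation at the edges of a token is stripped before
the alphabetic check, which the criteria do not state.

```python
import re

# Text inside <TITLE>...</TITLE> and <TEXT>...</TEXT>. <DOCID> is never
# matched, so its content is ignored. Assumes upper-case tags as shown.
REGION = re.compile(r"<(TITLE|TEXT)>(.*?)</\1>", re.DOTALL)


def count_words(document):
    """Return (number of words, number of unique words) for all articles."""
    total = 0
    unique = set()
    for match in REGION.finditer(document):
        # Split on space, tab, return/newline and minus.
        for token in re.split(r"[ \t\r\n-]+", match.group(2)):
            # Assumption: edge punctuation is dropped before checking.
            token = token.strip(".,;:!?\"'()")
            if re.fullmatch(r"[A-Za-z]+", token):
                total += 1
                unique.add(token.lower())  # case-insensitive uniqueness
    return total, len(unique)


SAMPLE = """<DOC>
<DOCID>TEMPO-022904-111</DOCID>
<TITLE>This is the article title.</TITLE>
<TEXT>
This is The article-content 123abc.
</TEXT>
</DOC>
"""

print(count_words(SAMPLE))      # (10, 6)
print(count_words(SAMPLE * 2))  # (20, 6): two articles, same vocabulary
```

Keeping one running total and one set across all matches is what makes
the counts span every article rather than restart per <DOC>.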
My program (using TStrings) requires about 0.4s to process the given
document. :( Tested on the same machine using FPC v.2.0.4.
-Bee-
has Bee.ography at:
http://beeography.wordpress.com