[fpc-pascal] Text scan in text files - (was: Full text scan - PDF files)

Marcos Douglas md at delfire.net
Tue Nov 2 16:45:46 CET 2010


On Tue, Nov 2, 2010 at 11:38 AM, José Mejuto <joshyfun at gmail.com> wrote:
> Hello FPC-Pascal,
>
> Tuesday, November 2, 2010, 11:02:18 AM, you wrote:
>
> TH> If I understand it correctly, this assumes reading the whole file into
> TH> memory at once. Depending on the size of that file and other conditions,
> TH> this may or may not be advisable...
>
> Yes, and a pdf2text conversion will reduce the PDF file to about 1% of
> its original size, so unless you handle 10-gigabyte PDFs there should
> be no problem loading the whole file into memory.
>
> I doubt that there will be memory problems, as running pdf2text will
> surely consume more memory than the size of the resulting file.
>
> Of course, if you end up with 300-megabyte text files, then a
> different approach would be needed: a buffer with a window larger
> than the size of the searched text.
>
> Also, the logic will be different depending on whether you want to
> match one word, several words, long sentences, sequences of chars, etc.

I need to search for several words, so I can't use the Pos function to
search for each word separately.
To be fast, my algorithm needs to read each word (token) just once.
I'll define the separators between tokens, such as <space>, comma, "/",
"\", <enter>, etc. For each token found, I'll search for a combination
in my lists of words. If I find a match, I need to know on which page
the token was found...
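
The single-pass idea described above could be sketched roughly as
follows. This is only an illustration under stated assumptions, not a
finished implementation: the names ScanText, Separators and PageBreak
are hypothetical, the separator set is just an example, and the page
counter assumes that the text converter writes a form feed (#12)
between pages. If the converter in use marks pages differently, the
page logic has to change.

```pascal
program TokenScan;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

const
  // Hypothetical separator set; extend it to match your own token rules.
  Separators = [' ', ',', '/', '\', #9, #10, #13];
  // Assumption: the converter emits a form feed between pages.
  PageBreak = #12;

// Reads Content once, character by character. Every token that is
// present in Words is appended to Hits as 'token@page'.
procedure ScanText(const Content: string; Words, Hits: TStrings);
var
  i, Start, Page: Integer;
  Token: string;
begin
  Page := 1;
  Start := 1;
  // Run one position past the end so the last token is flushed too.
  for i := 1 to Length(Content) + 1 do
    if (i > Length(Content)) or (Content[i] in Separators)
       or (Content[i] = PageBreak) then
    begin
      if i > Start then
      begin
        Token := Copy(Content, Start, i - Start);
        if Words.IndexOf(Token) >= 0 then
          Hits.Add(Token + '@' + IntToStr(Page));
      end;
      // A page break ends the current token and starts the next page.
      if (i <= Length(Content)) and (Content[i] = PageBreak) then
        Inc(Page);
      Start := i + 1;
    end;
end;

var
  Words, Hits: TStringList;
  k: Integer;
begin
  Words := TStringList.Create;
  Hits := TStringList.Create;
  try
    Words.Add('pascal');
    Words.Add('compiler');
    // On a sorted TStringList, IndexOf uses a binary search.
    Words.Sorted := True;
    ScanText('free pascal,rocks' + PageBreak + 'a compiler/linker',
      Words, Hits);
    for k := 0 to Hits.Count - 1 do
      WriteLn(Hits[k]);
  finally
    Hits.Free;
    Words.Free;
  end;
end.
```

Note that TStringList compares case-insensitively unless CaseSensitive
is set to True, and that this sketch only matches single tokens. A
combination of several consecutive words would need extra state, for
example remembering the previous token(s) between lookups.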

Marcos Douglas


