[fpc-pascal] Re: Text scan in text files - (was: Full text scan - PDF files)

Tomas Hajny XHajT03 at hajny.biz
Mon Nov 1 20:02:19 CET 2010


On Mon, November 1, 2010 19:34, Marcos Douglas wrote:
> On Mon, Nov 1, 2010 at 3:31 PM, Tomas Hajny <XHajT03 at hajny.biz> wrote:
>> On Mon, November 1, 2010 19:10, Marco van de Voort wrote:
>>> In our previous episode, Marcos Douglas said:
>>>> <albertonarduzzi at yahoo.com> wrote:
>>>> >> Can somebody help me please?
>>>> >> I need to search strings in Text files using just FPC.
>>>> >
>>>> > how about reading every line and then using Pos() to see if some
>>>> > string is there?
>>>> >
>>>>
>>>> I don't think this is the fast way :(
>>>> I have many PDF files with several pages each.
>>>
>>> You'll be surprised. I've done multi million line logfiles that way. A
>>> pdf2txt is infinitely slow compared with such processing.
>>
>> Well, there are at least two gotchas there. First, it's better to use a
>> reasonable (= large enough) buffer size. Second, the simplest approach
>> of reading line by line and searching with Pos() obviously isn't
>> sufficient for searching across line breaks, i.e. you either need to
>> handle that yourself, or use some unit providing such functionality.
>
> Which unit do you recommend?

I don't have a specific one in mind; I just wanted to point out that you
need to take care of that. Personally, I'd probably use my unit
Buffered (http://www.volny.cz/xhajt03/buffered.zip) and, if relevant,
handle text spanning across lines myself (obviously, that depends on your
use case), but I'm sure that more complete solutions for your needs are
readily available too; my comment was just meant to point out that "Pos",
as mentioned by Marco, may not be sufficient by itself.
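
To illustrate both points (a larger buffer, and matches that span line
breaks), here is a minimal sketch using only standard System routines. It
is not the Buffered unit, whose interface is not shown here. The names
FileContains and BufSize, the 64 KB size, and the rule that a line break
counts as exactly one space are all assumptions made for the example.

```pascal
program ScanText;

{$mode objfpc}{$H+}

const
  BufSize = 64 * 1024;  // assumed buffer size; tune it for your files

// Hypothetical helper: True if Needle occurs in the text file, also when
// the occurrence is split by line breaks (each break counts as one space).
function FileContains(const FileName, Needle: string): Boolean;
var
  F: Text;
  Buf: array[0..BufSize - 1] of Byte;
  Tail, Cur, Window: string;
  Keep: Integer;
begin
  Result := False;
  if Needle = '' then
    Exit;
  Keep := Length(Needle) - 1;  // longest partial match worth carrying over
  Tail := '';
  Assign(F, FileName);
  SetTextBuf(F, Buf, SizeOf(Buf));  // replace the small default text buffer
  Reset(F);
  try
    while not Eof(F) do
    begin
      ReadLn(F, Cur);
      Window := Tail + Cur;
      if Pos(Needle, Window) > 0 then
      begin
        Result := True;
        Exit;  // the finally block still closes the file
      end;
      // Carry the end of the text seen so far, plus a space for the break.
      Window := Window + ' ';
      if Length(Window) > Keep then
        Tail := Copy(Window, Length(Window) - Keep + 1, Keep)
      else
        Tail := Window;
    end;
  finally
    Close(F);
  end;
end;

begin
  if ParamCount < 2 then
    WriteLn('Usage: scantext <file> <text>')
  else if FileContains(ParamStr(1), ParamStr(2)) then
    WriteLn('found')
  else
    WriteLn('not found');
end.
```

Carrying over only the last Length(Needle) - 1 characters keeps memory use
small even for very long files. Trailing blanks at the end of a line or
words hyphenated across lines would still defeat this simple rule, so
whether it is good enough depends on the use case.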

BTW, to answer your other e-mail too: I don't know of any solutions for
searching within PDF files directly; I'd also use an external converter
myself.

Tomas




