[fpc-pascal] Re: Text scan in text files - (was: Full text scan - PDF files)
md at delfire.net
Mon Nov 1 20:32:40 CET 2010
On Mon, Nov 1, 2010 at 4:02 PM, Tomas Hajny <XHajT03 at hajny.biz> wrote:
> On Mon, November 1, 2010 19:34, Marcos Douglas wrote:
>> On Mon, Nov 1, 2010 at 3:31 PM, Tomas Hajny <XHajT03 at hajny.biz> wrote:
>>> On Mon, November 1, 2010 19:10, Marco van de Voort wrote:
>>>> In our previous episode, Marcos Douglas said:
>>>>> <albertonarduzzi at yahoo.com> wrote:
>>>>> >> Can somebody help me, please?
>>>>> >> I need to search strings in Text files using just FPC.
>>>>> > how about reading every line and then using Pos() to see if some
>>>>> > string is there?
>>>>> I don't think this is a fast way to do it :(
>>>>> I have many PDF files with several pages each.
>>>> You'll be surprised. I've done multi million line logfiles that way. A
>>>> pdf2txt is infinitely slow compared with such processing.
>>> Well, there are at least two gotchas there. First, it's better to use a
>>> reasonable (= large enough) buffer size. Second, the simplest approach
>>> implying reading line by line and searching using Pos() obviously isn't
>>> sufficient for searching across line breaks, i.e. you either need to
>>> handle that yourself, or use some unit providing such functionality.
>> Which unit do you recommend?
> I don't have any specific one in mind, I just wanted to point out that you
> need to take care about that. Personally, I'd probably use my unit
> Buffered (http://www.volny.cz/xhajt03/buffered.zip) and handle the part
> related to spanning of text across lines myself if relevant (obviously,
> that depends on your use case), but I'm sure that there are more complete
> solutions for your needs readily available too; my comment was just to
> signal that "Pos" mentioned by Marco may not be sufficient by itself.
> BTW, to answer your other e-mail too - I don't know about solutions for
> searching within PDF files directly, I'd also use the external converter
OK, Tomas. Thanks for the information and the file (I'll take a look at it).
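
For reference, here is a minimal sketch of the approach discussed in this
thread: read the file line by line, search each line with Pos(), enlarge the
text buffer with SetTextBuf, and carry over the tail of the previous text so
that a match spanning a line break is not missed. It is only an illustration,
not code from anyone in the thread. The function name FindInFile, the 64 KiB
buffer size, and the decision to treat a line break as a single space are all
assumptions made for this example.

```pascal
program textscan;
{$mode objfpc}{$H+}

{ Hypothetical helper: returns the 1-based number of the line at which
  Needle was detected, or 0 if it was not found. }
function FindInFile(const FileName, Needle: string): Integer;
var
  F: TextFile;
  Buf: array[0..65535] of Byte;   // 64 KiB I/O buffer (size is an assumption)
  Line, Joined, Carry: string;
  LineNo, KeepLen: Integer;
begin
  Result := 0;
  if Needle = '' then
    Exit;
  // A match crossing a line break can reuse at most Length(Needle) - 1
  // characters of the text seen before the break.
  KeepLen := Length(Needle) - 1;
  Carry := '';
  LineNo := 0;
  AssignFile(F, FileName);
  SetTextBuf(F, Buf, SizeOf(Buf)); // install the buffer before any reading
  Reset(F);
  try
    while not Eof(F) do
    begin
      ReadLn(F, Line);
      Inc(LineNo);
      // Assumption: a line break counts as one space between words.
      if LineNo = 1 then
        Joined := Line
      else
        Joined := Carry + ' ' + Line;
      if Pos(Needle, Joined) > 0 then
      begin
        Result := LineNo;
        Exit;                      // the finally block still closes the file
      end;
      // Keep only the tail that could still start a cross-line match.
      if Length(Joined) > KeepLen then
        Carry := Copy(Joined, Length(Joined) - KeepLen + 1, KeepLen)
      else
        Carry := Joined;
    end;
  finally
    CloseFile(F);
  end;
end;

var
  FoundAt: Integer;
begin
  if ParamCount <> 2 then
  begin
    WriteLn('Usage: textscan <file> <text>');
    Halt(1);
  end;
  FoundAt := FindInFile(ParamStr(1), ParamStr(2));
  if FoundAt > 0 then
    WriteLn('Found at line ', FoundAt)
  else
    WriteLn('Not found');
end.
```

Because the carried-over tail is shorter than the search text, a match is
never reported from old text alone. Whether a line break should count as a
space, as nothing, or as a hyphenation point depends on the use case, as noted
above.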