[fpc-pascal] Searching Text Files

Martin Collins mailinglists at collins-email.co.uk
Wed Mar 19 11:51:37 CET 2014


Thanks for your advice & help Mark,

Typical: after spending ages searching, I finally found a couple of 
Pascal solutions this morning once I googled with the word Delphi 
instead of freepascal/lazarus. I then checked and, of course, Free Pascal 
has the same functionality.

The first is reading the file into a string list and then using Pos to 
see if the search term exists. This is fairly simple and may meet my 
requirements.
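
A minimal sketch of the string list approach, assuming the PDF has already 
been converted to a text file. The file name 'textfile.txt', the search 
term and the function name CountMatchingLines are placeholders for 
illustration only. Note that it counts lines containing the term anywhere, 
which is not quite the same test as the awk one-liner quoted further down.

```pascal
program CountMatches;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Counts the lines of a text file that contain SearchTerm.
// Pos is case-sensitive, so 'Foo' will not match 'foo'.
function CountMatchingLines(const FileName, SearchTerm: string): Integer;
var
  Lines: TStringList;
  i: Integer;
begin
  Result := 0;
  Lines := TStringList.Create;
  try
    Lines.LoadFromFile(FileName);
    for i := 0 to Lines.Count - 1 do
      if Pos(SearchTerm, Lines[i]) > 0 then
        Inc(Result);
  finally
    Lines.Free;
  end;
end;

begin
  // Placeholder file name and search term.
  WriteLn(CountMatchingLines('textfile.txt', 'searchstring'));
end.
```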

The second is SearchBuf (in the StrUtils unit), which appears to require 
an understanding of pointers - I've only just got my head around creating 
my own classes so I may need to study a little more for this one!
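
For what it's worth, the pointer handling needed for SearchBuf can be kept 
quite small. The following is only a sketch, assuming the whole file fits 
comfortably in memory; the file name and search term are placeholders. 
SearchBuf returns a pointer to the match (or nil), and subtracting the 
start of the buffer from it gives the byte offset.

```pascal
program SearchBufDemo;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, StrUtils;

var
  Lines: TStringList;
  Buffer: string;
  Start, Found: PChar;
begin
  // Load the whole file into one string (placeholder file name).
  Lines := TStringList.Create;
  try
    Lines.LoadFromFile('textfile.txt');
    Buffer := Lines.Text;
  finally
    Lines.Free;
  end;

  Start := PChar(Buffer);
  // Search forward from the start of the buffer. soMatchCase makes the
  // search case-sensitive; leave it out for a case-insensitive search.
  Found := SearchBuf(Start, Length(Buffer), 0, 0, 'searchstring',
    [soDown, soMatchCase]);

  if Found <> nil then
    WriteLn('Found at byte offset ', Found - Start)
  else
    WriteLn('Not found');
end.
```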

However, both still require converting the PDF file to text first using 
pdftotext. So I'll keep on looking for a pure Pascal solution.
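
Until then, the conversion and the search can at least be done in one 
program without an intermediate text file. A sketch, assuming pdftotext is 
installed and on the PATH, and that the FPC version in use provides 
RunCommand in the Process unit; 'document.pdf' and the search term are 
placeholders. Passing '-' as the output file makes pdftotext write to 
standard output.

```pascal
program PdfTextSearch;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, Process;

var
  PdfText: string;
  Lines: TStringList;
  i: Integer;
begin
  // Run pdftotext and capture its standard output in PdfText.
  if not RunCommand('pdftotext', ['document.pdf', '-'], PdfText) then
  begin
    WriteLn('pdftotext failed or could not be started');
    Halt(1);
  end;

  Lines := TStringList.Create;
  try
    // Assigning to Text splits the captured output into lines.
    Lines.Text := PdfText;
    for i := 0 to Lines.Count - 1 do
      if Pos('searchstring', Lines[i]) > 0 then
        WriteLn(i + 1, ': ', Lines[i]);
  finally
    Lines.Free;
  end;
end.
```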

Thanks & best regards,

Martin Collins

Martin Collins wrote:
> Hi,
>
> I'm writing a little personal program in Lazarus that manages pdf 
> files. One of the things I want to do is search for text/phrases 
> within the pdfs. Has anybody tried to do this before and if so what is 
> the best (easiest) way you've come across?
>
> I've detailed what I've been doing below, but this is for background 
> information, as after messing about with it for a couple of days I am 
> not so sure this is the most sensible way to go about this even if I 
> can get it to work. The awk count command detailed below was just me 
> trying out a proof of concept and for the real search I was planning 
> on it being slightly more sophisticated, but failed at the first hurdle!
>
> I will appreciate your opinions and experiences please. Many thanks.
>
> Martin Collins
>
> Free Pascal Compiler version 2.6.2-5 [2013/07/25] for x86_64
> Lazarus SVN 1.3
> Awk - GNU Awk 4.0.1
>
>
> I'm using Linux and have access to all the opensource goodies that 
> offers. I Googled for a pure pascal solution and did not find 
> anything. So over the last couple of days I have been experimenting with 
> pdftotext and then awk on the text files, both executed through TProcess.

One minor warning for background. On a given distro, the PDF-related 
utilities usually use a single underlying library. So if you come across 
a situation where you're having problems extracting content, it can be 
more useful to look at a system upgrade than spending time trying to 
hack in updated versions of utilities such as pdftotext.

> Working on the bash command line awk is fine, but it seems to play up 
> when executed through TProcess. I think it's an awk (or stupid me) 
> problem rather than a TProcess one (note: I am an awk novice and not an 
> experienced programmer in general!).
>
> The bash command line awk instruction (to count the number of search 
> string instances) -
>
> awk '$1 ~ /searchstring/ {++c} END {print c}' FS=: textfile.txt
>
> In a simple pascal program to replicate the above, this works;
>
>     ...
>     aString := 'awk ''$1 ~ /searchstring/ {++c} END {print c}'' FS=: textfile.txt';
>     AProcess.CommandLine := aString;
>     AProcess.Execute;

I'd suggest reading the file back into a stringlist, and manipulating it 
in Pascal. There might be efficiency problems if you're dealing with 
/really/ big files, but that way you'll be able to move forward and 
backward in the file if you want context, have a chance at handling 
UTF-8 properly and so on.
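
As a rough illustration of the "context" point, here is a sketch that 
prints each matching line together with its neighbours. The file name, 
search term and the ContextLines constant are placeholders chosen for the 
example, not anything from the original program.

```pascal
program ShowContext;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, Math;

const
  ContextLines = 1; // lines shown before and after each match

var
  Lines: TStringList;
  i, j, FirstLine, LastLine: Integer;
begin
  Lines := TStringList.Create;
  try
    Lines.LoadFromFile('textfile.txt');
    for i := 0 to Lines.Count - 1 do
      if Pos('searchstring', Lines[i]) > 0 then
      begin
        // Clamp the context window to the bounds of the list.
        FirstLine := Max(0, i - ContextLines);
        LastLine := Min(Lines.Count - 1, i + ContextLines);
        for j := FirstLine to LastLine do
          WriteLn(j + 1, ': ', Lines[j]);
        WriteLn('--');
      end;
  finally
    Lines.Free;
  end;
end.
```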

AWK was all very well when it was the only tool available, and I'm 
generally defensive of Perl. But if the main program is already written 
in Pascal you might as well use it for the text handling as well.

-- 
Mark Morgan Lloyd
markMLl .AT. telemetry.co .DOT. uk

[Opinions above are the author's, not those of his employers or colleagues]