[fpc-pascal]Word count function

Gabor DEAK JAHN djg at tramontana.co.hu
Tue Oct 2 00:19:56 CEST 2001


At 10/1/01 10:34 AM, you wrote:

James,

 > Your algorithm is similar to what I had devised initially, but I was not
 > happy with the performance. In my case I was using it on files that could
 > (conceivably) be 20+ meg. Some of those files were taking 3-5 minutes for
 > the calculations (even with a 933mhz PIII). That, I felt, was unacceptable.

There is something definitely fishy around here, no implementation can be
faster than a single run across the whole text file... :-)) The reason for
your performance problem, as you have already noted, was not the algorithm
itself but using Pos. Pos is a rather expensive string operation, each time
you call it, it has to iterate through your string of delimiters. The set is
practically equivalent to a table lookup: just imagine an array[#0..#255] of
Boolean. Before you start, you fill this array with False, then put True
into the places you want to call a delimiter, for instance:

array[#32] := true; array [#9] := true;

Now, checking for a delimiter is as simple and fast as:

if array[c] then ...

Still, my algorithm is basically a little bit faster. The important idea is
to make sure you only check each character of the input stream once (and
this is the theoretical optimum, you cannot find all words without checking
all characters at least once but you don't need more than one check for each
character to learn whether it is a delimiter or not).

You first check with

   CurrentChar := StringToCheck [Index];

then, later, you check the next character using

   if pos (StringToCheck [succ (Index)],DELIMITERS) <> 0 then

However, if it's not a delimiter, you don't increment Index and you'll
re-check the very same input character in the next iteration.

But yes, you're right, this is hair splitting... :-))) However, it can
become quite important in some cases, not necessarily because of the speed,
but in some applications where you receive the input characters one by one
and you cannot go back to the previous character any more.


Bye,
    Gábor

-------------------------------------------------------------------
Gabor DEAK JAHN -- Budapest, Hungary.
E-mail: djg at tramontana.co.hu





More information about the fpc-pascal mailing list