[fpc-pascal] Word count function

Florian Klaempfl Florian.Klaempfl at gmx.de
Mon Oct 1 17:36:31 CEST 2001


At 10:34 01.10.01 -0400, you wrote:

>Gabor;
>
> > It just happened that I needed this function myself, and I found
> > two errors in my pseudo-code. Here it is, corrected:
> > [snip]
>
>Your algorithm is similar to what I had devised initially, but I was not 
>happy with the performance. In my case I was using it on files that could 
>(conceivably) be 20+ MB. Some of those files were taking 3-5 minutes for 
>the calculations (even with a 933 MHz PIII). That, I felt, was unacceptable.
>
>Basically, I had a const that was my 10 or 12 delimiter characters and I 
>would then use the pos() function to see if each character of every string 
>was a delimiter. Seemed like a good idea at the time, but it was very slow.
>
>What I found to be faster was to do something like this:
>
>for each character of the string check to see if it's an upper or lower 
>case letter or number;
>  if TRUE, keep counting;
>  if FALSE, do pos() on the delimiters const
>    if TRUE, it's the end of a word -- add to word count
>    if FALSE, it's not a word -- keep looking
>
>Here's an idea of what I had done...
>
>
>Function GetWords (StringToCheck : string) : longint;
>
>const
>  DELIMITERS = ' .,!?_-)}]>;:=@/\#9';

Replace the string with a set; this is much faster:
const
  DELIMITERS : set of char = [' ', '.', ',', '!', '?', '_', '-', ')', '}', ']',
                              '>', ';', ':', '=', '@', '/', '\', #9];
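
A small sketch of the set test, in case it helps. Testing a character against a set of char is a single membership check, while pos() has to scan the delimiter string on every call. Note also that inside the quoted string, #9 is just the two characters '#' and '9'; written outside the quotes, as in the set, it is the tab character. The helper name IsDelimiter is made up for this example and is not from the original code.

```pascal
program DelimSetDemo;

const
  { Same delimiters as above; #9 (tab) is written outside the quotes. }
  DELIMITERS : set of char =
    [' ', '.', ',', '!', '?', '_', '-', ')', '}', ']', '>',
     ';', ':', '=', '@', '/', '\', #9];

{ Hypothetical helper, not part of the original post. }
function IsDelimiter (c : char) : boolean;
begin
  IsDelimiter := c in DELIMITERS;
end;

begin
  writeln (IsDelimiter (' '));   { space is in the set }
  writeln (IsDelimiter (#9));    { tab is in the set }
  writeln (IsDelimiter ('a'));   { a letter is not }
end.
```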


>var
>  Index       : longint;
>  LineLength  : longint;
>  Loop        : longint;
>  Words       : longint;
>  CurrentChar : char;
>
>
>begin
>
>   Words := 0;
>  Index := 0;
>  LineLength := length (StringToCheck);

>   if LineLength <> 0 then   // don't check empty strings

You can drop the if: the loop condition Index < LineLength is already
false when LineLength is 0, because Index is set to 0 above, so the
while loop is simply never entered.


>     while Index < LineLength do
>     begin
>       inc (Index);
>       CurrentChar := StringToCheck [Index];
>
>        while (Index < LineLength) and ((CurrentChar >= 'a') and (CurrentChar <= 'z')) and
>             ((CurrentChar >= 'A') and (CurrentChar <= 'Z')) and
>             ((CurrentChar >= '0') and (CurrentChar <= '9'))

Using

while (Index < LineLength) and (CurrentChar in ['A'..'Z','a'..'z','0'..'9']) do

might be faster as well
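
To illustrate the difference, here is a rough sketch comparing the two tests. The quoted condition joins the three range checks with "and", so as written it can never be true (no character is a lower-case letter, an upper-case letter and a digit at the same time); the set form expresses "any of these ranges". The names WORD_CHARS, IsWordChar and QuotedTest are assumptions made for this example only.

```pascal
program WordCharDemo;

const
  { Letters and digits, as in the set test suggested above. }
  WORD_CHARS : set of char = ['A'..'Z', 'a'..'z', '0'..'9'];

{ Hypothetical helper: the set-based test. }
function IsWordChar (c : char) : boolean;
begin
  IsWordChar := c in WORD_CHARS;
end;

{ The quoted condition, copied as written: ranges joined with "and". }
function QuotedTest (c : char) : boolean;
begin
  QuotedTest := ((c >= 'a') and (c <= 'z')) and
                ((c >= 'A') and (c <= 'Z')) and
                ((c >= '0') and (c <= '9'));
end;

begin
  { Each line prints the set test first, then the quoted test. }
  writeln (IsWordChar ('a'), ' ', QuotedTest ('a'));
  writeln (IsWordChar ('7'), ' ', QuotedTest ('7'));
  writeln (IsWordChar ('.'), ' ', QuotedTest ('.'));
end.
```

Inside the inner loop, CurrentChar would also have to be re-read after each inc (Index); otherwise the test keeps looking at the same character.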

>             do inc (Index);   // skip all the "word" characters
>
>        // don't count double delims, like a period followed by a space,
>        // as 2 words

        if StringToCheck [succ (Index)] in DELIMITERS then

>
>       begin
>         inc (Words);
>         inc (Index);   // move index past current delim char
>       end;
>     end;
>
>
>   GetWords := Words;
>
>end;
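
Putting the suggestions together, one possible version might look like the sketch below. It is not the original poster's code. It assumes that a word is a run of letters and digits that ends either at one of the delimiters or at the end of the string, and the names CountWords and WORD_CHARS are invented for this example. No separate check for an empty string is needed, since the outer loop is never entered in that case.

```pascal
program WordCountSketch;

const
  DELIMITERS : set of char =
    [' ', '.', ',', '!', '?', '_', '-', ')', '}', ']', '>',
     ';', ':', '=', '@', '/', '\', #9];
  WORD_CHARS : set of char = ['A'..'Z', 'a'..'z', '0'..'9'];

{ Hypothetical function: counts runs of word characters that end at a
  delimiter or at the end of the string. }
function CountWords (const s : string) : longint;
var
  Index, Len, Words : longint;
begin
  Words := 0;
  Index := 1;
  Len := length (s);
  while Index <= Len do
  begin
    if s[Index] in WORD_CHARS then
    begin
      { Skip the whole run of word characters. This relies on the
        default short-circuit boolean evaluation. }
      while (Index <= Len) and (s[Index] in WORD_CHARS) do
        inc (Index);
      { Count the run once, however many delimiters follow it. }
      if (Index > Len) or (s[Index] in DELIMITERS) then
        inc (Words);
    end
    else
      inc (Index);   { delimiter or other character: keep looking }
  end;
  CountWords := Words;
end;

begin
  writeln (CountWords ('Hello, world. Foo'));   { prints 3 }
  writeln (CountWords (''));                    { prints 0 }
end.
```

For large files this would be called once per line read, with the per-line counts added up.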





