[fpc-pascal] Word count function
Florian Klaempfl
florian at klaempfl.de
Mon Oct 1 17:37:18 CEST 2001
At 10:34 01.10.01 -0400, you wrote:
>Gabor;
>
> > It just happened that I needed this function myself, and I found
> > two errors in my pseudo-code. Here it is, corrected:
> > [snip]
>
>Your algorithm is similar to what I had devised initially, but I was not
>happy with the performance. In my case I was using it on files that could
>(conceivably) be 20+ meg. Some of those files were taking 3-5 minutes for
>the calculations (even with a 933mhz PIII). That, I felt, was unacceptable.
>
>Basically, I had a const that was my 10 or 12 delimiter characters and I
>would then use the pos() function to see if each character of every string
>was a delimiter. Seemed like a good idea at the time, but it was very slow.
>
>What I found to be faster was to do something like this:
>
>for each character of the string check to see if it's an upper or lower
>case letter or number;
> if TRUE, keep counting;
> if FALSE, do pos() on the delimiters const
> if TRUE, it's the end of a word -- add to word count
> if FALSE, it's not a word -- keep looking
>
>Here's an idea of what I had done...
>
>
>Function GetWords (StringToCheck : string) : longint;
>
>const
> DELIMITERS = ' .,!?_-)}]>;:=@/\#9';
Replace the string with a set; this is much faster:
const
  DELIMITERS : set of char = [' ','.',',','!','?','_','-',')','}',']','>',
    ';',':','=','@','/','\',#9];
>var
> Index : longint;
> LineLength : longint;
> Loop : longint;
> Words : longint;
> CurrentChar : char;
>
>
>begin
>
> Words := 0;
> Index := 0;
> LineLength := length (StringToCheck);
> if LineLength <> 0 then // don't check empty strings
You can drop the if: the while condition Index < LineLength is already
false when LineLength is 0, because Index is set to 0 above.
> while Index < LineLength do
> begin
> inc (Index);
> CurrentChar := StringToCheck [Index];
>
> while (Index < LineLength) and ((CurrentChar >= 'a') and
> (CurrentChar <= 'z')) and
> ((CurrentChar >= 'A') and (CurrentChar <= 'Z')) and
> ((CurrentChar >= '0') and (CurrentChar <= '9'))
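Using
while (Index < LineLength) and (CurrentChar in ['A'..'Z','a'..'z','0'..'9']) do
might be faster as well. It also tests what was presumably intended: the
three range checks above are joined with "and", and no character can lie
in all three ranges at once.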
> do inc (Index); // skip all the "word" characters
>
> // don't count double delims, like a period followed by a space,
> // as 2 words
if StringToCheck [succ (Index)] in DELIMITERS then
should beat the pos call easily
>
> begin
> inc (Words);
> inc (Index); // move index past current delim char
> end;
> end;
>
>
> GetWords := Words;
>
>end;
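
Putting the suggestions above together, here is a minimal sketch of a set-based word counter. It is not the original function: the name CountWords, the WordChars constant, and the rule that a run of letters or digits counts as a word only when it ends at a delimiter or at the end of the string are my assumptions about what was intended. Unlike the quoted loop, it re-reads the character at each step instead of caching it in CurrentChar.

```pascal
program WordCountSketch;

{$mode objfpc}{$H+}

const
  { Delimiter set as suggested above; a typed constant, so "in" is a
    cheap set-membership test instead of a Pos() search. }
  Delimiters : set of char = [' ','.',',','!','?','_','-',')','}',']','>',
    ';',':','=','@','/','\',#9];
  { Hypothetical name for the "word" characters: letters and digits. }
  WordChars = ['A'..'Z', 'a'..'z', '0'..'9'];

{ Hypothetical helper, not the original GetWords. Assumption: a run of
  word characters counts as a word only if it is followed by a
  delimiter or reaches the end of the string. }
function CountWords(const S: string): LongInt;
var
  Index, Len: LongInt;
begin
  Result := 0;
  Index := 1;
  Len := Length(S);
  { No separate empty-string check: the loop does not run when Len = 0. }
  while Index <= Len do
  begin
    if S[Index] in WordChars then
    begin
      { Skip the whole run, re-reading the character each time. }
      while (Index <= Len) and (S[Index] in WordChars) do
        Inc(Index);
      { Index now points just past the run. }
      if (Index > Len) or (S[Index] in Delimiters) then
        Inc(Result);
    end
    else
      Inc(Index);  { delimiter or other non-word character: move on }
  end;
end;

begin
  WriteLn(CountWords('Hello, world. This is FPC!'));  { prints 5 }
  WriteLn(CountWords(''));                            { prints 0 }
end.
```

If every non-word character should separate words, the Delimiters test can simply be dropped and each run counted unconditionally.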