[fpc-pascal]Word count function

James_Wilson at i2.com James_Wilson at i2.com
Mon Oct 1 16:34:00 CEST 2001


Gabor;

> It just happened that I needed this function myself, and I found
> two errors in my pseudo-code. Here it is, corrected:
> [snip]

Your algorithm is similar to what I had devised initially, but I was not
happy with the performance. In my case I was using it on files that could
(conceivably) be 20+ MB. Some of those files were taking 3-5 minutes for
the calculations (even on a 933 MHz PIII). That, I felt, was
unacceptable.

Basically, I had a const holding my 10 or 12 delimiter characters, and I
would then use the pos() function to see if each character of every string
was a delimiter. It seemed like a good idea at the time, but it was very
slow.
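
For illustration, here is a minimal sketch of that pos()-on-every-character
approach. It is not the original code: the program name, the function name
CountWordsWithPos, and the exact delimiter list are assumptions made up for
this example.

```pascal
program SlowWordCount;

// Hypothetical sketch: the names and the delimiter list are assumptions,
// not the original code.
const
  DELIMITERS = ' .,!?_-)}]>;:=@/\'#9;   // #9 is the tab character

function CountWordsWithPos (const S : string) : longint;
var
  Index  : longint;
  Words  : longint;
  InWord : boolean;
begin
  Words := 0;
  InWord := false;
  for Index := 1 to length (S) do
    // pos() scans the whole delimiter list for every single character,
    // which is what makes this approach slow on large inputs
    if pos (S [Index], DELIMITERS) <> 0 then
    begin
      if InWord then inc (Words);   // a delimiter closes the current word
      InWord := false;
    end
    else
      InWord := true;
  if InWord then inc (Words);       // count a word that runs to the end
  CountWordsWithPos := Words;
end;

begin
  writeln (CountWordsWithPos ('Hello, world. This is a test'));   // prints 6
end.
```

Every character, including the letters that make up the bulk of the text,
pays for a full scan of the delimiter list here, which is the cost the
faster version avoids.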

What I found to be faster was to do something like this:

for each character of the string check to see if it's an upper or lower 
case letter or number;
  if TRUE, keep counting;
  if FALSE, do pos() on the delimiters const
    if TRUE, it's the end of a word -- add to word count
    if FALSE, it's not a word -- keep looking

Here's an idea of what I had done...


Function GetWords (StringToCheck : string) : longint;

const
  // #9 is the tab character; it has to sit outside the quotes
  DELIMITERS = ' .,!?_-)}]>;:=@/\'#9;

var
  Index      : longint;
  LineLength : longint;
  WordStart  : longint;
  Words      : longint;


begin

  Words := 0;
  Index := 1;
  LineLength := length (StringToCheck);

  while Index <= LineLength do   // an empty string never enters the loop
  begin
    WordStart := Index;

    // skip all the "word" characters (letters and digits)
    while (Index <= LineLength) and
          (StringToCheck [Index] in ['a'..'z', 'A'..'Z', '0'..'9'])
          do inc (Index);

    // a run of word characters counts as a word if it is followed by a
    // delimiter or by the end of the string; double delims, like a period
    // followed by a space, are not counted as 2 words
    if (Index > WordStart) and
       ((Index > LineLength) or
        (pos (StringToCheck [Index], DELIMITERS) <> 0))
       then inc (Words);

    inc (Index);   // move index past the current non-word char
  end;


  GetWords := Words;

end;