[fpc-pascal] Example: regular expressions and "hash-tables"

S. Fisher expandafter at yahoo.com
Thu Mar 21 02:14:59 CET 2013


Not actually a hash-table, but an AvgLvlTree, which can be used the
same way.  The AvgLvlTree unit comes with Lazarus; if you don't have
that, you can download avglvltree.pas here:

http://svn.freepascal.org/cgi-bin/viewvc.cgi/trunk/components/lazutils/avglvltree.pas?root=lazarus&view=log

There doesn't seem to be a lot of information available on using
regular expressions and hash-tables in Free Pascal, so I decided to
post this.

The program reads a text file and counts the number of unique words,
and also displays the number of times the most common word was found.

Since TStringToPointerTree actually maps strings to pointers, I wrote
some tiny utility functions to handle the casting from integers to
pointers and back.

Although I was worried that this tree-structure would be slower than a
hash-table, it turned out to be quite fast.



{$mode objfpc}

uses
  AvgLvlTree,
  regexpr,
  classes,
  dateutils,
  Sysutils;

const
  file_to_process = 'Bartlett--Quotations.txt';

type
  // string-integer table
  t_si_table = TStringToPointerTree;
  t_si_item = PStringToPointerItem;

procedure si_table_put( table: t_si_table; k: string; v: longint );
begin
  table[ k ] := pointer( v )
end;
function si_table_get( table : t_si_table; key : string ): longint;
begin
  exit( longint( table[ key ] ) )
end;
procedure si_table_inc( table: t_si_table; k: string );
begin
  table[ k ] := pointer( si_table_get( table, k ) + 1 ) ;
end;
function si_table_key( item : t_si_item ): string;
begin
  exit( item^.name )
end;
function si_table_val( item : t_si_item ): longint;
begin
  exit( longint( item^.value ) )
end;
function si_table_new( sensitive : boolean ): t_si_table;
begin
  exit( t_si_table.create( sensitive ) )
end;

var
  table : t_si_table ;
  table_item : t_si_item ;
  line, word : string;
  re : TRegExpr;
  max_count, count : int32;
  lines : TStringList;
  time_1 : TDateTime;

begin
  time_1 := time;

  re := TRegExpr.Create;
  re.Expression := '(?i)[a-z]+' ;  // case-insensitive

  table := si_table_new( true ) ; // true to make it case-sensitive.

  lines := TStringList.Create;
  lines.LoadFromFile( file_to_process );
  for line in lines do  begin
    if re.exec( line ) then  begin
      si_table_inc( table, re.match[0] );
      while re.execNext do
        si_table_inc( table, re.match[0] );
    end;
  end;
  lines.free;

  writeln( 'Unique word count: ', table.count );
  max_count := 0;
  for table_item in table do  begin
    count := si_table_val( table_item );
    if count > max_count then  begin
      word := si_table_key( table_item );
      max_count := count
    end;
  end;
  writeln( 'Commonest word: ', word, ' (', max_count, ')' );
  writeln( 'Elapsed time: ',
           milliSecondsBetween( time_1, time ),
           ' milliseconds.' );

  table.free;
  re.free
end.



Processing the file "Bartlett--Quotations.txt", which is about
319 kilobytes in size, produces this output:

Unique word count: 8911
Commonest word: the (2200)
Elapsed time: 94 milliseconds.


Documentation for TRegExpr can be found here:
http://regexpstudio.com/TRegExpr/Help/tregexpr_interface.html





More information about the fpc-pascal mailing list