[fpc-pascal] How create a full text search with TChmWriter?

Andrew Haines AndrewD207 at aol.com
Wed Feb 22 04:25:08 CET 2012


On 02/21/12 17:40, Mattias Gaertner wrote:
> On Tue, 21 Feb 2012 16:08:43 +0100 (CET)
> Mattias Gaertner <nc-gaertnma at netcologne.de> wrote:
> 
>>
>> Andrew Haines <andrewd207 at aol.com> wrote on 21 February 2012 at
>> 15:24:
>> [...]
>>> Your chm file should not be bigger than the uncompressed files
>>> unless you are writing only a couple of tiny html files.
>>
>> I have a few thousand html files, about 10k on average.
> 
> Ok, found it. The file extension was wrong.
> Fixing that and testing with 3 pages I get a Search. \O/
> 

:)

> But it only finds whole words. :-
> And clicking on a page gives a black page in lhelp. :(

Only whole words match because that is how the words are indexed. It
would be fairly easy to match a partial word against the beginning of an
indexed word. Beyond that, if you want to find "here" in "there", you
would have to dump the search index and create a second search index
-> ugly
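
A minimal sketch of the prefix idea, assuming the indexed words can be
treated as a plain list of strings. FindByPrefix and WordIndex are
hypothetical names used only for illustration; they are not part of the
chm package, and the real search index is stored differently:

```pascal
program PrefixMatchSketch;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

{ Hypothetical helper, not part of the chm package: collects every entry
  of AIndex that begins with APrefix, ignoring case. AIndex is a plain
  word list standing in for the real full text search index. }
procedure FindByPrefix(AIndex: TStrings; const APrefix: String;
  AMatches: TStrings);
var
  i: Integer;
  Needle: String;
begin
  AMatches.Clear;
  Needle := LowerCase(APrefix);
  if Needle = '' then
    Exit; // an empty query would otherwise match every word
  for i := 0 to AIndex.Count - 1 do
    if Copy(LowerCase(AIndex[i]), 1, Length(Needle)) = Needle then
      AMatches.Add(AIndex[i]);
end;

var
  WordIndex, Matches: TStringList;
begin
  WordIndex := TStringList.Create;
  Matches := TStringList.Create;
  try
    WordIndex.Add('here');
    WordIndex.Add('help');
    WordIndex.Add('there');
    FindByPrefix(WordIndex, 'he', Matches);
    // "here" and "help" begin with "he"; "there" only contains it,
    // so it is not reported.
    WriteLn(Matches.CommaText);
  finally
    Matches.Free;
    WordIndex.Free;
  end;
end.
```

The linear scan is only there to keep the sketch short. It also shows
why a match inside a word ("here" in "there") is a different problem: a
prefix test never looks past the start of each indexed word.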
> 
> Processing all files required 12 minutes and terrifying 4GB ram. :(
> Then comes some final part and it needed 9GB. I only have 8 so it
> became very slow. :(

"terrifying 4GB ram." :) That made me laugh :)

Does generating a search index change the memory usage that much?

> Then it went down to 5GB.
> Finally it crashed with an AV, just like with the LCL chm.
> And I have no chm.
> 
> Maybe some 64bit issue?

I had no crash.

I made an artificial chm file that contained the same file with a
different name 4000 times.

The html file was 13 KB x 4000 (around 52 MB).

The chm was 2.9 MB.

(I enabled LZX_USE_THREADS in chmwriter.pas)

time project1

real	10m50.497s
user	36m9.276s
sys	0m3.459s


According to top I used ~320 MB of memory.

I guess my chm does not have enough unique words and this is why the
memory usage is so low.



