[fpc-pascal] How create a full text search with TChmWriter?
Andrew Haines
AndrewD207 at aol.com
Wed Feb 22 04:25:08 CET 2012
On 02/21/12 17:40, Mattias Gaertner wrote:
> On Tue, 21 Feb 2012 16:08:43 +0100 (CET)
> Mattias Gaertner <nc-gaertnma at netcologne.de> wrote:
>
>>
>> Andrew Haines <andrewd207 at aol.com> wrote on 21 February 2012 at 15:24:
>> [...]
>>> Your chm file should not be bigger than the uncompressed files
>>> unless you are writing only
>>> a couple of tiny html files.
>>
>> I have a few thousand html files, about 10k on average.
>
> Ok, found it. The file extension was wrong.
> Fixing that and testing with 3 pages I get a Search. \O/
>
:)
> But it only finds whole words. :-
> And clicking on a page gives a black page in lhelp. :(
Whole-word matching follows from how the words are indexed. It would be
fairly easy to match a partial word against the beginning of an indexed
word. Beyond that, if you want to find "here" in "there", you would have
to dump the search index and create a second search index -> ugly
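
To illustrate the prefix idea, here is a minimal sketch in Python. It is not the chm package's code or API: the sorted word list and the function name prefix_matches are assumptions made up for the example. It only shows why matching the beginning of an indexed word is cheap when the words are kept sorted, and why "here" still does not find "there".

```python
from bisect import bisect_left

def prefix_matches(sorted_words, prefix):
    """Return every word in the sorted list that starts with `prefix`.

    Hypothetical helper for illustration only; a real CHM full text
    index stores its words differently.
    """
    # Binary search finds the first word that is >= prefix.
    i = bisect_left(sorted_words, prefix)
    matches = []
    # All words sharing the prefix sit next to each other in sorted order.
    while i < len(sorted_words) and sorted_words[i].startswith(prefix):
        matches.append(sorted_words[i])
        i += 1
    return matches

index_words = sorted(["here", "help", "hello", "there", "thesis"])
print(prefix_matches(index_words, "he"))    # ['hello', 'help', 'here']
print(prefix_matches(index_words, "here"))  # ['here'], "there" is not found
```

Finding "here" inside "there" would need a different structure (for example an index of word suffixes), which is the second search index mentioned above.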
>
> Processing all files required 12 minutes and terrifying 4GB ram. :(
> Then comes some final part and it needed 9GB. I only have 8 so it
> became very slow. :(
"terrifying 4GB ram." :) That made me laugh :)
Does generating the search index change the memory usage significantly?
> Then it went down to 5GB.
> Finally it crashed with an AV, just like with the LCL chm.
> And I have no chm.
>
> Maybe some 64bit issue?
I had no crash.
I made an artificial chm file that contained the same file under a
different name 4000 times.
The html file was 13k bytes x 4000 (around 52 MB).
The chm was 2.9 MB.
(I enabled LZX_USE_THREADS in chmwriter.pas)
time project1
real 10m50.497s
user 36m9.276s
sys 0m3.459s
According to top I used ~320 MB of memory.
I guess my chm does not have enough unique words, and this is why the
memory usage is so low.
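
For reference, the figures above work out as follows. This is only arithmetic on the numbers already quoted (13k bytes x 4000 files, a 2.9 MB chm, and the time output), written as a small Python sketch; the variable names are made up for the example, and "k" is taken as 1000 bytes.

```python
# Input size: 4000 copies of a 13k html file, in MB (1 MB = 1000 k here).
input_mb = 13 * 4000 / 1000
# Compression ratio against the 2.9 MB chm.
ratio = input_mb / 2.9

# Times from the `time project1` output, converted to seconds.
real_s = 10 * 60 + 50.497
user_s = 36 * 60 + 9.276
# user/real above 1 means several cores were busy (LZX_USE_THREADS).
cores_busy = user_s / real_s

print(f"input: {input_mb:.0f} MB")          # input: 52 MB
print(f"ratio: {ratio:.1f} : 1")            # ratio: 17.9 : 1
print(f"user/real: {cores_busy:.2f}")       # user/real: 3.33
```

The ratio is this high because every file is identical, so it says little about a real set of pages.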