[fpc-pascal] How create a full text search with TChmWriter?

Thu Feb 23 02:43:39 CET 2012

On 02/22/12 18:01, Mattias Gaertner wrote:

> 
> Yes, this helped. 
> Now the files are done in under a minute and with only 500MB. But then
> mem consumption goes up again. Then it goes down to 5GB and seems to
> be stuck in an endless loop. I cancelled it.
> 
> I tried with only 500 files and it worked. That means I get a help and
> it finds files. But choosing a page just shows black. And after that
> any page is black.
> Note: If I don't use the search but the Index, then I see the pages.

In lhelp, try commenting out chmcontentprovider.pas:1239:
TIpChmDataProvider(DataProvider).OnGetHtmlPage:=@LoadingHTMLStream;

This procedure (LoadingHTMLStream) tries to modify the loaded HTML page
to highlight the search terms in red.

> 
> I compiled the chm units with -Criot and found various range check
> errors and uninitialized variables which I can fix myself. But then I
> came to a point where don't know what to do:
> 
> chmfiftimain.pas(361,49) Warning: Constructing a class "TLeafNode" with abstract method "ChildIsFull"
> chmfiftimain.pas(72,15) Hint: Found abstract method: TFIftiNode.ChildIsFull(<TFIftiNode>,AnsiString,LongWord);
> 
> And I get an exception in:
> #5  0x00000000005552aa in CHILDISFULL (this=0x415906, 
>     AWORD=0x409d0c "\311\303f\220H\203\354(H\211\\$\bL\211d$\020L\211l$\030L\211t$ I\211\376I\211\365f\211\323L\211\350H\203", <incomplete sequence \370>, ANODEOFFSET=8234056) at chmfiftimain.pas:688
> #6  0x00000000005552aa in CHILDISFULL (this=0x7fffdd9a0b40, AWORD=0x0, ANODEOFFSET=3579960) at chmfiftimain.pas:688
> 
> 

A parent node is always a TIndexNode, so Parent.ChildIsFull is really
TIndexNode(Parent).ChildIsFull.

I guess the way to fix the warning is to add a TLeafNode.ChildIsFull
override that raises an exception if it is ever called, since it never
should be.
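
A minimal sketch of such an override, not the actual chmfiftimain.pas
code: the classes below are cut-down stand-ins, the parameter list is
taken from the compiler hint and the backtrace quoted above, and the
LastWord field and the message text are made up for illustration.

```pascal
program LeafGuard;
{$mode objfpc}{$H+}

uses
  SysUtils;

type
  // Stand-ins for the classes in chmfiftimain.pas; only the shape matters.
  TFiftiNode = class
  public
    procedure ChildIsFull(AWord: String; ANodeOffset: DWord); virtual; abstract;
  end;

  TIndexNode = class(TFiftiNode)
  public
    LastWord: String; // hypothetical field, just to give the method a body
    procedure ChildIsFull(AWord: String; ANodeOffset: DWord); override;
  end;

  TLeafNode = class(TFiftiNode)
  public
    procedure ChildIsFull(AWord: String; ANodeOffset: DWord); override;
  end;

procedure TIndexNode.ChildIsFull(AWord: String; ANodeOffset: DWord);
begin
  LastWord := AWord;
end;

// A leaf has no children, so reaching this method indicates a bug.
procedure TLeafNode.ChildIsFull(AWord: String; ANodeOffset: DWord);
begin
  raise Exception.Create('TLeafNode.ChildIsFull called for word "' + AWord +
    '" at offset ' + IntToStr(ANodeOffset));
end;

var
  Leaf: TFiftiNode;
begin
  Leaf := TLeafNode.Create;
  try
    try
      Leaf.ChildIsFull('example', 0);
    except
      on E: Exception do
        WriteLn('Caught: ', E.Message);
    end;
  finally
    Leaf.Free;
  end;
end.
```

With the override in place TLeafNode no longer has an abstract method,
so the warning goes away, and a stray call fails with a readable
message instead of an abstract method error.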

I notice in frame #6 that AWord = nil. AFAIK that shouldn't be the case.

Each entry in a TIndexNode is the last word added to one of its child
nodes.

The basic overview is that each word is written to a TLeafNode. When the
leaf node is full, it tells its parent node the last word it wrote, then
writes its data to the final stream and is ready to be filled again.

The index nodes work the same way. Every time one is full, it tells its
parent the last word it wrote and writes its data to the final stream.

There is always one root index node that contains the last word of each
child node below it.

The index nodes only have information to find leaf nodes. The leaf node
has the actual data in it.
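
The fill-and-flush behaviour described above can be modelled roughly as
follows. This is a simplified sketch, not the chmfiftimain.pas
implementation: TSketchNode, NodeCapacity and the word list are
hypothetical, a node is "full" after a fixed number of entries rather
than by size, one class plays both the leaf and the index role, and
"writing to the final stream" is just a WriteLn.

```pascal
program FiftiSketch;
{$mode objfpc}{$H+}

const
  NodeCapacity = 3; // hypothetical; chosen small so the tiers show up quickly

type
  TSketchNode = class
  private
    FParent: TSketchNode;
    FEntries: array of String;
    FLevel: Integer; // 0 = leaf tier, 1 and up = index tiers
    procedure Flush;
  public
    constructor Create(ALevel: Integer);
    destructor Destroy; override;
    procedure Add(const AWord: String);
  end;

constructor TSketchNode.Create(ALevel: Integer);
begin
  inherited Create;
  FLevel := ALevel;
end;

destructor TSketchNode.Destroy;
begin
  FParent.Free; // each node owns the chain of parents above it
  inherited Destroy;
end;

procedure TSketchNode.Add(const AWord: String);
begin
  SetLength(FEntries, Length(FEntries) + 1);
  FEntries[High(FEntries)] := AWord;
  if Length(FEntries) = NodeCapacity then
    Flush;
end;

procedure TSketchNode.Flush;
var
  LastWord: String;
begin
  LastWord := FEntries[High(FEntries)];
  // Tell the parent the last word written, creating a new top tier if needed.
  if FParent = nil then
    FParent := TSketchNode.Create(FLevel + 1);
  FParent.Add(LastWord);
  // Stand-in for writing this node's data to the final stream.
  WriteLn('level ', FLevel, ' node written, last word: ', LastWord);
  // Ready to be filled again.
  SetLength(FEntries, 0);
end;

const
  Words: array[0..8] of String = ('ant', 'bee', 'cat', 'dog', 'eel',
    'fox', 'gnu', 'hen', 'owl');
var
  Leaf: TSketchNode;
  I: Integer;
begin
  Leaf := TSketchNode.Create(0);
  try
    for I := Low(Words) to High(Words) do
      Leaf.Add(Words[I]);
  finally
    Leaf.Free;
  end;
end.
```

Every third word fills the leaf, and every third full leaf fills the
index node above it, which in turn reports to a node one tier higher.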

Each tier grows exponentially.

So 10000 leaf nodes probably require only 3 tiers (2 index levels and a
leaf level).

IndexNode
IndexNode IndexNode ...
IndexNode IndexNode IndexNode IndexNode...
Leaf Leaf Leaf Leaf Leaf Leaf Leaf Leaf Leaf Leaf Leaf Leaf Leaf ....
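
As a rough check of the tier estimate, the count can be computed for a
given fan-out. The fan-out of 100 entries per index node used below is
purely an assumption for illustration; the real figure depends on how
many entries fit into a node. TiersNeeded is a hypothetical helper, not
part of the chm units.

```pascal
program TierCount;
{$mode objfpc}{$H+}

// Tiers needed (leaf tier included) when each index node can point at up
// to AFanOut children. There is always at least one index tier: the root.
function TiersNeeded(ALeafCount, AFanOut: Integer): Integer;
var
  Nodes: Integer;
begin
  Result := 0;
  if (ALeafCount < 1) or (AFanOut < 2) then
    Exit; // nonsensical input, nothing to compute
  Result := 1; // the leaf tier
  Nodes := ALeafCount;
  repeat
    // Number of nodes in the tier above, rounded up.
    Nodes := (Nodes + AFanOut - 1) div AFanOut;
    Inc(Result);
  until Nodes = 1;
end;

begin
  // 10000 leaves -> 100 index nodes -> 1 root index node
  WriteLn(TiersNeeded(10000, 100)); // prints 3
end.
```

With that assumed fan-out, 10000 leaves need 100 index nodes and one
root above them, i.e. 2 index levels and a leaf level.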

Anyway I need to think about this more...

Regards,

Andrew