[fpc-pascal] TFPGObjectList error

Marco van de Voort marcov at stack.nl
Mon Jul 2 21:08:11 CEST 2018


In our previous episode, Ryan Joseph said:
> > 
> > And to what page would this script then point when you find something ? If it was easy, it would have been done already.
> > 
> > But, here's your shot at contributing :)
> 
> The easiest thing would be to read the .chm file and search that like a
> .chm reader but the file is 3.4M so every request to search would require
> opening and reading the file (is that a deal breaker for a web server?). 

You can read the core CHM structures (which are maybe a few MB), extract
files only when needed, and then cache them, depending on RAM settings. CHM
was designed for this: it avoids decompressing everything just to show a
single page.

The CHMs together are 24MB, so add about 10MB for the compressed core
structures, plus 100-200MB to dynamically cache frequently requested pages
(like the tocs and sysutils). Giving the server app 256MB would go a long way.
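
To illustrate the "extract only when needed, then cache" idea, here is a
minimal sketch of a page cache bounded by a byte budget. It is written in
Python for brevity and is not taken from any existing CHM code; the `extract`
callback (path -> bytes) and the `PageCache` name are assumptions standing in
for whatever CHM reader the server would really use.

```python
from collections import OrderedDict


class PageCache:
    """LRU cache bounded by total bytes rather than by entry count."""

    def __init__(self, budget_bytes, extract):
        self.budget = budget_bytes
        self.extract = extract      # hypothetical callback: path -> bytes
        self.pages = OrderedDict()  # path -> decompressed page
        self.used = 0               # bytes currently held

    def get(self, path):
        if path in self.pages:
            self.pages.move_to_end(path)  # mark as most recently used
            return self.pages[path]
        data = self.extract(path)         # decompress on first request only
        self.pages[path] = data
        self.used += len(data)
        # Evict least recently used pages until we fit the budget again,
        # but always keep the page that was just requested.
        while self.used > self.budget and len(self.pages) > 1:
            _, old = self.pages.popitem(last=False)
            self.used -= len(old)
        return data
```

With a budget of, say, 150 * 1024 * 1024 bytes, hot pages such as the tocs
would stay resident while rarely requested ones get dropped.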

But while the index (which is more or less a text search on topic titles)
is easy to search, I'm not sure we have read support for the fulltext
search. Possibly we only generate and write the tables, and never read them
back to do an actual search (but maybe this is in lhelp).

But the Lazarus and textmode IDE "F1" key is based on the indexes, not on
the fulltext search. The indexes are about 15-20MB as XML, but only a third
of that as binary BTree (that form deduplicates strings).
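
As a rough illustration of why a title index is easy to search, the sketch
below does an F1-style prefix lookup over a sorted in-memory list using
binary search. It does not parse the real BTree format; the (title, url)
pairs and the function names are made up for the example.

```python
import bisect


def build_index(entries):
    """entries: iterable of (title, url); returns a case-folded sorted list."""
    return sorted((title.lower(), title, url) for title, url in entries)


def lookup(index, prefix, limit=10):
    """Return up to `limit` (title, url) pairs whose title starts with prefix."""
    key = prefix.lower()
    # (key,) sorts before every (key..., title, url) entry sharing that prefix
    i = bisect.bisect_left(index, (key,))
    hits = []
    while i < len(index) and len(hits) < limit and index[i][0].startswith(key):
        hits.append(index[i][1:])
        i += 1
    return hits
```

Each lookup costs a binary search plus the number of hits returned, so even
an index with tens of thousands of topics needs no database behind it.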

> Another option is reading the .chm file once and build a SQL database from
> all the class/method names.

I don't think this really is a case for SQL. A fastcgi server that loads the
files on startup, and checks whether they have been updated once per 24hrs,
would be enough. Fastcgi, since it would be stateful.
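
The "load on startup, recheck once per 24hrs" part could look roughly like
the sketch below. It is Python rather than an actual fastcgi app, and the
`load` callback (path -> parsed structures), the `ChmStore` name and the
injectable clock are assumptions made for the example.

```python
import os
import time

CHECK_INTERVAL = 24 * 60 * 60  # seconds between update checks


class ChmStore:
    """Keeps parsed files in memory and reloads them when they change."""

    def __init__(self, paths, load, clock=time.time):
        self.paths = list(paths)
        self.load = load      # hypothetical callback: path -> parsed data
        self.clock = clock
        self.loaded = {}      # path -> (mtime, data)
        self.last_check = None
        self.refresh()        # load everything once at startup

    def refresh(self):
        for path in self.paths:
            mtime = os.path.getmtime(path)
            cached = self.loaded.get(path)
            if cached is None or cached[0] != mtime:
                self.loaded[path] = (mtime, self.load(path))
        self.last_check = self.clock()

    def get(self, path):
        # Only touch the filesystem when the last check is a day old.
        if self.clock() - self.last_check >= CHECK_INTERVAL:
            self.refresh()
        return self.loaded[path][1]
```

Because the process stays alive between requests, the loaded structures
survive from one request to the next, which is the stateful part.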

These are the core structures of the LCL.chm (14MB compressed html, 20000 html files):


compressed     offset  uncompr size  name                    // meaning
         1  193170819        548926  /#STRINGS               // strings; index and binary toc strings
                                                             // are stored here, deduplicated
         0        112           149  /#SYSTEM                // settings
         1  174179574        118460  /#TOCIDX                // binary form of the toc, also
                                                             // available (but larger) as xml
         1  190311922       1001152  /#TOPICS                // <topic title, urltbl entry> map
         1  191313074       1106149  /#URLSTR                // urls with urltbl reference
         1  192419223        751596  /#URLTBL                // urltbl, topics lookup table
                                                             // for bidirectional lookup
         1  193719745       3407168  /$FIftiMain             // fulltextsearch data
         1  174176823          2751  /$OBJINST
         1  175356422       3549260  /$WWKeywordLinks/BTree  // index in binary format


The TOC is the giant page with a treeview of all topics (more than one,
even: by unit and alphabetically). It is the treeview on the first tab if
you open the chm in Windows.

So that is about 12MB of uncompressed indexes and metadata for everything.
The uncompressed indexes and metadata are meant to be searched directly.

The LCL.chm is about 60% of the size of the current chms, so that would
extrapolate to all indexes being about 10/6*12=20MB. Probably less, because
the LCL has little content and an enormous number of topics.
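
For anyone who wants to redo the arithmetic, here is a small Python sketch
that adds up the uncompressed sizes from the listing and applies the same
10/6 scaling. The listed entries sum to roughly 10.5 million bytes, so the
12MB above is a rounded-up ballpark; the `extrapolate` helper and the 0.6
share are just the rough estimate from this mail, not measured values.

```python
# Uncompressed sizes in bytes, copied from the LCL.chm listing.
sizes = {
    "/#STRINGS": 548926,
    "/#SYSTEM": 149,
    "/#TOCIDX": 118460,
    "/#TOPICS": 1001152,
    "/#URLSTR": 1106149,
    "/#URLTBL": 751596,
    "/$FIftiMain": 3407168,
    "/$OBJINST": 2751,
    "/$WWKeywordLinks/BTree": 3549260,
}

total = sum(sizes.values())


def extrapolate(size, share=0.6):
    """Scale a figure for LCL.chm up to all chms, assuming LCL is `share`."""
    return size / share


print(f"listed structures: {total / 1e6:.1f} MB")
print(f"all chms, from the listing: {extrapolate(total) / 1e6:.1f} MB")
print(f"all chms, from the 12MB ballpark: {extrapolate(12):.0f} MB")
```

Either way the result stays within a couple of tens of MB, which is small
next to the 256MB suggested for the server app.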



