[fpc-pascal] TFPGObjectList error
Marco van de Voort
marcov at stack.nl
Mon Jul 2 21:08:11 CEST 2018
In our previous episode, Ryan Joseph said:
> >
> > And to what page would this script then point when you find something ? If it was easy, it would have been done already.
> >
> > But, here's your shot at contributing :)
>
> The easiest thing would be to read the .chm file and search that like a
> .chm reader but the file is 3.4M so every request to search would require
> opening and reading the file (is that a deal breaker for a web server?).
You can read the core CHM structures (which are maybe a few MB) and
extract files only when needed, then cache them, depending on RAM
settings. CHM was made for this, to avoid decompressing everything just to
show a few files.
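
A minimal sketch of that "extract only when needed, then cache" idea, in
Python for brevity. The `extract` callable is a hypothetical stand-in for
whatever decompresses a single file out of the CHM, and `max_bytes` plays
the role of the RAM setting; neither is an existing API.

```python
from collections import OrderedDict
from typing import Callable


class PageCache:
    """LRU cache of extracted pages, bounded by total bytes, not entry count."""

    def __init__(self, extract: Callable[[str], bytes], max_bytes: int) -> None:
        self._extract = extract        # hypothetical: decompress one file from the CHM
        self._max_bytes = max_bytes    # the "RAM setting" for cached pages
        self._pages: "OrderedDict[str, bytes]" = OrderedDict()
        self._used = 0

    def get(self, path: str) -> bytes:
        if path in self._pages:
            self._pages.move_to_end(path)      # mark as most recently used
            return self._pages[path]
        data = self._extract(path)             # cache miss: extract on demand
        if len(data) <= self._max_bytes:       # oversized pages are served uncached
            self._pages[path] = data
            self._used += len(data)
            while self._used > self._max_bytes:
                _, evicted = self._pages.popitem(last=False)  # drop least recently used
                self._used -= len(evicted)
        return data
```

Frequently requested pages stay in memory, and everything else is
decompressed again on the next request.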

The CHMs together are 24MB, so with +10MB for the compressed core
structures and 100-200MB to dynamically cache frequently requested pages
(like the TOCs and sysutils), giving the server app 256MB would go a long way.

But while the index (which is more or less a text search on the titles of
topics) is easy to search, I'm not sure if we have read support for the
fulltext search. So possibly we only generate and write the tables, and
cannot read them back and do an actual search (but maybe this is in lhelp).

But the "F1" key in the Lazarus and textmode IDEs is based on the indexes,
not the fulltext search. The indexes are about 15-20MB as XML, but a third
of that as binary BTree (that form deduplicates strings).
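
To illustrate why a title index is easy to search, here is a small Python
sketch. It uses a sorted in-memory list with binary search as a stand-in
for the binary BTree; it does not read the CHM format, and the titles and
URLs in the usage lines are made up.

```python
import bisect
from typing import List, Tuple

IndexEntry = Tuple[str, str, str]  # (case-folded title, title, url)


def build_index(entries: List[Tuple[str, str]]) -> List[IndexEntry]:
    """Sort (title, url) pairs by case-folded title so they can be bisected."""
    return sorted((title.casefold(), title, url) for title, url in entries)


def lookup_prefix(index: List[IndexEntry], prefix: str,
                  limit: int = 20) -> List[Tuple[str, str]]:
    """Return up to `limit` (title, url) pairs whose title starts with `prefix`."""
    key = prefix.casefold()
    start = bisect.bisect_left(index, (key,))   # first entry >= the prefix
    hits = []
    for folded, title, url in index[start:start + limit]:
        if not folded.startswith(key):
            break                               # past the matching range
        hits.append((title, url))
    return hits


# Hypothetical entries, for illustration only.
index = build_index([
    ("TStringList", "classes/tstringlist.html"),
    ("TFPGObjectList", "fgl/tfpgobjectlist.html"),
    ("TFPGList", "fgl/tfpglist.html"),
])
print(lookup_prefix(index, "tfpg"))
```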
> Another option is reading the .chm file once and build a SQL database from
> all the class/method names.
I don't think this really is a case for SQL. A FastCGI server that loads the
files on startup, and checks once per 24hrs whether they have been updated,
would be enough. FastCGI, since it would be stateful.
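
The stateful part could look roughly like the Python sketch below: load
everything once, then compare file modification times at most once per 24
hours. This is only the in-process state a long-running server would keep,
not a FastCGI implementation; `ChmStore` and the `load` callable are
hypothetical names.

```python
import os
import time
from typing import Callable, Dict, Iterable

CHECK_INTERVAL = 24 * 60 * 60  # seconds, i.e. "once per 24hrs"


class ChmStore:
    """Keeps loaded files in memory and reloads the ones that changed on disk."""

    def __init__(self, paths: Iterable[str], load: Callable[[str], object],
                 clock: Callable[[], float] = time.time) -> None:
        self._load = load              # hypothetical: parse the core structures of one CHM
        self._clock = clock
        self._mtimes: Dict[str, float] = {}
        self._data: Dict[str, object] = {}
        for path in paths:             # load everything on startup
            self._reload(path)
        self._last_check = clock()

    def _reload(self, path: str) -> None:
        self._mtimes[path] = os.path.getmtime(path)
        self._data[path] = self._load(path)

    def get(self, path: str) -> object:
        now = self._clock()
        if now - self._last_check >= CHECK_INTERVAL:
            self._last_check = now
            for known, seen in list(self._mtimes.items()):
                if os.path.getmtime(known) != seen:   # file was replaced
                    self._reload(known)
        return self._data[path]
```

Requests in between checks never touch the disk, which is the reason for
keeping the process alive between requests.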

These are the core structures of the LCL.chm (14MB compressed HTML, 20000 HTML files):
compressed  offset     uncompr size  name                    // meaning
1           193170819  548926        /#STRINGS               // strings. Indexes, binary toc strings
                                                             // are deduplicated, stored here
0           112        149           /#SYSTEM                // settings
1           174179574  118460        /#TOCIDX                // binary form of toc, also available
                                                             // (but larger) as xml.
1           190311922  1001152       /#TOPICS                // <topic title, urltbl entry> map
1           191313074  1106149       /#URLSTR                // urls with urltbl reference
1           192419223  751596        /#URLTBL                // urltbl, topics lookup table
                                                             // for bidirectional lookup
1           193719745  3407168       /$FIftiMain             // fulltextsearch data
1           174176823  2751          /$OBJINST
1           175356422  3549260       /$WWKeywordLinks/BTree  // index in binary format

The TOC is the giant page with a treeview of all topics (more than one even,
by unit and alphabetically; it is the treeview on the first tab if you open
the CHM in Windows).

So that is about 12MB of uncompressed indexes and metadata for everything.
The uncompressed indexes and metadata are meant to be searched directly.

The LCL.chm is about 60% of the size of the current CHMs, so that would
extrapolate to all indexes being about 10/6*12=20MB. Probably less, because
the LCL has little content and an enormous number of topics.
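
As a rough cross-check of that arithmetic, the Python sketch below sums the
uncompressed sizes of the nine entries in the listing and applies the same
60% scaling. The listed entries alone add up to about 10.5MB, a little
under the rounded 12MB figure (the listing may not show every internal
file), so the extrapolation lands nearer 17.5MB than 20MB.

```python
# Rows copied from the listing: compressed flag, offset, uncompressed size, name.
LISTING = """\
1 193170819 548926 /#STRINGS
0 112 149 /#SYSTEM
1 174179574 118460 /#TOCIDX
1 190311922 1001152 /#TOPICS
1 191313074 1106149 /#URLSTR
1 192419223 751596 /#URLTBL
1 193719745 3407168 /$FIftiMain
1 174176823 2751 /$OBJINST
1 175356422 3549260 /$WWKeywordLinks/BTree
"""


def parse_listing(text):
    """Return (name, uncompressed size in bytes) for each row."""
    rows = []
    for line in text.splitlines():
        _compressed, _offset, size, name = line.split(maxsplit=3)
        rows.append((name, int(size)))
    return rows


rows = parse_listing(LISTING)
total = sum(size for _, size in rows)   # bytes of listed core structures
lcl_share = 0.6                         # "about 60%" from the text
estimate_all = total / lcl_share        # same scaling as 10/6 * size

print(f"LCL core structures: {total / 1e6:.1f} MB")
print(f"all CHMs (estimate): {estimate_all / 1e6:.1f} MB")
```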