Stability
---------
* ibex_open should never crash, and should never return NULL without
  errno being set. Should check for errors when reading.

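A minimal sketch of the errno rule above, in plain C. The names
read_index_header and struct index_header are made up for illustration
and are not ibex's actual API or file format. The point is only that a
short read leaves errno untouched, so it has to be set by hand before
returning NULL.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical on-disk header; the real ibex format is not assumed here. */
struct index_header {
    char magic[4];
    unsigned int nfiles;
};

/* Hypothetical helper: returns a malloc'ed header, or NULL with errno set. */
struct index_header *read_index_header(const char *path)
{
    struct index_header *hdr;
    int err;
    FILE *f = fopen(path, "rb");

    if (f == NULL)
        return NULL;                /* POSIX fopen has already set errno */

    hdr = malloc(sizeof *hdr);
    if (hdr == NULL) {
        fclose(f);
        errno = ENOMEM;
        return NULL;
    }

    if (fread(hdr, sizeof *hdr, 1, f) != 1) {
        /* A truncated file is not an I/O error, so fread leaves errno
         * alone in that case; pick a value ourselves. */
        err = ferror(f) ? EIO : EINVAL;
        free(hdr);
        fclose(f);
        errno = err;                /* set last: free/fclose may clobber it */
        return NULL;
    }

    fclose(f);
    return hdr;
}
```

Which errno values to report (EINVAL for a truncated file here) is a
design choice, not something these notes settle.
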
Performance
-----------
* Profiling, keep thinking about data structures, etc.

* Check memory usage

* See if writing the "inverse image" of long ref streams helps
  compression without hurting performance now (i.e., if a word appears in
  more than half of the files, write out the list of files it _doesn't_
  appear in). I tried this before, and it wasn't working well, but the
  file format and data structures have changed a lot since then.

* We could save a noticeable chunk of time if normalize_word computed
  the hash of the word as it went, and we could then pass that into
  g_hash_table_insert somehow instead of hashing the word a second time.

* Make a copy of the buffer to be indexed (or provide an interface for
  the caller to say ibex can munge the provided data) and then use that
  rather than constantly copying things. ?

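A rough sketch of the "inverse image" idea from the list above, in plain
C. struct ref_list and encode_refs are hypothetical names, not the ibex
data structures or file format; file ids are assumed to be ascending and
in the range [0, nfiles).

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory form of one word's reference list. */
struct ref_list {
    int inverted;       /* nonzero: ids are the files the word is NOT in */
    size_t count;       /* number of entries in ids */
    unsigned int *ids;  /* ascending file ids, malloc'ed */
};

/* refs: ascending ids of the files that contain the word.
 * Stores whichever of the list or its complement is shorter.
 * Returns 0 on success, -1 if allocation fails. */
int encode_refs(const unsigned int *refs, size_t nrefs, size_t nfiles,
                struct ref_list *out)
{
    size_t i, j = 0, k = 0;

    /* Invert only when the word is in more than half of the files. */
    out->inverted = (2 * nrefs > nfiles);
    out->count = out->inverted ? nfiles - nrefs : nrefs;

    /* Allocate at least one slot so a zero count is still valid. */
    out->ids = malloc((out->count ? out->count : 1) * sizeof *out->ids);
    if (out->ids == NULL)
        return -1;

    if (!out->inverted) {
        memcpy(out->ids, refs, nrefs * sizeof *refs);
        return 0;
    }

    /* Walk every file id and keep the ones missing from refs. */
    for (i = 0; i < nfiles; i++) {
        if (j < nrefs && refs[j] == i)
            j++;
        else
            out->ids[k++] = (unsigned int)i;
    }
    return 0;
}
```

Whether this actually helps depends on how the stream is compressed on
disk, which is the open question in the item above.
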
Functionality
-------------
* ibex file locking

* specify file mode in ibex_open

* ibex_find* need to normalize the search words... should this be done
  by the caller or by ibex_find?

* There needs to be some way to do a secondary search after getting
  results back from ibex_find* (i.e., for "foo near bar"). This either
  has to be done by ibex, or requires us to export the normalize
  interface.

* Does there need to be an ibex_find_any, or is that easy enough for the
  caller to do?

* utf8_trans needs to cover at least two more code pages. This is
  tricky because it's not clear whether some of the letters there should
  be translated to ASCII or left as UTF8. This requires some
  investigation.

* ibex_index_* need to ignore HTML tags.
    NAME = [A-Za-z][A-Za-z0-9.-]*
    </?{NAME}(\s*{NAME}(\s*=\s*({NAME}|"[^"]*"|'[^']*'))?)*\s*>
    <!(--([^-]|-[^-])*--\s*)*>

  ugh. ok, simplifying, we get:
    <[^!>]([^"'>]|"[^"]*"|'[^']*')*> or
    <!(--([^-]|-[^-])*--\s*)*>

  which is still not simple. sigh.

* ibex_index_* need to recognize and ignore "non-text", particularly
  BinHex and uuencoding.
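
For the HTML item above, a sketch of how the simplified tag and comment
patterns might be applied with POSIX regcomp/regexec. blank_html_tags is
a hypothetical helper, not an ibex function. The pattern is a slightly
adjusted form of the simplified one (POSIX EREs have no \s, so
[[:space:]] is used), and tags are overwritten with spaces rather than
removed, so the buffer length and word boundaries stay the same.

```c
#include <regex.h>
#include <string.h>

/* Simplified tag-or-comment pattern, written as a POSIX ERE. */
static const char *TAG_PATTERN =
    "<[^!>]([^\"'>]|\"[^\"]*\"|'[^']*')*>"
    "|<!(--([^-]|-[^-])*--[[:space:]]*)*>";

/* Hypothetical helper: overwrite every tag or comment in buf with spaces.
 * Returns 0 on success, -1 if the pattern fails to compile. */
int blank_html_tags(char *buf)
{
    regex_t re;
    regmatch_t m;
    char *p = buf;

    if (regcomp(&re, TAG_PATTERN, REG_EXTENDED) != 0)
        return -1;

    /* Each match starts with '<', so it is never empty and p advances. */
    while (regexec(&re, p, 1, &m, 0) == 0) {
        memset(p + m.rm_so, ' ', (size_t)(m.rm_eo - m.rm_so));
        p += m.rm_eo;
    }

    regfree(&re);
    return 0;
}
```

Compiling the pattern on every call is wasteful; real code would compile
it once. A hand-written scanner may also turn out simpler than the regex.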