159 lines
5.0 KiB
Plaintext
159 lines
5.0 KiB
Plaintext
<!doctype article PUBLIC "-//Davenport//DTD DocBook V3.0//EN" [
|
|
<!entity Evolution "<application>Evolution</application>">
|
|
<!entity Camel "Camel">
|
|
<!entity Ibex "Ibex">
|
|
]>
|
|
|
|
<article class="whitepaper" id="ibex">
|
|
|
|
<artheader>
|
|
<title>Ibex: an Indexing System</title>
|
|
|
|
<authorgroup>
|
|
<author>
|
|
<firstname>Dan</firstname>
|
|
<surname>Winship</surname>
|
|
<affiliation>
|
|
<address>
|
|
<email>danw@helixcode.com</email>
|
|
</address>
|
|
</affiliation>
|
|
</author>
|
|
</authorgroup>
|
|
|
|
<copyright>
|
|
<year>2000</year>
|
|
<holder>Helix Code, Inc.</holder>
|
|
</copyright>
|
|
|
|
</artheader>
|
|
|
|
<sect1 id="introduction">
|
|
<title>Introduction</title>
|
|
|
|
<para>
|
|
&Ibex; is a library for text indexing. It is being used by
|
|
&Camel; to allow it to quickly search locally-stored messages,
|
|
either because the user is looking for a specific piece of text,
|
|
or because the application is contructing a vFolder or filtering
|
|
incoming mail.
|
|
</para>
|
|
</sect1>
|
|
|
|
<sect1 id="goals">
|
|
<title>Design Goals and Requirements for Ibex</title>
|
|
|
|
<para>
|
|
The design of &Ibex; is based on a number of requirements.
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
First, obviously, it must be fast. In particular, searching
|
|
the index must be appreciably faster than searching through
|
|
the messages themselves, and constructing and maintaining
|
|
the index must not take a noticeable amount of time.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
The indexes must not take up too much space. Many users have
|
|
limited filesystem quotas on the systems where they read
|
|
their mail, and even users who read mail on private machines
|
|
have to worry about running out of space on their disks. The
|
|
indexes should be able to do their job without taking up so
|
|
much space that the user decides he would be better off
|
|
without them.
|
|
</para>
|
|
|
|
<para>
|
|
Another aspect of this problem is that the system as a whole
|
|
must be clever about what it does and does not index:
|
|
accidentally indexing a "text" mail message containing
|
|
uuencoded, BinHexed, or PGP-encrypted data will drastically
|
|
affect the size of the index file. Either the caller or the
|
|
indexer itself has to avoid trying to index these sorts of
|
|
things.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
The indexing system must allow data to be added to the index
|
|
incrementally, so that new messages can be added to the
|
|
index (and deleted messages can be removed from it) without
|
|
having to re-scan all existing messages.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
It must allow the calling application to explain the
|
|
structure of the data however it wants to, rather than
|
|
requiring that the unit of indexing be individual files.
|
|
This way, &Camel; can index a single mbox-format file and
|
|
treat it as multiple messages.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
It must support non-ASCII text, given that many people send
|
|
and receive non-English email, and even people who only
|
|
speak English may receive email from people whose names
|
|
cannot be written in the US-ASCII character set.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>
|
|
While there are a number of existing indexing systems, none of
|
|
them met all (or even most) of our requirements.
|
|
</para>
|
|
</sect1>
|
|
|
|
<sect1 id="implementation">
|
|
<title>The Implementation</title>
|
|
|
|
<para>
|
|
&Ibex; is still young, and many of the details of the current
|
|
implementation are not yet finalized.
|
|
</para>
|
|
|
|
<para>
|
|
With the current index file format, 13 megabytes of Info files
|
|
can be indexed into a 371 kilobyte index file—a bit under
|
|
3% of the original size. This is reasonable, but making it
|
|
smaller would be nice. (The file format includes some simple
|
|
compression, but <application>gzip</application> can compress an
|
|
index file to about half its size, so we can clearly do better.)
|
|
</para>
|
|
|
|
<para>
|
|
The implementation has been profiled and optimized for speed to
|
|
some degree. But, it has so far only been run on a 500MHz
|
|
Pentium III system with very fast disks, so we have no solid
|
|
benchmarks.
|
|
</para>
|
|
|
|
<para>
|
|
Further optimization (of both the file format and the in-memory
|
|
data structures) awaits seeing how the library is most easily
|
|
used by &Evolution;: if the indexes are likely to be kept in
|
|
memory for long periods of time, the in-memory data structures
|
|
need to be kept small, but the reading and writing operations
|
|
can be slow. On the other hand, if the indexes will only be
|
|
opened when they are needed, reading and writing must be fast,
|
|
and memory usage is less critical.
|
|
</para>
|
|
|
|
<para>
|
|
Of course, to be useful for other applications that have
|
|
indexing needs, the library should provide several options, so
|
|
that each application can use the library in the way that is
|
|
most suited for its needs.
|
|
</para>
|
|
</sect1>
|
|
</article>
|