evolution/doc/white-papers/mail/ibex.sgml

<!doctype article PUBLIC "-//Davenport//DTD DocBook V3.0//EN" [
<!entity Evolution "<application>Evolution</application>">
<!entity Camel "Camel">
<!entity Ibex "Ibex">
]>

<article class="whitepaper" id="ibex">

  <artheader>
    <title>Ibex: an Indexing System</title>

    <authorgroup>
      <author>
	<firstname>Dan</firstname>
	<surname>Winship</surname>
	<affiliation>
	  <address>
	    <email>danw@helixcode.com</email>
	  </address>
	</affiliation>
      </author>
    </authorgroup>

    <copyright>
      <year>2000</year>
      <holder>Helix Code, Inc.</holder>
    </copyright>

  </artheader>

  <sect1 id="introduction">
    <title>Introduction</title>

    <para>
      &Ibex; is a library for text indexing. It is being used by
      &Camel; to allow it to quickly search locally-stored messages,
      either because the user is looking for a specific piece of text,
      or because the application is contructing a vFolder or filtering
      incoming mail.
    </para>
  </sect1>

  <sect1 id="goals">
    <title>Design Goals and Requirements for Ibex</title>

    <para>
      The design of &Ibex; is based on a number of requirements.

    <itemizedlist>
      <listitem>
        <para>
	  First, obviously, it must be fast. In particular, searching
	  the index must be appreciably faster than searching through
	  the messages themselves, and constructing and maintaining
	  the index must not take a noticeable amount of time.
	</para>
      </listitem>

      <listitem>
        <para>
	  The indexes must not take up too much space. Many users have
	  limited filesystem quotas on the systems where they read
	  their mail, and even users who read mail on private machines
	  have to worry about running out of space on their disks. The
	  indexes should be able to do their job without taking up so
	  much space that the user decides he would be better off
	  without them.
	</para>

	<para>
	  Another aspect of this problem is that the system as a whole
	  must be clever about what it does and does not index:
	  accidentally indexing a "text" mail message containing
	  uuencoded, BinHexed, or PGP-encrypted data will drastically
	  affect the size of the index file. Either the caller or the
	  indexer itself has to avoid trying to index these sorts of
	  things.
	</para>
      </listitem>

      <listitem>
        <para>
	  The indexing system must allow data to be added to the index
	  incrementally, so that new messages can be added to the
	  index (and deleted messages can be removed from it) without
	  having to re-scan all existing messages.
	</para>
      </listitem>

      <listitem>
        <para>
	  It must allow the calling application to explain the
	  structure of the data however it wants to, rather than
	  requiring that the unit of indexing be individual files.
	  This way, &Camel; can index a single mbox-format file and
	  treat it as multiple messages.
	</para>
      </listitem>

      <listitem>
        <para>
	  It must support non-ASCII text, given that many people send
	  and receive non-English email, and even people who only
	  speak English may receive email from people whose names
	  cannot be written in the US-ASCII character set.
	</para>
      </listitem>
    </itemizedlist>

    <para>
      While there are a number of existing indexing systems, none of
      them met all (or even most) of our requirements.
    </para>
  </sect1>

  <sect1 id="implementation">
    <title>The Implementation</title>

    <para>
      &Ibex; is still young, and many of the details of the current
      implementation are not yet finalized.
    </para>

    <para>
      With the current index file format, 13 megabytes of Info files
      can be indexed into a 371 kilobyte index file&mdash;a bit under
      3% of the original size. This is reasonable, but making it
      smaller would be nice. (The file format includes some simple
      compression, but <application>gzip</application> can compress an
      index file to about half its size, so we can clearly do better.)
    </para>

    <para>
      The implementation has been profiled and optimized for speed to
      some degree. But, it has so far only been run on a 500MHz
      Pentium III system with very fast disks, so we have no solid
      benchmarks.
    </para>

    <para>
      Further optimization (of both the file format and the in-memory
      data structures) awaits seeing how the library is most easily
      used by &Evolution;: if the indexes are likely to be kept in
      memory for long periods of time, the in-memory data structures
      need to be kept small, but the reading and writing operations
      can be slow. On the other hand, if the indexes will only be
      opened when they are needed, reading and writing must be fast,
      and memory usage is less critical.
    </para>

    <para>
      Of course, to be useful for other applications that have
      indexing needs, the library should provide several options, so
      that each application can use the library in the way that is
      most suited for its needs.
    </para>
  </sect1>
</article>