evolution/devel-docs/query/virtual-folder-in-depth.sgml

<!doctype article PUBLIC "-//Davenport//DTD DocBook V3.0//EN" []>

<!-- SGMLized by Bertrand <Bertrand.Guiheneuf@aful.org> -->

<article id="index">
  <artheader>
    <authorgroup>
      <author>
	<firstname>Giao</firstname>
	<surname>Nguyen</surname>
      </author>
    </authorgroup>

    <title>An in-depth look at the virtual folder mechanism</title>
    <abstract>
      <para>
	This document describes a different way of approaching mail
	organization and how all things are possible in this brave new
	world. This document does not describe physical storage issues
	nor interface issues.
      </para>
      <para>
	Historically mail has been organized into folders. These
	folders usually mapped to a single storage medium. The
	relationship between mail organization and storage medium was
	one to one. There was one mail organization for every storage
	medium. This scheme had its limitations.
      </para>
      <para>
	Efforts at categorization are only meaningful at the instant that
	one categorizes. Finding any piece of data, regardless of how well
	it was categorized, requires some amount of searching. Therefore, any
	attempt to eliminate searching is doomed to fail. It's time to embrace
	searching as a way of life.
      </para>
      <para>
	The terms used in this document are defined in the next section. The
	example rules used are based on the syntax for VM
	(http://www.wonderworks.com/vm/) by Kyle Jones, whose ideas form the
	basis for this. I'm only adding the existence of summary files to aid
	in scaling. I currently use VM and its virtual-folder rules for my
	daily mail purposes. To date, my only complaints are that it is slow
	(it has no caches) and that, for the uninitiated, it's not very
	user-friendly.
      </para>
      <para>
	Comments, questions, rants, etc. should be directed at Giao Nguyen
	(grail@cafebabe.org) who will try to address issues in a timely
	manner.
      </para>
    </abstract>
  </artheader>

  <!-- Definitions -->
  <sect1 id="definitions">
    <title>Definitions</title>
    <sect2>
      <title>Store</title>
      <para>
	A location where mail can be found. This may be a file (Berkeley
	mbox), a directory (MH), an IMAP server, a POP3 server, an Exchange
	server, a Lotus Notes server, or even a stack of Post-Its by your
	monitor fed through some OCR system.
      </para>
    </sect2>

    <sect2>
      <title>Message</title>
      <para>
	An individual mail message.
      </para>
    </sect2>
    <sect2>
      <title>Vfolder</title>
      <para>
	A group of messages sharing some commonality; it is the result of a
	query. A vfolder may be contained in a store, but a store need not
	hold only one vfolder. There is always an implicit vfolder rule which
	matches all messages: a store contains the vfolder which is the result
	of the query (any). The name is short for virtual folder, or maybe
	view folder.
      </para>
    </sect2>
    <sect2>
      <title>Default-vfolder</title>
      <para>
	The vfolder defined by (any) applied to the store. This is not the
	inbox. The inbox could easily be defined by a query. A default rule
	for the inbox could be (new) but it doesn't have to be. Mine happens
	to be (or (unread) (new)).
      </para>
    </sect2>
    <sect2>
      <title>Folder</title>
      <para>
	The classical mail folder approach: one message organization per
	store.
      </para>
    </sect2>
    <sect2>
      <title>Query</title>
      <para>
	A search for messages. The result of this is a vfolder. There are two
	kinds of queries: named queries and lambda queries. More on this
	later.
      </para>
    </sect2>
    <sect2>
      <title>Summary file</title>
      <para>
	An external file that contains pointers to messages which are matches
	for a named query. In addition to pointers, the summary file should
	also contain signatures of the store for sanity checks. When the term
	"index" is used as a verb, it means to build a summary file for a
	given name-value pair.
      </para>
    </sect2>
  </sect1>

  <!-- Queries -->
  <sect1>
    <title>Queries</title>
    <para>
      Named queries are analogous to classical mail folders. Because named
      queries may be reused, summary files are kept as caches to reduce
      the overall cost of viewing a vfolder. Summary files are superior to
      folders in that they allow the same message to appear in multiple
      vfolders without being duplicated. Duplicating a message defeats
      attempts at tagging it with additional user information such as
      annotations, because each copy would carry its own tags. Named
      queries will define folders.
    </para>
    <para>
      Lambda queries are similar to named queries except that they have no
      name. These are created on the fly by the user to filter out or
      include certain messages.
    </para>
    <para>
      All queries can be layered on top of each other. A lambda query can be
      layered on a named query and a named query can be layered on a lambda
      query. The possibilities are endless.
    </para>
    <para>
      The layerings can be done as boolean operations (and, or, not). Short
      circuiting should be used.
    </para>
    <para>
      Examples:
      <programlisting>
(and (author "Giao")
  (unread))
      </programlisting>
      The (unread) query should only be evaluated on the results of (author
      "Giao").
      <programlisting>
(or (author "Giao")
  (unread))
      </programlisting>
      Both of these queries should be evaluated. Any matches are added to the
      resulting vfolder.
    </para>
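    <para>
      The layering rules above can be sketched in a few lines of Python.
      This is only an illustration, not the planned implementation: the
      nested-tuple query representation, the message fields (id, author,
      unread) and the function name evaluate are all assumptions made for
      the example. Each subquery of an 'and' only sees what the previous
      subqueries let through, while every subquery of an 'or' runs on the
      full input and the matches are merged.
    </para>
    <programlisting>
```python
# Hypothetical message model: dicts with "id", "author" and "unread" keys.
LEAVES = {
    "any": lambda msg: True,
    "author": lambda msg, name: name in msg["author"],
    "unread": lambda msg: msg["unread"],
}


def evaluate(query, messages):
    """Return the messages matched by a query written as nested tuples."""
    op, args = query[0], query[1:]
    if op == "and":
        result = messages
        for sub in args:
            # Each subquery only sees what the previous ones let through.
            result = evaluate(sub, result)
            if not result:
                break  # short circuit: nothing left to test
        return result
    if op == "or":
        seen, result = set(), []
        for sub in args:
            # Every subquery runs on the full input; matches are merged.
            for msg in evaluate(sub, messages):
                if msg["id"] not in seen:
                    seen.add(msg["id"])
                    result.append(msg)
        return result
    if op == "not":
        excluded = {msg["id"] for msg in evaluate(args[0], messages)}
        return [msg for msg in messages if msg["id"] not in excluded]
    # Anything else is a leaf function applied message by message.
    return [msg for msg in messages if LEAVES[op](msg, *args)]
```
    </programlisting>
    <para>
      With this representation, the first example above would be written
      as ("and", ("author", "Giao"), ("unread",)).
    </para>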
  </sect1>

  <!-- Summary files -->
  <sect1>
    <title>Summary files</title>
    <para>
      Summary files are only meaningful when applied to the context of the
      default-vfolder of a store.
    </para>
    <para>
      Summary files should be generated for queries of the form:
      <programlisting>
(function "constant value")
      </programlisting>
      Summary files should never be generated for queries of the form:
      <programlisting>
(function (function1))

(and (function "value")
  (another-function "another value"))
      </programlisting>
      Given a query of the form:
      <programlisting>
(and (function "value")
  (another-function "another value"))
      </programlisting>
      the system should use one summary file for (function "value") and
      another summary file for (another-function "another value"). I will
      call a query of the form (function "constant value") the "plain
      form".
    </para>
    <para>
      Note that the signature of the store must allow for new data having
      been added to the store since the application generated the summary
      file. A signature generated on the entirety of the store will most
      likely be meaningless for things like POP/IMAP servers.
    </para>
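    <para>
      One possible way to get a signature that tolerates appended mail is
      sketched below. Everything in it is an assumption made for
      illustration: the in-memory summary layout, the field names, the use
      of SHA-1 over message identifiers, and the idea of signing only the
      part of the store the summary covers. No on-disk format has been
      defined yet.
    </para>
    <programlisting>
```python
import hashlib


def signature(store, count):
    """Digest of the first `count` message ids, so later appends do not change it."""
    digest = hashlib.sha1()
    for msg in store[:count]:
        digest.update(msg["id"].encode("utf-8"))
    return digest.hexdigest()


def build_summary(store, field, value):
    """Build an in-memory summary for the plain-form query (field "value")."""
    return {
        "query": (field, value),
        "covered": len(store),  # how many messages were indexed
        "signature": signature(store, len(store)),
        # Pointers are positions in the store in this sketch.
        "pointers": [i for i, msg in enumerate(store) if value in msg[field]],
    }


def is_usable(summary, store):
    """True if the part of the store the summary covers is unchanged."""
    if summary["covered"] > len(store):
        return False  # the store shrank; the pointers cannot be trusted
    return signature(store, summary["covered"]) == summary["signature"]
```
    </programlisting>
    <para>
      Messages beyond the covered count are simply not indexed yet; they
      do not invalidate the summary.
    </para>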
  </sect1>

  <!-- Incremental Indexing -->
  <sect1>
    <title>Incremental indexing</title>
    <para>
      When new messages are detected, all known queries should be evaluated
      on the new messages. Vfolders should be notified of new messages that
      are positive matches for their queries. The indexes generated by this
      process should be merged into the current indexes for each vfolder.
    </para>
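    <para>
      As a rough sketch of that process, assume the indexes are kept as a
      mapping from a plain-form query to a list of message identifiers.
      That layout, the substring matching and the function names are
      hypothetical; the point is only that the known queries run on the
      new messages alone and the results are merged in.
    </para>
    <programlisting>
```python
def matches(query, msg):
    """Plain-form match: (field, value) matches when value occurs in the field."""
    field, value = query
    return value in msg.get(field, "")


def index_incrementally(indexes, new_messages):
    """Merge new matches into each index; return what each vfolder should be told."""
    notifications = {}
    for query, pointers in indexes.items():
        # Only the new messages are examined, never the whole store.
        fresh = [msg["id"] for msg in new_messages
                 if matches(query, msg) and msg["id"] not in pointers]
        if fresh:
            pointers.extend(fresh)
            notifications[query] = fresh
    return notifications
```
    </programlisting>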
  </sect1>

  <!-- Can I have multiple stores -->
  <sect1>
    <title>Can I have multiple stores?</title>
    <para>
      I don't see why not. Again, the inbox is a vfolder, so you can get a
      unified inbox consisting of all new mail sent to all your stores, or
      you can get inboxes for each store, or any combination your heart
      desires. You get your cake, eat it, and someone else cleans the dishes!
    </para>
  </sect1>

  <!-- Why all this? -->
  <sect1>
    <title>Why all this?</title>
    <para>
      Consider the dynamic nature of the following query:
      <programlisting>
(and (author "Giao")
  (sent-after (today-midnight)))
      </programlisting>
      today-midnight would be a function that is evaluated at run-time to
      calculate the appropriate object.
    </para>
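    <para>
      To make the run-time evaluation concrete, here is a small sketch
      that replaces (today-midnight) with an actual date object just
      before the query is run. The nested-tuple query representation and
      the helper name resolve are assumptions for the example.
    </para>
    <programlisting>
```python
from datetime import datetime, time


def today_midnight(now=None):
    """Midnight at the start of the current day (or of `now`, if given)."""
    now = now or datetime.now()
    return datetime.combine(now.date(), time.min)


def resolve(query, now=None):
    """Replace every ("today-midnight",) node with a concrete datetime."""
    if isinstance(query, tuple):
        if query == ("today-midnight",):
            return today_midnight(now)
        return tuple(resolve(part, now) for part in query)
    return query  # constants such as "Giao" are left alone
```
    </programlisting>
    <para>
      The same stored query therefore resolves to a different concrete
      query each day, which is why its result cannot simply be cached.
    </para>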
  </sect1>

  <!-- Scenarios of usage and their solutions -->
  <sect1>
    <title>Scenarios of usage and their solutions</title>
    <sect2>
      <title>Message alterations</title>
      <para>
	This is a fuzzy area that should be left to the UI to handle. Messages
	are altered. Read status is altered when a new message is read, for
	example. How do we handle this if our query is for unread messages?
	Upon viewing, the state would change.
      </para>
      <para>
	One idea is to not evaluate the queries unless we're changing between
	vfolder views. This assumes that one can only view a particular
	vfolder at a time. For multi-vfolder viewing, a message change should
	propagate through the vfolder system. Certain effects (as in our
	example) would not be intuitive.
      </para>
      <para>
	It would not be a clean solution to make special cases but they may be
	necessary where certain defined fields are ignored when they are
	changed. Some combination of the above rules can be used. I don't
	think it's an easy solution.
      </para>
    </sect2>
    <sect2>
      <title>Message inclusion and exclusion</title>
      <para>
	Messages are also included and excluded with queries. The final query
	will have the form:
	<programlisting>
(and (author "Giao")
  (criteria value)
  (not (criteria other-value)))
	</programlisting>
	The criteria may be userland labels of some sort or Message-IDs. What
	are the performance issues involved in this? With short circuiting,
	it's not a major problem.
      </para>
      <para>
	The criteria and values are determined by the UI. The vfolder
	mechanism isn't concerned with such issues.
      </para>
      <para>
	Messages can be included and excluded at will. The idea is often
	called "arbitrary inclusion/exclusion". This can be done by
	Message-IDs or other fields. It's been noted that Message-IDs are not
	unique.
      </para>
      <para>
	I propose that any given vfolder is allocated an inclusion label and an
	exclusion label. These should be randomly generated. This should be
	part of the vfolder description. It should be noted that the vfolder
	description has not been drafted yet.
      </para>
      <para>
	As a result, the rule for a given named query is:
	<programlisting>
(and (user-query)
  (label inclusion-label)
  (not (label exclusion-label)))
	</programlisting>
      </para>
    </sect2>
    <sect2>
      <title>Query scheduling</title>
      <para>
	Consider the following extremely dynamic queries:
	<programlisting>
A:
(and (author "Giao")
  (sent-after (today-midnight)))

B:
(and (sent-after (today-midnight))
  (author "Giao"))

C:
(or (author "Giao")
  (sent-after (today-midnight)))
	</programlisting>
	Query A would be significantly faster because (author "Giao") is not
	dynamic: a summary file could be generated for that subquery, and
	(sent-after (today-midnight)) would then only be evaluated on its
	results. Query B is slow as written and could be optimized if there
	were a query compiler of some sort. Query C demonstrates a query to
	which no good optimization can be applied. Summary files come with a
	certain amount of baggage.
      </para>
      <para>
	It seems then that for boolean 'and' operations, plain forms should be
	moved forward and other queries should be moved such that they are
	evaluated later. I would expect that the majority of queries would be
	of the plain form.
      </para>
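      <para>
	A query compiler of the kind hinted at could start with a reordering
	pass like the sketch below. It treats a query as nested tuples and
	considers a subquery "plain" when it is a function applied only to
	constant values; both the representation and the function names are
	assumptions for the example.
      </para>
      <programlisting>
```python
OPERATORS = ("and", "or", "not")


def is_plain(query):
    """Plain form: a function applied only to constants, e.g. (author "Giao")."""
    return (query[0] not in OPERATORS
            and len(query) > 1
            and all(not isinstance(arg, tuple) for arg in query[1:]))


def schedule(query):
    """Move the plain-form subqueries of every 'and' to the front."""
    if query[0] not in OPERATORS:
        return query  # leaf queries are left untouched
    subs = [schedule(sub) for sub in query[1:]]
    if query[0] == "and":
        # sorted() is stable, so subqueries keep their relative order
        # within the plain group and within the non-plain group.
        subs = sorted(subs, key=lambda sub: not is_plain(sub))
    return (query[0],) + tuple(subs)
```
      </programlisting>
      <para>
	Applied to query B, this pass yields query A; queries A and C come
	out unchanged.
      </para>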
      <para>
	The baggage that comes with summary files is twofold. First, a summary
	file is tied to both the query and the store where the query
	originates. Second, a hashing function for strings needs to be
	calculated for the query so that the query and its summary file can
	be associated. This hashing function could be similar to the one
	described in Kernighan and Pike's "The Practice of Programming".
	(FIXME: Stick page number here)
      </para>
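      <para>
	As an illustration only, a simple multiplicative string hash could
	be used to derive a summary file name from the store and the query
	text. The multiplier, the 32-bit range and the file naming scheme
	below are assumptions for this sketch, not a settled design.
      </para>
      <programlisting>
```python
MULTIPLIER = 31  # assumed constant; small odd multipliers are a common choice


def query_hash(text, buckets=2**32):
    """Multiplicative string hash: h = MULTIPLIER * h + byte, modulo `buckets`."""
    h = 0
    for byte in text.encode("utf-8"):
        h = (MULTIPLIER * h + byte) % buckets
    return h


def summary_filename(store_name, query_text):
    """Hypothetical naming scheme tying a summary file to store and query."""
    return "{}-{:08x}.summary".format(store_name, query_hash(query_text))
```
      </programlisting>
      <para>
	Two different queries can hash to the same value, so the summary
	file should also record the query text it was built for.
      </para>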
    </sect2>
    <sect2>
      <title>Archives</title>
      <para>
	Many people are concerned that archives won't be preserved, archives
	aren't supported, and many other archive related issues. This is the
	short version.
      </para>
      <para>
	Archives are just that, archives. Archives are stores. Take your
	vfolder, export it to a store. You are done. If you load up the store
	again, then the default-vfolder of that store is the view of the
	vfolder, except the query is different.
      </para>
      <para>
	The point of vfolders is not to do away with the classical folder
	representation but to move queries to the front, where they make
	data management easier for people who think in terms of queries
	rather than files. Ordinary people don't think in terms of files.
      </para>
    </sect2>
  </sect1>

  <!-- Miscellany -->
  <sect1>
    <title>Miscellany</title>
    <sect2>
      <title>Annotations</title>
      <para>
	There should be a scheme to add annotations to messages. Common mail
	user agents have used a tag in the message header to mark messages as
	read or unread, for example. Extending this, we could add our own
	data to a message to give it meaning. A good scheme for doing this
	would open up new possibilities.
      </para>
      <sect3>
	<title>Keywords</title>
	<para>
	  When sending a message, a message could have certain keywords attached
	  to it. While this can be done with the subject line, the subject line
	  has a tendency to be munged by other mail applications. One popular
	  example is the "[rR]e:" prefix. Using the subject line also breaks the
	  "contract" with other mail user agents. Using keywords in another
	  field in the message header allows the sender to assist the recipient
	  in organizing data automatically. Note that the sender can only
	  provide hints as the sender is unlikely to know the organization
	  schemes of the recipient.
	</para>
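	<para>
	  On the receiving side, such hints could be read with very little
	  code. The sketch below uses Python's standard email package and
	  assumes the sender put a comma-separated list in a "Keywords"
	  header field; the choice of field and the function name are
	  assumptions for the example.
	</para>
	<programlisting>
```python
from email import message_from_string


def keywords(raw_message, header="Keywords"):
    """Return the keywords found in the given header field, if any."""
    msg = message_from_string(raw_message)
    values = msg.get_all(header, [])  # the field may appear more than once
    return [word.strip()
            for value in values
            for word in value.split(",")
            if word.strip()]
```
	</programlisting>
	<para>
	  The resulting list could then feed a query such as a hypothetical
	  (keyword "vfolder").
	</para>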
      </sect3>
    </sect2>
    <sect2>
      <title>Scope</title>
      <para>
	Let us assume that we have multiple stores. Does a query work on a
	given store? Or does it work on all stores? Or is it configurable such
	that a query can work on a user-selected list of stores?
      </para>
    </sect2>
  </sect1>

  <!-- Alternatives to the above -->
  <sect1>
    <title>Alternatives to the above</title>
    <para>
      Jim Meyer (purp@selequa.com) is putting together some notes on where
      annotations need to be located. They will appear here, along with
      any contributions I may have to them.
    </para>
  </sect1>
</article>