Indexing and searching Outlook email

I never thought I'd find myself digging around in my Outlook message store, but Mark's SpamBayes addin -- which is written in Python -- turns out to be a great Python/MAPI tutorial. Borrowing heavily from his examples, I came up with a script to extract my Outlook mail to a bunch of files that I could feed to a standalone indexer. [Full story at O'Reilly Network]
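
For a sense of what such an extraction script involves, here is a minimal sketch. It is not the script from the article. It goes through the Outlook object model via pywin32's win32com, which may well differ from the MAPI calls the original borrows from SpamBayes, and it is written for current Python. The function names, the file naming scheme, and the header layout are all assumptions made for illustration.

```python
import os


def outlook_inbox_items():
    """Return the Inbox's item collection via the Outlook object model.

    Needs Windows, Outlook, and the pywin32 package. The import is done
    lazily so the rest of this sketch can run anywhere.
    """
    import win32com.client  # pywin32
    namespace = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
    return namespace.GetDefaultFolder(6).Items  # 6 = olFolderInbox


def dump_messages(items, out_dir):
    """Write each message to its own text file; return the paths written."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for n, msg in enumerate(items):
        # Not every item is a mail message, so read properties defensively.
        fields = [
            ("From", getattr(msg, "SenderName", "")),
            ("To", getattr(msg, "To", "")),
            ("Date", str(getattr(msg, "ReceivedTime", ""))),
            ("Subject", getattr(msg, "Subject", "")),
        ]
        header = "\n".join("%s: %s" % pair for pair in fields)
        body = getattr(msg, "Body", "") or ""
        path = os.path.join(out_dir, "msg%06d.txt" % n)  # hypothetical naming scheme
        with open(path, "w", encoding="utf-8") as f:
            f.write(header + "\n\n" + body)
        paths.append(path)
    return paths
```

On a machine with Outlook, dump_messages(outlook_inbox_items(), "mail") would leave one text file per message in a mail directory, ready to hand to an indexer. Keeping the header names in the file text is what lets words like 'from' and 'date' show up as searchable terms.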

This was a fun project that gave me a chance to explore three different technologies: the Lucene search engine, Jython, and Python's MAPI interface. As I learned this morning, my closing lament -- that the CPython/MAPI and Jython/Lucene halves of this project do not communicate directly -- is somewhat mitigated by the existence of Lupy, a Python port of Lucene. But I think the general point still stands. Must every component be rewritten in every language? Let's not go there.

I'm only somewhat satisfied with the search solution I've cobbled together, by the way. The major challenge so far has been learning when and how to use various Lucene search idioms. For example, I can restrict messages to a date in March like so:

yager AND 03/??/03 -> 20 docs

But watch this:

yager and 03/??/03 -> 777 docs

Evidently 'AND' is a boolean operator, but 'and' is just a noise word that gets dropped. And since Lucene (somewhat annoyingly, to my taste) defaults to OR when no operator joins two terms, this winds up being:

yager OR 03/??/03 -> 777 docs
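
The rewrite from 'and' to OR can be mimicked with a toy query reader. This is only a sketch of the behavior observed above, not Lucene's parser: the interpret function, its stop-word list, and its flat list output are assumptions made for illustration.

```python
STOP_WORDS = {"a", "an", "and", "the", "of", "or", "to"}  # illustrative, not Lucene's list
OPERATORS = {"AND", "OR", "NOT"}


def interpret(query, default_op="OR"):
    """Return the query as a flat list of terms and explicit operators."""
    out = []
    for token in query.split():
        if token in OPERATORS:
            # Case-sensitive check: only the upper-case form is an operator.
            out.append(token)
        elif token.lower() in STOP_WORDS:
            # Noise word: dropped before it can join anything.
            continue
        else:
            if out and out[-1] not in OPERATORS:
                # Two terms with nothing between them get the default operator.
                out.append(default_op)
            out.append(token.lower())
    return out


print(interpret("yager AND 03/??/03"))  # ['yager', 'AND', '03/??/03']
print(interpret("yager and 03/??/03"))  # ['yager', 'OR', '03/??/03']
```

Passing default_op="AND" shows what the second query would mean if AND were the default instead.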

It's harder to generalize the date to 2003:

yager AND ??/??/03 -> org.apache.lucene.queryParser.ParseException

The parser rejects that because you can't begin a term with a wildcard. Splitting on the first digit of the month will work:

yager AND (0?/??/03 1?/??/03) -> 141 docs
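
Rather than typing that by hand, a few lines of Python can build the query. This is a sketch that assumes the mm/dd/yy date format used in the examples above. The helper names are made up for illustration, and fnmatch stands in for Lucene's single-character '?' wildcard only to check that the two patterns cover the year.

```python
from fnmatch import fnmatchcase


def year_patterns(yy):
    """Wildcard terms that together cover every mm/dd date in year yy."""
    # Months run 01-12, so the first digit is always 0 or 1. Pinning it
    # avoids a term that starts with a wildcard.
    return ["%d?/??/%s" % (first_digit, yy) for first_digit in (0, 1)]


def year_query(term, yy):
    """Combine a search term with the year patterns into one query string."""
    return "%s AND (%s)" % (term, " ".join(year_patterns(yy)))


def covered(date, yy):
    """True if an mm/dd/yy date string matches one of the year patterns."""
    return any(fnmatchcase(date, p) for p in year_patterns(yy))


print(year_query("yager", "03"))  # yager AND (0?/??/03 1?/??/03)
print(covered("12/25/03", "03"))  # True
print(covered("03/14/02", "03"))  # False
```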

But that's getting pretty darned geeky. Lucene also supports proximity search, but it's a subtle thing as well. Consider:

"from date mcalister dickerson"~20 -> 8 docs

This is a nicely fuzzy search in which the ~20 specifies a 20-word window, and 'from' and 'date' bind that window to the message header. In Outlook, apart from it being ungodly slow to search for messages where Matt McAlister and Chad Dickerson appear in To: or From: headers, I'd have to be too specific -- i.e., From: Matt, or To: Chad. On the other hand, proximity is a tricky thing:

"from date mcalister dickerson"~30 -> 8 docs
"from date mcalister dickerson"~40 -> 12 docs
"from date mcalister dickerson"~50 -> 17 docs

What's the "right" amount of fuzziness? And then there's this:

"from date mcalister dickerson" -> 0 docs

No docs are found because, without the ~N, this is an exact phrase query, and that literal string does not appear anywhere.
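
To see why the window size matters so much, here is a toy model of the "N-word window" reading of proximity search. It is a sketch, not Lucene's algorithm (Lucene's own slop calculation differs in detail), and the within_window function and the sample header are invented for illustration.

```python
import re


def within_window(text, terms, window):
    """True if every term occurs inside some run of `window` consecutive words."""
    words = re.findall(r"\w+", text.lower())
    wanted = {t.lower() for t in terms}
    for start in range(len(words)):
        # Which of the wanted terms fall inside this slice of the text?
        seen = wanted.intersection(words[start:start + window])
        if seen == wanted:
            return True
    return False


# Made-up message header, in the style of an extracted mail file.
header = ("From: Matt McAlister\n"
          "To: Jon Udell; Chad Dickerson\n"
          "Date: 03/14/03\n"
          "Subject: lunch")
terms = ["from", "date", "mcalister", "dickerson"]

print(within_window(header, terms, 20))  # True
print(within_window(header, terms, 4))   # False
```

In this sample the four terms span nine words, so any window of nine or more matches and anything smaller does not. A longer To: line would push 'date' further from 'from' and demand a bigger window, which is one way the counts can climb as the number after the tilde grows.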

I'm the kind of person who'll play around with these variations, but in general, people expect not to have to. The Web has trained us, rightly, to expect that we just type in a word or two and get the "right" answer. I don't know what the stats are on use of Google's advanced search, or any advanced search, but my gut tells me such features are rarely used.

I used to think the answer was to standardize on query syntax. Now I think that might help some, but not much. More fruitful, perhaps, would be to use multiple search strategies in parallel, suggest "best" outcomes, and factor the user's choices into future determinations of "best."

For years now, we've been able to find things on the Web more easily than we can find things in our own personal data stores. There's a huge opportunity, and a huge need, to swing that pendulum back toward the center.


Former URL: http://weblog.infoworld.com/udell/2003/05/14.html#a690