Jon Udell: The future of the file system

Tangled in the Threads
Jon Udell, May 30, 2001
The future of the file system

Unified namespaces, two-way linking, and context management

The "semantically impoverished" file system pulls us in with an inescapable gravitational force. Let's admit that we live there, and make it a better place.

In a column a few weeks back on document engineering, I cited some newsgroup discussion about granular addressing and naming of document elements. The thread, which has since continued, keeps coming back to the central issue of namespace management:

Rich Kilmer:

The person who has done a lot of work in namespace management is Hans Reiser (http://www.namesys.com). He is the author of the ReiserFS filesystem for Linux. His paper (http://www.namesys.com/whitepaper.html) goes into good detail on namespace issues -- really good detail!

Rich co-founded Roku Technologies, a company that rode the recent P2P wave with a product that made the existing email software on always-connected desktop PCs available for remote access. In a recent Forbes article, Ann Winblad said: "Startups such as Roku and Groove Networks have emerged to deliver frameworks to build other new P2P applications. With Roku's products, I can use my wireless device to access the files on my desktop (or yours) from anywhere." Despite excellent press, and some impressive bundling deals, Roku is now gone, another victim of the current downdraft.

As Rich contemplates his next move, he's been enriching our newsgroup with some of the fruits of his six years of research and development at Roku. I was particularly intrigued by his reference to Hans Reiser's whitepaper. Though I've been aware of the ReiserFS, I had mentally categorized it as "a high-performance journaling filesystem for Linux." It was a revelation to discover that, for Reiser, this technology is just a means to an end -- namely, turning the filesystem into the kind of object database that can help us model the real-world activities we engage in when we create, store, exchange, and search for information.

Reiser's paper is imbued with a vision that goes far beyond the nitty-gritty details of inodes and indexes. Here are some bits that resonate powerfully for me:

Hans Reiser:

Information owners tend to think of the cost of access as only subtracting from the value of their information, but it does much worse: it divides it. Three seconds rather than 1/3 second of access time means that the same information will spread to an order of magnitude fewer people, be used by them an order of magnitude fewer times, and be an order of magnitude less useful to the organization as a whole.

...time spent accessing rather than reading information detracts from our ability to wander speculatively after information that might be useful. The quality of the name space design determines these costs.

... most of the time the employees ... store most of their data in flat files in the semantically impoverished filesystem: the greater connectivity pulls them there.

We must avoid adding structure, or guarantee that the user will be informed of all structure relevant to his partial information... What's needed is a naming system intended to reflect just the structure inherent in the information, whatever that structure might be, rather than restructuring the information to fit the naming system.

Of course you have to wrestle with inodes and indexes in order to have some hope of reaching this lofty goal. Programmers love to create lots of little files. This was, for example, what drove Gordon Letwin to create HPFS (OS/2's advanced filesystem): he wanted to be able to store thousands of email messages as files in a directory. Making that kind of storage work efficiently, though, is a major engineering challenge.

If you can lick the performance problem, the real problem emerges:

Rich Kilmer:

What do you store in the namespace to allow applications to cross each others' borders? An agreed-upon ontology is necessary to move beyond today's mess.

Reiser's examples are instructive. They blend namespaces that are hierarchical, relational, and -- for lack of a better word -- purely contextual. Consider an email message that's associated with three key terms in the following way:
/path/myEmail/projectA[to: rich, subject: taskB, from: jon, ...personC...]
Here projectA is a part of the filesystem namespace, taskB belongs to a pseudo-relational namespace, and personC is a term in a contextual (that is, full-text-indexed) namespace. Reiser points out, correctly, that today's systems require users to:

Identify the kind of namespace to which each term belongs.

Identify the tool that operates on that namespace.

Apply each different tool to each kind of namespace, then combine the results.

What he imagines, instead, is a storage system (and a syntax) that unifies these namespaces so that all information tools can easily and naturally work across them.

Imagine a filesystem directory that is specialized to store email messages. It indexes RFC822 headers one way, message bodies another way, and merges these two namespaces with the path namespace.

Right now, my email inbox illustrates the problem this scheme would solve. I'm forced to distinguish between "term appears in pathname," "term appears in subject," and "term appears in body." Worse, the subject and body fields are available only within the email application. If it could store messages in a filesystem that had a hybrid hierarchical/relational/fulltext capability, the metadata and the data associated with the messages would be available not only to the email program itself, but systemwide.
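To make the idea concrete, here is a minimal sketch in Python of a query that does not care which namespace a term lives in. It is not Reiser's design or syntax. The `Message` class, the `matches` and `query` functions, and the sample data are all hypothetical, and a real filesystem would use indexes rather than the linear scan shown here.

```python
from dataclasses import dataclass


@dataclass
class Message:
    path: str                 # hierarchical namespace, e.g. "/path/myEmail/projectA"
    headers: dict[str, str]   # pseudo-relational namespace (RFC822-style fields)
    body: str                 # contextual (full-text) namespace


def matches(msg: Message, term: str) -> bool:
    """Return True if the term appears in any of the three namespaces."""
    t = term.lower()
    in_path = t in [part.lower() for part in msg.path.split("/")]
    in_headers = any(t in value.lower().split() for value in msg.headers.values())
    in_body = t in msg.body.lower().split()
    return in_path or in_headers or in_body


def query(store: list[Message], *terms: str) -> list[Message]:
    """Return the messages that match every term, whatever namespace it is in."""
    return [m for m in store if all(matches(m, t) for t in terms)]


# Hypothetical sample data modeled on the projectA/taskB/personC example.
store = [
    Message("/path/myEmail/projectA",
            {"to": "rich", "from": "jon", "subject": "taskB"},
            "please loop in personC before friday"),
    Message("/path/myEmail/projectZ",
            {"to": "rich", "from": "jon", "subject": "lunch"},
            "noon works for me"),
]

hits = query(store, "projectA", "taskB", "personC")
print([m.path for m in hits])  # ['/path/myEmail/projectA']
```

The caller supplies three terms and never says which one is a path component, which is a header value, and which is a word in the body. That is the burden the unified namespace is meant to lift.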

Here's how Dominic Amann envisions it:

Dominic Amann:

Start with an efficient filesystem that allows small files (such as ReiserFS). Then add an OS browser/shell level extension that allows each folder to contain a special object. This object is a viewer/filesystem "plugin" that tells the shell/browser which indexes are available for the folder, and the shell/browser can decide how to display them.

This would allow e-mail to be viewed by a variety of programs, and searchable/useable even by non-email apps because, for example, /var/spool/mail/dominic/ appears to contain

./thread
./subject
./date
./to
./from
./keywords
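A rough sketch of how such index views might behave, written in Python rather than as a real shell or filesystem extension. The message file names, the attribute values, and the `build_views` and `list_virtual` helpers are hypothetical. Only the idea of per-folder index "subfolders" comes from Dominic's description.

```python
from collections import defaultdict

# Hypothetical messages in one mail folder: file name -> extracted attributes.
messages = {
    "0001.msg": {"from": "jon", "to": "rich", "subject": "namespaces"},
    "0002.msg": {"from": "rich", "to": "jon", "subject": "namespaces"},
    "0003.msg": {"from": "dominic", "to": "jon", "subject": "plugins"},
}


def build_views(msgs, indexed_fields):
    """Build one virtual subfolder per indexed field: field -> value -> file names."""
    views = {name: defaultdict(list) for name in indexed_fields}
    for filename, attrs in sorted(msgs.items()):
        for name in indexed_fields:
            if name in attrs:
                views[name][attrs[name]].append(filename)
    return views


def list_virtual(views, virtual_path):
    """List a virtual path such as '.', './from' or './from/jon'."""
    parts = [p for p in virtual_path.split("/") if p not in ("", ".")]
    if not parts:
        return sorted(views)          # the available indexes
    index = views[parts[0]]
    if len(parts) == 1:
        return sorted(index)          # the distinct values in that index
    return list(index.get(parts[1], []))  # the messages filed under one value


views = build_views(messages, ["from", "to", "subject"])
print(list_virtual(views, "."))                     # ['from', 'subject', 'to']
print(list_virtual(views, "./from"))                # ['dominic', 'jon', 'rich']
print(list_virtual(views, "./subject/namespaces"))  # ['0001.msg', '0002.msg']
```

Any program that can list a directory could browse mail this way without understanding the mail format.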

Alexander G.M. Smith:

It seems that half of that is already in BeOS's file system. E-mails are a special class of text files (text/x-email), with extra attributes (sender, date, subject, etc) extracted from the message and stored in file attributes. There are indices for the more frequently searched of those attributes, so you can quickly find all messages from someone, or in a certain date range, or whatever else you can express in a query string. The mail daemon automatically sets up those attributes when it receives a message, independent of whatever program you are using to manipulate messages (which doesn't have to be an "e-mail" program). The missing half is that it doesn't index the message body.

Now that you mention ReiserFS, it sounds like a good candidate for porting to BeOS. BFS isn't as fast as it could be for lots of small files (because of all the index updating), and the rest of BeOS can make good use of attributes, so ReiserFS could be a useful improvement. The hard part is writing the query language parser/interpreter.

The next step would be to make it non-hierarchical. As mentioned elsewhere, you want bidirectional relationships between a phone number and the person, so a cyclic (not acyclic!) graph structure of relationships would be needed. Of course, some commands -- like "ls -R" -- would need to be improved to handle cyclic directories.
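The "ls -R" problem is the familiar one of traversing a graph that may contain cycles. A minimal sketch in Python, assuming the directory structure is represented as a plain dictionary (the `links` data and the `walk` function are hypothetical, not part of BeOS or ReiserFS): remember what has already been visited and refuse to descend into it again.

```python
# Hypothetical cyclic "directory" graph: each name maps to the names it links to.
links = {
    "root": ["people/jon", "notes.txt"],
    "people/jon": ["phones/555-0100"],
    "phones/555-0100": ["people/jon"],   # the back link creates a cycle
    "notes.txt": [],
}


def walk(graph, start):
    """Recursive listing that visits each node once, even when links form cycles."""
    seen = set()
    order = []

    def visit(node):
        if node in seen:
            return              # already listed; stop instead of looping forever
        seen.add(node)
        order.append(node)
        for child in graph.get(node, []):
            visit(child)

    visit(start)
    return order


print(walk(links, "root"))
# ['root', 'people/jon', 'phones/555-0100', 'notes.txt']
```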

Two-way linking

Alexander's reference to "bidirectional relationships" touches on another theme explored in this fascinating thread.

Rich Kilmer:

I like the idea that a folder contains a single type of object. This is not necessarily useful for the end user, but from a system perspective, having a single type of object per folder would allow excellent optimizations for indexing, searching, etc.

You also want to bidirectionally link objects/files together, and semantically encode how they interrelate. An email message within a certain folder could be linked to a contact within another folder as sender-of-email; a spreadsheet in another folder could be linked to the same email as an attachment.

This would be a great way to standardize information management, especially on the PIM level.

Mark Wilcox:

And then we would have re-invented the Mac OS resource fork. No, this is a bad idea. You're forcing people to become slaves of their machines. Folders ideally shouldn't even be part of the filesystem, they should be totally virtual groupings. If I want to keep all of my millions of files in the root, I should be able to. Maybe that's a mess, but at least it's my mess.

Rich Kilmer:

No...I'm not advocating the resource fork. I'm advocating using metadata structures to validate data placed within them, and then leveraging all this to achieve massive interconnection. Go ask your computer to give the metadata about every Person or Organization stored within it. The problem is, the computer does not understand Person as a thing.

In my view, an Email Box, a Telephone and a Person are all distinct objects stored as separate files within this hypothetical filesystem. And they are linked. The Email Box is the home email of the Person. Email Messages are objects as well (linked to boxes, not people). I realize this is way granular, but in my experience of designing information models, more granularity in modeling real-world objects leads to more consistency in the cognitive experience of the user.

It turns out that the number of these relationships in common use is fewer than 300. But between any two objects, it's probably fewer than 10.

Rich concludes that it's practical, and extremely useful, to instantiate these well-defined relationships as bidirectional links. That works well for retrieval because we are, after all, associative thinkers.
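Here is one way the bookkeeping might look, as a sketch in Python and not a description of Roku's implementation. The `LinkStore` class and the object identifiers are hypothetical. The relation "the home email of" comes from Rich's description, while "home email" is an assumed name for its inverse. The point is that a link and its inverse are recorded together, so retrieval works from either end.

```python
from collections import defaultdict

# Each relation is declared together with the name of its inverse.
# "home email" is an assumed name for the inverse of "the home email of".
INVERSES = {
    "home email": "the home email of",
}


class LinkStore:
    def __init__(self, inverses):
        # Make the inverse table symmetric so either name can be used to link.
        self.inverse = dict(inverses)
        self.inverse.update({back: fwd for fwd, back in inverses.items()})
        self.links = defaultdict(set)   # object -> {(relation, other object)}

    def link(self, source, relation, target):
        """Record the forward link and its inverse in a single step."""
        self.links[source].add((relation, target))
        self.links[target].add((self.inverse[relation], source))

    def related(self, obj, relation):
        """Return the objects reachable from obj through the given relation."""
        return sorted(t for r, t in self.links[obj] if r == relation)


store = LinkStore(INVERSES)
store.link("EmailBox:jon-home", "the home email of", "Person:jon")

print(store.related("Person:jon", "home email"))
# ['EmailBox:jon-home']
print(store.related("EmailBox:jon-home", "the home email of"))
# ['Person:jon']
```

Because the set of relation types is small and declared up front, an unknown relation fails immediately with a `KeyError` instead of silently creating a one-way link.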

This doesn't mean we can or should abolish hierarchy. The brain may work associatively, but we have no natural mechanisms (that we're aware of) for storage reorganization or data transfer. Yet that's just what we constantly do with our electronic data: rearrange it, package it up and move it from place to place. The hierarchical properties of the file system help us meet these requirements. One way to encode the links between a message and its context is simply -- as Dominic suggests -- to nest the link objects inside the message's folder. Now the mailbox can be backed up, compressed, and transferred.

But as Rich points out, the contains relationship is not the only way, and frequently not the best way, to model the real world. Reality looks more like a web of associations.

Rich Kilmer:

For each real world object, create a metadata class. The key to success here is not to create compound objects. People don't have phone numbers, phones do, so create a Person object and a Telephone object. Link them semantically in a bi-directional fashion:

Person -> home phone -> Telephone
Telephone -> the home phone of -> Person

Build up higher level collections of these object/links and you have the basis for what we all want...a semantic web.

Finally, allow people to continue to work in their existing tools (Outlook, Netscape, files, bookmarks, projects/tasks), while automatically mapping to and from the new metadata model. This compatibility is very important...if you do not do this the system will never succeed.

Who, What, When, Where, Why, How

At the heart of Roku was a context engine. Here's why we need such a thing, and how we might use it:

Rich Kilmer:

In the digital world, we have so many tools at our disposal that we cannot learn them fast enough to create the mapping necessary to translate our goals into action. Could the computer help with this translation between goals and tools?

The point of the context engine was not simply to allow the person to organize their information for the sake of organization, but to create a system that enables the computer to help them achieve their goals through that organization.

To do that you need an organization of information...an ontology. We came up with six high-level categories:

Who - Entities (People, Groups, Organizations, etc)

What - Content (Documents, Web Pages, Books, Music, Video, etc)

When - Events (Meetings, Reminders, etc)

Where - Locations (Street Addresses, Web Sites, Rooms, Buildings, etc)

Why - Goals (Actions, To Dos, Projects, Project Tasks, etc)

How - Resources (Roles, Software/Hardware Tools, Morphological tools--like folders, topics and searches, user interfaces, etc)
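As a small illustration, the six categories can be written down as a data structure that software could consult. This is only a sketch in Python and not Roku's context engine. The `Category` enum, the `TYPE_TO_CATEGORY` table, and the sample items are hypothetical, though the category names and object types are taken from the list above.

```python
from enum import Enum


class Category(Enum):
    WHO = "Entities"
    WHAT = "Content"
    WHEN = "Events"
    WHERE = "Locations"
    WHY = "Goals"
    HOW = "Resources"


# A few of the object types named in the list, filed under their category.
TYPE_TO_CATEGORY = {
    "Person": Category.WHO,
    "Document": Category.WHAT,
    "Meeting": Category.WHEN,
    "Street Address": Category.WHERE,
    "Project": Category.WHY,
    "Role": Category.HOW,
}


def group_by_category(items):
    """Group (name, object type) pairs by the question they answer."""
    grouped = {category: [] for category in Category}
    for name, object_type in items:
        grouped[TYPE_TO_CATEGORY[object_type]].append(name)
    return grouped


# Hypothetical items, for illustration only.
context = group_by_category([
    ("Rich Kilmer", "Person"),
    ("namespace whitepaper", "Document"),
    ("design review", "Meeting"),
])
print(context[Category.WHO])   # ['Rich Kilmer']
print(context[Category.WHY])   # []
```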

Roku began in 1995. It ended in May of 2001, during a time of reckoning for the New Economy and for the technologies driving it. The pendulum had to swing, but when people like Rich Kilmer, or Nelson Minar of Popular Power, write to me saying they've gone out of business -- as both recently have done -- it becomes clear the pendulum has swung too far.

The ideas embodied by Roku will, Rich says, "live again ... very soon...in a different language ... and in open source." I'm glad to hear it! I don't think there's a single way to define the semantic web, or to achieve the effects we imagine it will bring. Experimentation will be required. Our economic system rewards such experimentation unequally. The boom-and-bust cycle doesn't correlate well with long-term goals. I hope it can evolve into something closer to a steady state. In the meantime, be on the lookout for skunkworks projects. The venture capital may have dried up for now, but there's no shortage of important problems to solve, or of smart people trying to solve them.

Jon Udell (http://udell.roninhouse.com/) was BYTE Magazine's executive editor for new media, the architect of the original www.byte.com, and author of BYTE's Web Project column. He is the author of Practical Internet Groupware, from O'Reilly and Associates. Jon now works as an independent Web/Internet consultant. His recent BYTE.com columns are archived at http://www.byte.com/index/threads

This work is licensed under a Creative Commons License.