Tangled in the Threads
Jon Udell, June 27, 2001
Web Namespace Design
Simple patterns ensure portability
Simplicity and rigor are the essential qualities of a durable web namespace
As I was writing last week's column, I checked my homepage for a reference to an earlier column, but the link was broken. Say what? I soon found, as some of you have also found, that a planned migration of BYTE.com (from TechWeb's content management system to Dr. Dobb's CMS) had altered the former namespace. So, for example, the column in question, which has been known to the world as http://www.byte.com/column/threads/BYT20010608S0001, had become http://www.byte.com/documents/BYT20010608S0001/. (The trailing slash in the new namespace is required, by the way, in order to expand the shorter URL to its "real" form: http://www.byte.com/documents/s=705/BYT20010608S0001/index.htm.) The new namespace is internally self-consistent, but extant URLs don't know about it.
Given that much of my written portfolio for the last few years is represented by BYTE.com URLs, which are in wide circulation -- stored on web pages, in search engine indexes, in bookmark files -- this was discouraging news. But, I thought, let's just roll up our sleeves and fix it. After all, the old and new namespaces appeared to be algorithmically related. The protean Apache module mod_rewrite, created by the redoubtable Ralf Engelschall, is solely dedicated to solving these kinds of namespace maintenance problems. Perhaps the solution could be as simple as:
RewriteRule ^(/column/threads)/(.+)$ /documents/$2/ [PT]
Well, it wasn't. I'll explain why not, but first, some history.
A brief history of BYTE.com
BYTE.com began in the spring of 1995, as the research project that motivated my monthly Web Project column. The vast majority of its pages were HTML-ized BYTE articles. We converted those articles into a simple tagged format (this was several years before XML appeared on the radar), and then I wrote the code to convert that format into two electronic publications: the BYTE CD-ROM, and the BYTE.com web archive.
In 1997, I developed a new service for subscribers to BYTE magazine. Our policy had been to withhold, from the website, the most recent three issues of the magazine, so as not to cannibalize print sales. The new service would allow anybody to preview those three issues, and would allow magazine subscribers to view the full contents of those issues. The subscriber number, printed on the magazine's mailing label, was the credential used to authenticate subscribers. (Magazines are convenient that way! No, it didn't work for newsstand purchasers, only subscribers.)
The first step was to develop a script that conditionally delivered either previews or full views based on the absence or presence of a subscriber cookie. (Not coincidentally, you can see this same strategy at work today in Safari.) This was straightforward, but the namespace management problem that it presented was less so. I forget just what the script was called, but let's say it was article.pl. The URL for an article, which had been:
http://www.byte.com/art/9707/sec6/art2.html
was now going to look like:
http://www.byte.com/article.pl?i=9707&s=6&a=2
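The cookie check described above can be sketched roughly as follows. This is a minimal Python illustration, not the original Perl script. The cookie name `subscriber`, the function `choose_view`, and the set `VALID_SUBSCRIBERS` are all hypothetical names introduced for the example.

```python
from http.cookies import SimpleCookie

# Hypothetical store of valid subscriber numbers. The real service checked
# the number printed on the magazine's mailing label.
VALID_SUBSCRIBERS = {"12345678"}


def choose_view(cookie_header: str) -> str:
    """Return "full" for a valid subscriber cookie, otherwise "preview"."""
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    morsel = cookies.get("subscriber")  # cookie name is an assumption
    if morsel is not None and morsel.value in VALID_SUBSCRIBERS:
        return "full"
    return "preview"
```

The point of the sketch is that the decision depends only on who is asking, not on which URL was requested, which is what later makes a single shared namespace possible.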
Since I controlled the archive's internal namespace, it would have been easy to rewrite the whole thing in order to achieve this transformation. But what about all those bookmarks and web pages pointing to http://www.byte.com/art/9707/sec6/art2.html?
There was another matter to consider. Although I was using mod_perl, and was getting really good performance out of article.pl, it seemed wasteful to run this script to deliver every page of the archive. With several years of accumulated content, and only a 3-month window of protected content, the vast majority of pageviews could continue to be served statically -- therefore much more efficiently.
The phrase "statically served" does not have the same negative connotations for me that it does for many people. When you generate an archive by means of a scripted transformation, as I was doing (and still do, for other projects), you can create many of the effects that people assume require dynamic page assembly -- including navigational links, image thumbnails, and "See more like this" controls. And you can radically streamline and simplify the infrastructure needed to pump out those pages. Web servers, after all, are first and foremost file servers, and that function need not (I think should not) become completely vestigial.
So, I hit upon the idea of a mixed namespace. Articles in the 3-month protected window would be served using article.pl, and older stuff would be served using the old filesystem-style format. It would be tricky to rewrite the archive to account for the changing boundary between the protected and open realms, as months progressed, but it would be doable.
The problem still, though, was the external namespace. A subscriber might email an article.pl reference to a nonsubscriber, or a nonsubscriber might email an old-style reference to a subscriber, and in either case there would be frustration.
What I really wanted was for everyone to use the same namespace, but for the behavior (preview vs. full access) to vary by identity (anonymous vs. authenticated). And I wanted to attach this behavior to the existing namespace so that the new scheme would roll out with zero disruption.
The magic of mod_rewrite
The Apache module mod_rewrite is a killer one, i.e. it is a really sophisticated module which provides a powerful way to do URL manipulations. With it you can nearly do all types of URL manipulations you ever dreamed about. The price you have to pay is to accept complexity, because mod_rewrite's major drawback is that it is not easy to understand and use for the beginner. And even Apache experts sometimes discover new aspects where mod_rewrite can help.
In other words: With mod_rewrite you either shoot yourself in the foot the first time and never use it again or love it for the rest of your life because of its power.
- from Apache 1.3 URL Rewriting Guide
In that landmark document, Ralf goes on to explore a series of use cases for mod_rewrite. His write-up was in 1997, and remains today, the best primer on web namespace management that I know of.
In my particular case, I only needed to dip a toe into the waters of the deep magic. I started with these httpd.conf directives:
RewriteEngine on
RewriteRule ^/art/(9705|9706|9707)/sec([0-9]+)/art([0-9]+)\.html$ \
    /article.pl?i=$1&s=$2&a=$3 [R]
Here, the protected window was the three issues from May to July of 1997. URLs within this window would be routed to article.pl, then conditionally served as previews (to nonsubscribers) or full views (to subscribers). If you typed the URL:
http://www.byte.com/art/9707/sec6/art1.html
you'd be redirected (that's what mod_rewrite's [R] flag does) to:
http://www.byte.com/article.pl?i=9707&s=6&a=1
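To make the mapping concrete, here is a small Python model of what such a rule does to a request path. It is only a sketch of the regular-expression logic, not of Apache itself. The pattern captures just the digits of the section and article so that the result matches the example URL above, and the function name `rewrite` is hypothetical.

```python
import re

# Protected window: the May, June, and July 1997 issues.
PROTECTED = re.compile(r"^/art/(9705|9706|9707)/sec([0-9]+)/art([0-9]+)\.html$")


def rewrite(path: str) -> str:
    """Map a protected filesystem-style path to the script-style URL.

    Paths outside the protected window are returned unchanged, which
    corresponds to the server falling through to the static file.
    """
    match = PROTECTED.match(path)
    if match is None:
        return path
    issue, section, article = match.groups()
    return f"/article.pl?i={issue}&s={section}&a={article}"
```

A path such as /art/9601/sec6/art1.html falls outside the alternation, so it passes through untouched and is served as an ordinary file.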
Cool! But, there was still a problem. The protected window was a moving target. Next month, it would be redefined as (9706|9707|9708). What would happen to the URL /article.pl?i=9705&s=4&a=1 after that article moved into the open realm? Of course, the script could issue a redirect back to the canonical filesystem-style URL. But really, it would be better never to have exposed the script-style URL to the world at all. Wouldn't it be great if the filesystem-style URL were the only one that people ever saw, but sometimes (under the covers) it invoked the scripted behavior?
That's trivial for mod_rewrite. You just have to change the [R] flag to [PT] (for Pass Through). Now, if July 1997 is a protected month, /art/9707/sec6/art1.html would silently invoke article.pl. When July 1997 stops being a protected month, the same URL just fetches the indicated file as it normally would, with no script overhead. Users would only see the canonical filesystem namespace in all cases.
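Because the protected window moves every month, the rule itself has to be regenerated. One way this might be automated is sketched below in Python. The helper names `protected_issues` and `rewrite_rule` are hypothetical, the YYMM issue codes follow the examples in the text, and the /article.pl target path is an assumption carried over from those examples.

```python
from typing import List


def protected_issues(year: int, month: int, window: int = 3) -> List[str]:
    """Return YYMM codes for the `window` most recent issues, oldest first."""
    codes = []
    for back in range(window - 1, -1, -1):
        # Count months from year 0 so stepping back across January works.
        total = year * 12 + (month - 1) - back
        y, m = divmod(total, 12)
        codes.append(f"{y % 100:02d}{m + 1:02d}")
    return codes


def rewrite_rule(year: int, month: int) -> str:
    """Build a pass-through RewriteRule line for the current window."""
    issues = "|".join(protected_issues(year, month))
    return (
        f"RewriteRule ^/art/({issues})/sec([0-9]+)/art([0-9]+)\\.html$ "
        "/article.pl?i=$1&s=$2&a=$3 [PT]"
    )
```

Calling `rewrite_rule(1997, 8)` yields a rule whose alternation is (9706|9707|9708), the next month's window mentioned above. Only the server configuration changes from month to month; the public URLs never do.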
This was such a delightful result that it still, years later, gives me pleasure to think about it. I've never thanked Ralf properly for giving this amazing tool to the world, so let me take this opportunity to do it: Thanks, Ralf!
mod_rewrite for IIS
Of course things are rarely as simple as you'd like. In this case, the fly in the ointment was that BYTE.com wasn't running on Linux/Apache, it was running on NT/IIS, the platform on which I began my web career.
During the development of the subscriber version of BYTE.com, it became clear for a number of reasons -- mod_rewrite prime among them -- that I was probably going to be switching to Linux/Apache. But in the spirit of the research project that was the original motivation for the site, I decided to explore two paths in parallel, and to prototype the subscriber site on both NT/IIS and Linux/Apache, then decide which to deploy. I even imagined running both, side by side behind a load balancer, as a real-world benchmark of the two platforms, and I still regret not having had the chance to do that.
The feasibility of the NT/IIS version hinged on the availability of something like mod_rewrite. It didn't exist, and still (to my knowledge) doesn't. Yes, there's Apache for Windows, but I regard this as a specialty item. In the Windows world, IIS is the mainstream web server. The absence of a well-known and robust mod_rewrite-like tool for IIS is, I think, the result of cultural bias. Regular expressions, which are at the core of mod_rewrite, of Perl, and of so much else in the Linux/Unix/Open-Source space, are not equally fundamental in the world of Windows development.
What would it take to recreate mod_rewrite for IIS? Well, a lot. As I've noted, mod_rewrite is a protean piece of software. It's a kind of programming language in its own right, and it's deeply wired into the guts of Apache. There is still, I'm convinced, a need and an opportunity to bring all this functionality to IIS developers. But in our case, a fairly simple ISAPI filter would suffice.
Like Apache, IIS processes requests in phases -- user authentication, HTTP header processing, URL-to-filename mapping. Each phase creates the opportunity for a server extension -- be it an Apache module, or an ISAPI filter -- to alter the server's behavior in some useful way. In this case I needed to hook into the URL mapping phase. In Visual C++, you can do this by way of the ISAPI Extension Wizard. Tell it you want to create a filter that handles URL mapping requests, and it will generate framework code that's ready to handle the SF_NOTIFY_URL_MAP event.
After I determined how to do a simple mapping, I handed the project off to my associate, Dave Rowell, who linked in Henry Spencer's regular expression library. The feasibility test worked. We were able to achieve the same common-namespace effect on IIS as on Apache. I can't show you exactly how, because I've since lost that code. But in any case, it was only a special-purpose hack. A more general mod_rewrite-like solution would have been (and I think still is) of much greater interest. Perhaps such a thing does exist. If you know of one, do drop by my newsgroup and tell us about it. Some cultural cross-fertilization, with respect to regex-based web namespace management, would be a good thing.
So what became of the subscriber-access version of BYTE.com? It was built, tested, and ready to deploy (on Linux/Apache, by the way), when the axe fell on BYTE Magazine in 1998. I left it running on a machine in BYTE's Peterborough, NH, office, and I never saw it again. C'est la vie.
BYTE.com, Acts II and III
The original BYTE.com archive reappeared, in a somewhat mutilated form, as part of CMP's TechWeb. I went on to do other things. Then, in 1999, I was invited to bring my newsgroups and weekly column back to BYTE.com. This was Act II, the TechWeb era. During this era, the site's namespace evolved in new and, unfortunately, less patterned ways:
/columns/hear_this/1999/03/0329hearthis.html
/features/1999/02/dernpentium.html
/special/PC_Expo_99/pictures2/Memorystickcards.html
/column/hands_on/BYT20000329S0009
/column/threads/BYT20010608S0001
/threads/BYT20000408S0003
/column/BYT20000426S0006
/previous/columns_nov99.html
/feature/BYT20000420S0001
/features/1999/06/0628Plot_Gates4.html
There are patterns here, but you can't express them in a single, and simple, rule. This, as it turns out, became a problem when the curtain came down last week on Act II, and came up on Act III. If you check, you'll see that the new version of BYTE.com is running on Linux/Apache. In theory, it could use mod_rewrite to map the old namespace. But the complexity of that namespace was its downfall.
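A rough Python sketch shows why one rule is not enough. The only old-to-new mapping stated in this column is /column/threads/BYT20010608S0001 to /documents/BYT20010608S0001/. Extending that mapping to every URL ending in a BYT-style ID is an assumption made for illustration, and the function name `map_old_url` is hypothetical.

```python
import re
from typing import Optional

# One or two path segments followed by an ID shaped like BYT20010608S0001.
BYT_ID = re.compile(r"^/(?:[A-Za-z_]+/){1,2}(BYT[0-9]{8}S[0-9]{4})/?$")


def map_old_url(path: str) -> Optional[str]:
    """Return the assumed new-style path, or None when no pattern applies."""
    match = BYT_ID.match(path)
    if match:
        # Assumption: every BYT-style ID lives under /documents/ID/.
        return f"/documents/{match.group(1)}/"
    # Date-and-filename URLs carry no ID, so no algorithmic mapping is known.
    return None
```

The ID-bearing URLs collapse into a single pattern, but paths such as /features/1999/02/dernpentium.html return None. Each of those families would need its own rule, or a lookup table, and every added rule is one more thing to test against the rules already in place.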
I've talked to the new webmaster. He's a good guy, and he's got a huge job. Integrating the TechWeb version of BYTE.com into the DDJ infrastructure was only a small part of it. There are about a dozen other websites in his stable. He does, in fact, use mod_rewrite for namespace management. But life is short, and there just wasn't time to fully analyze the Act II BYTE.com namespace, or debug any side effects that might occur as a result of mixing a bunch of new rewrite rules together with a bunch of pre-existing ones. I can't pretend I'm pleased with the outcome. On the other hand, I like the DDJ guys, understand what had to happen and why (they weren't thrilled, either), and plan to collaborate fruitfully with them going forward.
So what about all those broken links? I'm trying to be philosophical. I always strive to make web namespaces be immutable and immortal. The other great primer on web namespace management is Cool URIs don't change, a W3C document. Here's a snippet:
- "I didn't think URLs have to be persistent - that was URNs."
This is the probably one of the worst side-effects of the URN discussions. Some seem to think that because there is research about namespaces which will be more persistent, that they can be as lax about dangling links as they like as 'URNs will fix all that'. If you are one of these folks, then allow me to disillusion you.
Absolutely right. The whole URI/URN/URL taxonomy has always struck me as a tad esoteric. The author of this document nails the real practical issue. You can, and should, design URLs, and one of the design constraints is durability.
You can see the W3C itself wrestling with these issues on its site, not always successfully. But the points raised in this primer are excellent. Some attributes, such as category, really are intrinsic to a URL. Others -- notably access mechanism, owner/maintainer, file type -- are not. Factoring out what doesn't belong in the URL is not rocket science. It's an excellent discipline, and it will help URLs be more persistent, whether or not URNs ever emerge from the think tank.
I have cringed more than once at my own failure to implement simple future-proofing strategies. I cringe now, realizing that references to my columns, all over the web, have broken. But the web is resilient. People can, and do, search for and find things that have moved. Newer stuff matters more than older stuff. The older stuff that does matter will resurface.
OK, OK, I admit I'm rationalizing. It's better to avoid this situation if you can. And so the moral of the story is: when you design a web namespace, be as rigorous and as simple as you can. I've written elsewhere about how careful namespace design enables powerful control over search results. The same technique is your best hope for portability.
Author's Note: The NNTP newsgroups which began in Act I, and which during Act II informed many of these columns, remain active. What has disappeared, in Act III, is the web mirror of those newsgroups. We're still sorting out how (or whether) to connect the newsgroups to the web forums system on the new BYTE.com. Meanwhile, here's a reminder as to which NNTP groups exist, and what they're about.
name | subject | NNTP URL
joncon | Jon Udell's projects: Zope, digital IDs, servlets, Perl, Web statistics, whatever Jon's working on lately | news
applications | Applications: what's available, what works, what doesn't | news
databases | Databases: RDBMS vs ODBMS, data warehousing, database development tools, replication | news
chips | CPUs, FPUs, DSPs: the x86 saga, non-x86 technologies, 64-bit processors, multiprocessor issues, MMX | news
networking | Net: ADSL, gigabit Ethernet, ATM, IP multicast, firewalls | news
programming | Programming: data structures, compilers, algorithms, tools, scripting | news
systems | Systems: storage devices, memories, OS configuration issues, peripherals, add-in boards | news
webtech | Web: browsers, servers, protocols, application development strategies | news
Jon Udell (http://udell.roninhouse.com/) was BYTE Magazine's executive editor for new media, the architect of the original www.byte.com, and author of BYTE's Web Project column. He is the author of Practical Internet Groupware, from O'Reilly and Associates. Jon now works as an independent Web/Internet consultant. His recent BYTE.com columns are archived at http://www.byte.com/tangled/
This work is licensed under a Creative Commons License.