Now that I'm accumulating my inbound feeds as XHTML, in order to database and search them, I find myself in the aggregator business, where I never planned to be. The tools I'm using to XHTML-ize my feeds are Mark Pilgrim's incredibly useful ultra-liberal feed parser and the equally useful HTML Tidy, invented by Dave Raggett, and maintained by folks like Charlie Reitzel, one of CMS Watch's Twenty Leaders to Watch in 2004 (along with yours truly).
Today I finally got around to using the ETag and conditional GET (If-Modified-Since) features of Mark Pilgrim's feed parser. (Apologies to my subscribees who, until now, have been treated impolitely by my indexer.) Of the 200+ feeds to which I subscribe, 35 seem not to support either of these two bandwidth-saving techniques, which means they're probably getting battered unnecessarily by feedreaders. The victims are:
http://fieldmethods.net/backend.php
http://groups.yahoo.com/group/syndication/messages?rss=1&viscount=15
http://matt.griffith.com/weblog/rss.xml
http://nhpr.org/view_rss
http://royo.is-a-geek.com/siteFeeder/GetFeed.aspx?FeedId=43
http://safari.oreilly.com/NewOnSafari.asp
http://today.java.net/pub/q/29?cs_rid=47
http://today.java.net/pub/q/weblogs_rss?x-ver=1.0
http://usefulinc.com/edd/blog/rss
http://w3future.com/weblog/rss.xml
http://w3future.com/weblog/staplerFeeds/dubinko.xml
http://www.burtongroup.com/weblogs/jamielewis/rss.xml
http://www.eighty-twenty.net/blog?flav=rss
http://www.eod.com/devil/rss10.xml
http://www.fuzzyblog.com/rss.php?version=2.0
http://www.g2bgroup.com/blog/rss.xml
http://www.gonze.com/index.cgi?flav=rss
http://www.gotdotnet.com/team/dbox/rssex.aspx
http://www.gotdotnet.com/team/tewald/rss.aspx?version=0.91
http://www.intertwingly.net/wiki/pie/RecentChanges?action=rss_rc
http://www.lucidus.net/blog/rss.cfm
http://www.markbaker.ca/2002/09/Blog/index.rss
http://www.mobilewhack.com/index.rss
http://www.neward.net/ted/weblog/rss.jsp
http://www.newsisfree.com/HPE/xml/newchannels.xml
http://www.openlinksw.com/blog/~kidehen/gems/rss.xml
http://www.oreillynet.com/cs/xml/query/q/295?x-ver=1.0
http://www.pepysdiary.com/syndication/rss.php
http://www.photo-mark.com/cgi-bin/rss2.cgi?set_id=16
http://www.pipetree.com/qmacro/xml
http://www.testing.com/cgi-bin/blog/index.rss
http://www.voidstar.com/module.php?mod=blog&op=feed&name=jbond
http://www.xmldatabases.org/WK/blog?t=rss20
http://www.xmlhack.com/rss.php
http://www.zope.org/SiteIndex/news.rss
Update: This list is 15 shorter than it was last night. Greg Reinacker wrote to point out that his feed does emit the ETag header. I checked, and what I originally reported was feeds that were missing one or the other of two different ways to tell the client a feed hasn't changed. But so long as one is in effect, you're OK. Now the list should include only feeds that support neither method, and that as a result cannot return the HTTP '304 Not Modified' response enabling a feedreader to skip an unnecessary fetch of an unchanged feed.
Here's a brief summary of the two methods. First, a site that supports ETag (but not Last-Modified), namely Greg's:
1. First fetch of Greg's feed:

    GET /gregr/weblog/rss.aspx HTTP/1.1

2. Etag response:

    HTTP/1.x 200 OK
    Date: Mon, 02 Feb 2004 14:17:01 GMT
    Server: Microsoft-IIS/6.0
    Etag: "632104748500000000"

3. Second fetch of Greg's feed:

    GET /gregr/weblog/rss.aspx HTTP/1.1
    If-None-Match: "632104748500000000"

4. 304 response:

    HTTP/1.x 304 Not Modified
Now here's a site that supports Last-Modified (but not ETag):
1. First fetch of David's feed:

    GET /index.xml HTTP/1.1
    Host: www.davidgalbraith.org

2. Last-Modified response:

    HTTP/1.x 200 OK
    Server: Zeus/4.2
    Last-Modified: Mon, 02 Feb 2004 02:02:55 GMT

3. Second fetch of David's feed:

    GET /index.xml HTTP/1.1
    If-Modified-Since: Mon, 02 Feb 2004 02:02:55 GMT

4. 304 response:

    HTTP/1.x 304 Not Modified
And finally, here's a site from the list above, supporting neither method:
1. First request:

    GET /syndication/rss.php HTTP/1.1
    Host: www.pepysdiary.com

2. Response includes neither Etag nor Last-Modified:

    HTTP/1.x 200 OK
    Server: Apache/1.3.19 (Unix) PHP/4.0.4pl1
    Transfer-Encoding: chunked
    Content-Type: text/html

3. Second request:

    GET /syndication/rss.php HTTP/1.1
    Host: www.pepysdiary.com

4. Unchanged feed sent again:

    HTTP/1.x 200 OK
    Server: Apache/1.3.19 (Unix) PHP/4.0.4pl1
    Transfer-Encoding: chunked
    Content-Type: text/html
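For anyone who wants to see the client side of these exchanges in code, here is a minimal sketch using only Python's standard library. It is not how Mark Pilgrim's feed parser does it internally; the function names `conditional_headers` and `fetch_if_changed` are hypothetical, made up for this illustration. The idea is just to replay whatever ETag or Last-Modified value came back last time, and to treat a 304 as "nothing new."

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request headers from validators saved on a previous fetch."""
    headers = {}
    if etag:
        # Replay the ETag value exactly as the server sent it, quotes included.
        headers["If-None-Match"] = etag
    if last_modified:
        # Replay the Last-Modified date string unchanged.
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (status, body, etag, last_modified); body is None on a 304."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request) as response:
            return (
                response.status,
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # urllib reports 304 as an HTTPError; the feed is unchanged,
            # so keep the validators we already had.
            return 304, None, etag, last_modified
        raise
```

A caller would store the returned ETag and Last-Modified values alongside each feed and pass them back on the next poll. For a feed that sends neither header, both stay `None`, no conditional headers go out, and every poll downloads the whole feed again, as in the third trace.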
If you're curious about which of these cases applies to your feed, one way to check is to use Mozilla's LiveHTTPHeaders extension, which is in fact how I took these snapshots.
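If you'd rather check from a script, the sorting I did above can be sketched as a small function over a response's headers. This is an illustration only; `classify_feed` is a hypothetical name, and the sample headers are the ones from the snapshots above. One caveat: the presence of a header only suggests support. The real proof is whether the server answers the follow-up conditional request with a 304.

```python
def classify_feed(headers):
    """Name which change-detection method a 200 response advertises."""
    # HTTP header names are case-insensitive, so normalize before looking.
    names = {name.lower() for name in headers}
    has_etag = "etag" in names
    has_last_modified = "last-modified" in names
    if has_etag and has_last_modified:
        return "both"
    if has_etag:
        return "etag"
    if has_last_modified:
        return "last-modified"
    return "neither"


# Headers taken from the three snapshots above.
greg = {"Server": "Microsoft-IIS/6.0", "Etag": '"632104748500000000"'}
david = {"Server": "Zeus/4.2", "Last-Modified": "Mon, 02 Feb 2004 02:02:55 GMT"}
pepys = {"Server": "Apache/1.3.19 (Unix) PHP/4.0.4pl1", "Content-Type": "text/html"}

for name, headers in [("greg", greg), ("david", david), ("pepys", pepys)]:
    print(name, classify_feed(headers))
```

Only feeds that come out as "neither" belong on the list of victims; either header alone is enough to make a 304 possible.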
Former URL: http://weblog.infoworld.com/udell/2004/02/01.html#a904