RSS self-defense

Now that I'm accumulating my inbound feeds as XHTML, in order to database and search them, I find myself in the aggregator business, where I never planned to be. The tools I'm using to XHTML-ize my feeds are Mark Pilgrim's incredibly useful ultra-liberal feed parser and the equally useful HTML Tidy, invented by Dave Raggett, and maintained by folks like Charlie Reitzel, one of CMS Watch's Twenty Leaders to Watch in 2004 (along with yours truly).

Today I finally got around to using the ETag and conditional GET (If-Modified-Since) features of Mark Pilgrim's feed parser. (Apologies to my subscribees who, until now, have been treated impolitely by my indexer.) Of the 200+ feeds to which I subscribe, 35 seem not to support either of these two bandwidth-saving techniques, which means they're probably getting battered unnecessarily by feedreaders. The victims are:

http://fieldmethods.net/backend.php
http://groups.yahoo.com/group/syndication/messages?rss=1&viscount=15
http://matt.griffith.com/weblog/rss.xml
http://nhpr.org/view_rss
http://royo.is-a-geek.com/siteFeeder/GetFeed.aspx?FeedId=43
http://safari.oreilly.com/NewOnSafari.asp
http://today.java.net/pub/q/29?cs_rid=47
http://today.java.net/pub/q/weblogs_rss?x-ver=1.0
http://usefulinc.com/edd/blog/rss
http://w3future.com/weblog/rss.xml
http://w3future.com/weblog/staplerFeeds/dubinko.xml
http://www.burtongroup.com/weblogs/jamielewis/rss.xml
http://www.eighty-twenty.net/blog?flav=rss
http://www.eod.com/devil/rss10.xml
http://www.fuzzyblog.com/rss.php?version=2.0
http://www.g2bgroup.com/blog/rss.xml
http://www.gonze.com/index.cgi?flav=rss
http://www.gotdotnet.com/team/dbox/rssex.aspx
http://www.gotdotnet.com/team/tewald/rss.aspx?version=0.91
http://www.intertwingly.net/wiki/pie/RecentChanges?action=rss_rc
http://www.lucidus.net/blog/rss.cfm
http://www.markbaker.ca/2002/09/Blog/index.rss
http://www.mobilewhack.com/index.rss
http://www.neward.net/ted/weblog/rss.jsp
http://www.newsisfree.com/HPE/xml/newchannels.xml
http://www.openlinksw.com/blog/~kidehen/gems/rss.xml
http://www.oreillynet.com/cs/xml/query/q/295?x-ver=1.0
http://www.pepysdiary.com/syndication/rss.php
http://www.photo-mark.com/cgi-bin/rss2.cgi?set_id=16
http://www.pipetree.com/qmacro/xml
http://www.testing.com/cgi-bin/blog/index.rss
http://www.voidstar.com/module.php?mod=blog&op=feed&name=jbond
http://www.xmldatabases.org/WK/blog?t=rss20
http://www.xmlhack.com/rss.php
http://www.zope.org/SiteIndex/news.rss

Update: This list is 15 shorter than it was last night. Greg Reinacker wrote to point out that his feed does emit the ETag header. I checked, and what I originally reported was feeds that were missing one or the other of two different ways to tell the client a feed hasn't changed. But so long as one is in effect, you're OK. Now the list should include only feeds that support neither method, and that as a result cannot return the HTTP '304 Not Modified' response enabling a feedreader to skip an unnecessary fetch of an unchanged feed.

Here's a brief summary of the two methods. First, a site that supports ETag (but not Last-Modified), namely Greg's:

1. First fetch of Greg's feed:
 
GET /gregr/weblog/rss.aspx HTTP/1.1
 
2. ETag response:
 
HTTP/1.x 200 OK
Date: Mon, 02 Feb 2004 14:17:01 GMT
Server: Microsoft-IIS/6.0
Etag: "632104748500000000"
 
3. Second fetch of Greg's feed:
 
GET /gregr/weblog/rss.aspx HTTP/1.1
If-None-Match: "632104748500000000"
 
4. 304 response:
 
HTTP/1.x 304 Not Modified
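
The ETag exchange above can be sketched on the client side in Python, using only the standard library. This is a rough illustration rather than how any particular feedreader does it; the function names `etag_headers` and `fetch_with_etag` are made up for the example. Note that `urllib` reports a 304 as an `HTTPError`, so the unchanged case is handled in the `except` branch.

```python
import urllib.error
import urllib.request


def etag_headers(etag=None):
    """Return the extra request headers for a conditional GET by ETag."""
    # Send the validator back verbatim, quotes included.
    return {"If-None-Match": etag} if etag else {}


def fetch_with_etag(url, etag=None):
    """Fetch url; return (status, body, etag). Body is None on a 304."""
    request = urllib.request.Request(url, headers=etag_headers(etag))
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            # Header lookup is case-insensitive, so "Etag" and "ETag" both match.
            return response.status, response.read(), response.headers.get("ETag")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Unchanged: keep the validator we already hold.
            return 304, None, etag
        raise
```

On the first poll you would call `fetch_with_etag(url)` and save the returned ETag; on later polls you pass it back in, and a 304 means the stored copy of the feed is still current.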

Now here's a site that supports Last-Modified (but not ETag):

1. First fetch of David's feed
  
GET /index.xml HTTP/1.1
Host: www.davidgalbraith.org
  
2. Last-Modified response
  
HTTP/1.x 200 OK
Server: Zeus/4.2
Last-Modified: Mon, 02 Feb 2004 02:02:55 GMT
  
3. Second fetch of David's feed
  
GET /index.xml HTTP/1.1
If-Modified-Since: Mon, 02 Feb 2004 02:02:55 GMT
  
4. 304 response
  
HTTP/1.x 304 Not Modified
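
Since a client can't know in advance which of the two methods a given feed supports, one reasonable approach is to remember whichever validators a response carried and send back the matching request headers on the next poll. The sketch below shows only that bookkeeping, with no network code; `remember_validators` and `conditional_headers` are hypothetical names chosen for the example.

```python
def remember_validators(response_headers):
    """Pick out whichever validators a 200 response carried."""
    # Header names are case-insensitive, so normalize before looking them up.
    lowered = {name.lower(): value for name, value in response_headers.items()}
    return {
        "etag": lowered.get("etag"),
        "last_modified": lowered.get("last-modified"),
    }


def conditional_headers(validators):
    """Build request headers for the next poll from the saved validators."""
    headers = {}
    if validators.get("etag"):
        headers["If-None-Match"] = validators["etag"]
    if validators.get("last_modified"):
        # Echo the server's own date string back unchanged.
        headers["If-Modified-Since"] = validators["last_modified"]
    return headers
```

Fed the Last-Modified response shown above, `conditional_headers` yields just the `If-Modified-Since` header; fed an ETag-only response, it yields just `If-None-Match`; fed neither, it yields an empty dict and the next fetch is unconditional.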

And finally, here's a site from the list above, supporting neither method:

1. First request:
  
GET /syndication/rss.php HTTP/1.1
Host: www.pepysdiary.com
  
2. Response includes neither ETag nor Last-Modified
  
HTTP/1.x 200 OK
Server: Apache/1.3.19 (Unix) PHP/4.0.4pl1
Transfer-Encoding: chunked
Content-Type: text/html
  
3. Second request:
  
GET /syndication/rss.php HTTP/1.1
Host: www.pepysdiary.com
  
4. Unchanged feed sent again:
  
HTTP/1.x 200 OK
Server: Apache/1.3.19 (Unix) PHP/4.0.4pl1
Transfer-Encoding: chunked
Content-Type: text/html

If you're curious about which of these cases applies to your feed, one way to check is to use Mozilla's LiveHTTPHeaders extension, which is in fact how I took these snapshots.
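
If you'd rather script the check than read headers in the browser, a small Python sketch along these lines would sort a feed into one of the cases above. It is an approximation: it only looks for the presence of the two response headers, which suggests but doesn't prove that the server will actually answer a conditional request with a 304. The names `classify` and `check_feed` are invented for this example.

```python
import urllib.request


def classify(response_headers):
    """Report which validators appear in a mapping of response headers."""
    # Header names are case-insensitive, so compare in lowercase.
    lowered = {name.lower() for name in response_headers.keys()}
    has_etag = "etag" in lowered
    has_last_modified = "last-modified" in lowered
    if has_etag and has_last_modified:
        return "both"
    if has_etag:
        return "etag"
    if has_last_modified:
        return "last-modified"
    return "neither"


def check_feed(url):
    """Fetch a feed once and classify the headers it sends back."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return classify(dict(response.headers.items()))
```

A feed that comes back as "neither" is a candidate for the list above.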


Former URL: http://weblog.infoworld.com/udell/2004/02/01.html#a904