RSS reader stats re-analyzed

My report on June 3's access_log sparked a fair bit of commentary. I realized today, though, that I neglected to account for the rss.xml requests that were answered with an HTTP 304 ("not modified") response. The RSS community owes Simon Fell a vote of thanks for noticing, about a year ago, that the volume of redundant RSS traffic could be radically reduced by exploiting the HTTP/1.1 ETag and If-None-Match headers. Aggregators promptly jumped on Simon's suggestion, and soon after things got a lot faster and more efficient.
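To make the mechanism concrete, here is a minimal sketch of the decision a server makes when a conditional GET arrives. The function name `respond` and the sample tag values are hypothetical, and the comparison is simplified; real servers handle more cases than this.

```python
def strip_weak(tag):
    """Drop a leading W/ so weak and strong tags compare equal."""
    return tag[2:] if tag.startswith("W/") else tag


def respond(current_etag, if_none_match, body):
    """Return (status, payload) for a GET, honoring If-None-Match.

    current_etag  -- the tag the server currently assigns to the feed
    if_none_match -- the request header value, or None if absent
    """
    if if_none_match is not None:
        # The header may carry several comma-separated tags, or "*".
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        known = {strip_weak(tag) for tag in candidates}
        if "*" in candidates or strip_weak(current_etag) in known:
            return 304, b""      # the client's copy is still current
    return 200, body             # ship the full feed


feed = b"<rss>...</rss>"
print(respond('"abc123"', None, feed)[0])        # first fetch, no tag sent
print(respond('"abc123"', '"abc123"', feed)[0])  # tag matches
print(respond('"def456"', '"abc123"', feed)[0])  # feed changed since last fetch
```

The aggregator's half of the bargain is to remember the ETag from the last 200 response and send it back in If-None-Match on the next poll.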

From a log analysis perspective, the result is that an RSS request can look like this:

"GET /udell/rss.xml HTTP/1.1" 304 0 "http://ranchero.com/software/netnewswire/" "NetNewsWire Lite/1.0.2 (Mac OS X)"

Or this:

"GET /udell/rss.xml HTTP/1.1" 200 39861 "http://ranchero.com/software/netnewswire/" "NetNewsWire Lite/1.0.1 (Mac OS X)"

In the former case, the newsreader was informed that there was no change to the RSS file since the last time it was fetched. In the latter case, there was a change, and 39861 bytes of RSS data were shipped to the client.
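For anyone who wants to separate the two cases in their own logs, here is a rough sketch of a parser for the log-line fragments quoted above. The regular expression and the names `parse_tail` and `is_fresh_rss` are my own assumptions for illustration; they cover only the quoted-request, status, bytes, referer, and user-agent fields.

```python
import re

# Matches the tail of a log line in the shape shown above:
# "METHOD path protocol" status bytes "referer" "user-agent"
TAIL = re.compile(
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_tail(line):
    """Return a dict of fields, or None if the line does not match."""
    m = TAIL.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec


def is_fresh_rss(rec):
    """True when a GET for rss.xml actually shipped feed data (a 200)."""
    return (
        rec["method"] == "GET"
        and rec["path"].endswith("/rss.xml")
        and rec["status"] == 200
    )


lines = [
    '"GET /udell/rss.xml HTTP/1.1" 304 0 "http://ranchero.com/software/netnewswire/" "NetNewsWire Lite/1.0.2 (Mac OS X)"',
    '"GET /udell/rss.xml HTTP/1.1" 200 39861 "http://ranchero.com/software/netnewswire/" "NetNewsWire Lite/1.0.1 (Mac OS X)"',
]
for line in lines:
    rec = parse_tail(line)
    print(rec["status"], rec["size"], is_fresh_rss(rec))
```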

For my 55354 log entries on June 3, the big picture looks like this:

         total requests:  55354
   requests for rss.xml:  19165
   for rss.xml, non-304:   6986

Here's a more detailed breakdown:

               rss.xml    http 200              http 304
              requests   responses  % of 200s  responses  % of 304s
total            19165        6986        36%      12179        64%
newswire          4280        1179        28%       3101        72%
sharpreader       2941         506        17%       2435        83%
newsgator         1105         189        17%        916        83%
feedreader        1128         640        57%        488        43%
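A breakdown like this can be produced with a small tally over (user agent, status) pairs. The sketch below is not the script I used; the substring matching on user-agent strings, the function names, and the "SomeOtherReader" agent are assumptions for illustration.

```python
from collections import Counter

# Row labels from the table, matched as lowercase substrings of the
# user-agent string (an assumption about how agents identify themselves).
FAMILIES = ("newswire", "sharpreader", "newsgator", "feedreader")


def family(agent):
    """Map a user-agent string to one of the table's row labels."""
    agent = agent.lower()
    for name in FAMILIES:
        if name in agent:
            return name
    return "other"


def tally(records):
    """records: iterable of (user_agent, status) pairs for rss.xml requests."""
    counts = {}
    for agent, status in records:
        counts.setdefault(family(agent), Counter())[status] += 1
    return counts


def percent(part, whole):
    """Whole-number percentage, as in the table."""
    return round(100 * part / whole) if whole else 0


sample = [
    ("NetNewsWire Lite/1.0.2 (Mac OS X)", 304),
    ("NetNewsWire Lite/1.0.1 (Mac OS X)", 200),
    ("SomeOtherReader/0.1", 304),   # hypothetical agent string
]
for name, row in tally(sample).items():
    total = row[200] + row[304]
    print(name, total, row[200], percent(row[200], total), row[304], percent(row[304], total))

# The totals row of the table, recomputed from its raw counts:
print(percent(6986, 19165), percent(12179, 19165))
```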

Clearly Simon's technique is saving everybody a ton of bandwidth.

Does this diminish the importance of RSS readers? It depends what you count. If you subtract the rss.xml requests from the total, there were 36,189 non-RSS requests. But of course, many of these were for images of coffee cups and other UI paraphernalia. The number of requests for HTML pages was only 7752, a number only slightly greater than the 6986 RSS requests that yielded fresh content. That's quite remarkable. And yes, I am curious to see whether and how today's <xhtml:body> experiment might affect that ratio.

Update: More fiddling yields this:

       "views" [1]  unique IPs  adjusted unique IPs [2]  adjusted %
HTML          8410        1306                      996         75%
RSS           6685        2005                     1695         85%

1 For HTML, requests for */ or *.html. For RSS, requests for rss.xml, excluding HEAD requests and 304 responses.

2 For HTML, IPs found in HTML requests and not in RSS requests. For RSS, the inverse.

It appears there are two largely disjoint populations of readers. 75% of the IP addresses that made HTML requests were never seen making RSS requests, and 85% of the IP addresses that made RSS requests were never seen making HTML requests. Another way to look at this is to combine the two lists of IP addresses, dedupe them, and check how many appear in both. The number of distinct IPs in the HTML + RSS combination is 3001. Of these, only 310 appear in both lists.
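The IP comparison boils down to a few set operations. Here is a minimal sketch; the function name `overlap_report` and the sample addresses (taken from the reserved documentation range) are hypothetical, not data from my log.

```python
def overlap_report(html_ips, rss_ips):
    """Compare the IPs seen in HTML requests with those seen in RSS requests."""
    html_ips, rss_ips = set(html_ips), set(rss_ips)
    return {
        "html_only": len(html_ips - rss_ips),  # adjusted unique IPs, HTML
        "rss_only": len(rss_ips - html_ips),   # adjusted unique IPs, RSS
        "both": len(html_ips & rss_ips),       # IPs that appear in both lists
        "distinct": len(html_ips | rss_ips),   # combined and deduped
    }


# Hypothetical sample addresses, duplicates included on purpose.
html = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.1"]
rss = ["192.0.2.3", "192.0.2.4", "192.0.2.5"]
print(overlap_report(html, rss))

# The reported figures hang together the same way:
assert 1306 - 996 == 310          # HTML IPs also seen in RSS requests
assert 2005 - 1695 == 310         # RSS IPs also seen in HTML requests
assert 1306 + 2005 - 310 == 3001  # distinct IPs across both lists
```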


Former URL: http://weblog.infoworld.com/udell/2003/06/08.html#a716