RSS reader stats re-analyzed

My report on June 3's access_log sparked a fair bit of commentary. I realized today, though, that I neglected to account for the rss.xml requests that were answered with an HTTP 304 ("not modified") response. The RSS community owes Simon Fell a vote of thanks for noticing, about a year ago, that the volume of redundant RSS traffic could be radically reduced by exploiting the HTTP/1.1 ETag and If-None-Match headers. Aggregators promptly jumped on Simon's suggestion, and soon after things got a lot faster and more efficient.

From a log analysis perspective, the result is that an RSS request can look like this:

"GET /udell/rss.xml HTTP/1.1" 304 0 "http://ranchero.com/software/netnewswire/" "NetNewsWire Lite/1.0.2 (Mac OS X)"

Or this:

"GET /udell/rss.xml HTTP/1.1" 200 39861 "http://ranchero.com/software/netnewswire/" "NetNewsWire Lite/1.0.1 (Mac OS X)"

In the former case, the newsreader was informed that there was no change to the RSS file since the last time it was fetched. In the latter case, there was a change, and 39861 bytes of RSS data were shipped to the client.
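As a rough sketch of the exchange behind those two log lines, here is a toy version of the server-side decision. The function names and the hash-based ETag scheme are assumptions made for illustration, not how any particular server or aggregator does it. A real implementation also has to handle weak validators and lists of tags in If-None-Match.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    # Hypothetical ETag scheme: a quoted hash of the feed body.
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def respond(body: bytes,
            if_none_match: Optional[str] = None) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload) for a GET of the feed."""
    etag = make_etag(body)
    if if_none_match == etag:
        # The client's cached copy is current: 304, no body shipped.
        return 304, {"ETag": etag}, b""
    # First fetch, or the feed changed: 200 with the full body.
    return 200, {"ETag": etag}, body


feed = b"<rss>...</rss>"
status, headers, payload = respond(feed)                        # first fetch
status_again, _, payload_again = respond(feed, headers["ETag"])  # revalidation
print(status, len(payload), status_again, len(payload_again))
```

The client's only job is to remember the ETag it was given and send it back as If-None-Match on the next poll.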

For my 55354 log entries on June 3, the big picture looks like this:

         total requests:  55354
   requests for rss.xml:  19165
   for rss.xml, non-304:   6986

Here's a more detailed breakdown:

	rss.xml requests	HTTP 200 responses	200s as % of requests	HTTP 304 responses	304s as % of requests
total	19165	6986	36%	12179	64%
newswire	4280	1179	28%	3101	72%
sharpreader	2941	506	17%	2435	83%
newsgator	1105	189	17%	916	83%
feedreader	1128	640	57%	488	43%

Clearly Simon's technique is saving everybody a ton of bandwidth.
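A tally along these lines can be sketched in a few lines of Python. This is a minimal illustration, not the script that produced the numbers above. The regular expression assumes the combined log format tail shown in the two excerpts, and the names `LINE_RE` and `tally_rss` are made up for the example.

```python
import re
from collections import Counter

# Matches the tail of a log line as excerpted above:
# "METHOD path protocol" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def tally_rss(lines, feed_suffix="/rss.xml"):
    """Count parsed requests, rss.xml requests, and the 304 / non-304 split."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # lines that do not parse are skipped, not counted
        counts["total"] += 1
        if not m.group("path").endswith(feed_suffix):
            continue
        counts["rss"] += 1
        if m.group("status") == "304":
            counts["rss_304"] += 1
        else:
            counts["rss_non_304"] += 1
    return counts


# The two log excerpts quoted earlier in this post.
SAMPLE = [
    '"GET /udell/rss.xml HTTP/1.1" 304 0 '
    '"http://ranchero.com/software/netnewswire/" "NetNewsWire Lite/1.0.2 (Mac OS X)"',
    '"GET /udell/rss.xml HTTP/1.1" 200 39861 '
    '"http://ranchero.com/software/netnewswire/" "NetNewsWire Lite/1.0.1 (Mac OS X)"',
]

print(dict(tally_rss(SAMPLE)))
```

Breaking the counts down per reader would mean keying the counter on a substring of the user-agent field as well.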

Does this diminish the importance of RSS readers? It depends on what you count. If you subtract the rss.xml requests from the total, there were 36189 non-RSS requests. But of course, many of these were for images of coffee cups and other UI paraphernalia. The number of requests for HTML pages was only 7752, a number only slightly greater than the 6986 RSS requests that yielded fresh content. That's quite remarkable. And yes, I am curious to see whether and how today's <xhtml:body> experiment might affect that ratio.

Update: More fiddling yields this:

	"views" ¹	unique IPs	adjusted unique IPs ²	adjusted %
HTML	8410	1306	996	75%
RSS	6685	2005	1695	85%

¹ For HTML, requests for */ or *.html. For RSS, requests for rss.xml, excluding HEAD requests and 304 responses.

² For HTML, IPs found in HTML requests and not in RSS requests. For RSS, the inverse.

It appears there are two quite disjoint populations of readers. 75% of the IP addresses that requested HTML pages never requested the RSS feed, and 85% of the IP addresses that requested the RSS feed never requested an HTML page. Another way to look at this is to combine the two deduplicated lists of IP addresses and count how many addresses appear in both. The combined HTML + RSS list contains 3001 distinct IPs. Of these, only 310 appear in both lists.
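The overlap counting comes down to a few set operations. Here is a minimal sketch. The function name and the sample addresses (taken from the 192.0.2.0/24 documentation range) are invented for illustration and are not from the actual log.

```python
def overlap_stats(html_ips, rss_ips):
    """Compare the IPs seen in HTML requests with those seen in RSS requests."""
    html, rss = set(html_ips), set(rss_ips)  # set() does the deduping
    return {
        "html_unique": len(html),
        "rss_unique": len(rss),
        "html_only": len(html - rss),   # "adjusted unique IPs" for HTML
        "rss_only": len(rss - html),    # "adjusted unique IPs" for RSS
        "both": len(html & rss),        # IPs seen in both kinds of request
        "combined": len(html | rss),    # distinct IPs overall
    }


# Made-up addresses, with repeats to show the deduping.
html_ips = ["192.0.2.1", "192.0.2.2", "192.0.2.2", "192.0.2.3"]
rss_ips = ["192.0.2.3", "192.0.2.4", "192.0.2.4"]
print(overlap_stats(html_ips, rss_ips))

# Consistency check on the figures reported in this post.
assert 1306 - 310 == 996          # HTML-only IPs
assert 2005 - 310 == 1695         # RSS-only IPs
assert 1306 + 2005 - 310 == 3001  # distinct IPs in the combined list
```

The three assertions at the end just confirm that the reported unique, adjusted, and combined counts agree with an overlap of 310.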

Former URL: http://weblog.infoworld.com/udell/2003/06/08.html#a716