My old web mindshare calculator has been updated for the Spidering Hacks book. The original version from 1999 used AltaVista to measure what I called the web mindshare -- that is, the number of indexed inbound links -- for a collection of sites in a Yahoo! category. The new version is updated to use Google. Cool! That project was one of the first things that really got me thinking about what Web services would inevitably become. Here's how I described it in my book:
In effect, every web site is a scriptable component, and the Web as a whole is a vast library of such components. You can invoke these individually from any scripting language that can issue HTTP requests and interpret the responses.
What's more, you can join components to achieve novel effects. For example, I've used Yahoo! and AltaVista in combination to measure the "mindshare" of web sites in specific categories. To do that, I wrote a Perl script that uses Yahoo!'s namespace API to unroll the subdirectories under a node of the Yahoo! directory tree, yielding a consolidated list of URLs belonging to some category, such as /Science/Nanotechnology/. Then the script feeds that list of URLs, one at a time, to AltaVista, using its CGI API to ask, for each site, how many other pages in the AltaVista index refer to that site. The ranked list of these citation counts measures what I call the web mindshare of the sites.
Yahoo! wasn't designed to produce an unrolled list of sites in a category, but its web API can be made to do it. Likewise, AltaVista wasn't designed to count references to each of the sites in such a list, but its web API can be made to do it. These two macrocomponents, driven remotely by a 100-line Perl script (see http://www.byte.com/features/1999/03/udellmindshare.html), can be joined to create a new application that measures web mindshare. [Practical Internet Groupware, Chapter 8, Organizing Search Results]
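The shape of that script can be sketched roughly as follows. This is Python rather than the original Perl, and it is only an illustration under assumptions: the two web lookups are passed in as hypothetical callables (list_category and count_links) instead of reproducing the old Yahoo! and AltaVista interfaces, and the names unroll and mindshare are invented here.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins for the two web services. list_category takes a
# category path and returns (site URLs listed there, subcategory paths);
# count_links takes a site URL and returns its inbound-link count.
ListCategory = Callable[[str], Tuple[List[str], List[str]]]
CountLinks = Callable[[str], int]


def unroll(category: str, list_category: ListCategory) -> List[str]:
    """Collect site URLs from a category and all of its subcategories."""
    seen_categories = set()
    seen_sites = set()
    sites: List[str] = []
    stack = [category]
    while stack:
        current = stack.pop()
        if current in seen_categories:
            # Guard against categories that link back to each other.
            continue
        seen_categories.add(current)
        urls, subcategories = list_category(current)
        for url in urls:
            if url not in seen_sites:
                seen_sites.add(url)
                sites.append(url)
        stack.extend(subcategories)
    return sites


def mindshare(sites: List[str], count_links: CountLinks) -> List[Tuple[int, str]]:
    """Ask for each site's inbound-link count and rank highest first."""
    counts = [(count_links(url), url) for url in sites]
    return sorted(counts, key=lambda pair: pair[0], reverse=True)
```

To use it, supply one function that scrapes or queries a directory and another that queries a search engine for a link count, then call mindshare(unroll("/Science/Nanotechnology/", list_category), count_links). Keeping the two services behind plain callables is what makes it easy to swap one search engine for another.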
So I tried out the new version, and here are the top 15 of the 45 sites I got back:
28200 | http://www.computeractive.co.uk/
22200 | http://www.byte.com/
21500 | http://www.computerworld.com/
13700 | http://www.eweek.com/
12300 | http://www.cbronline.com/
7110 | http://www.ugeek.com
6480 | http://www.chip-online.com/
5790 | http://www.cmpnet.com
4610 | http://www.fcw.com/
2410 | http://www.digitmag.co.uk/
2040 | http://www.currents.net/
1950 | http://www.advisor.com/
1400 | http://www.esj.com/
1370 | http://www.onmagazine.com/
1100 | http://www.techworthy.com/
Except that didn't seem right. It wasn't just that InfoWorld didn't show up anywhere in the top 45. The last time I ran my version, in Jan 2001, it found almost 500 sites in the category. Then I saw why. The new version doesn't unroll the subcategories like the old one did. So I dusted the old one off, ran it, and sure enough there were all the usual suspects, plus some I don't remember from the last time I tried this. And what a parade of names! GUUUI, The Interaction Designer's Coffee Break. Phone Losers of America. Thin Planet, Serving the Thin Client Industry. Juiced.GS, The Magazine for Apple IIgs Users. Who knew?
Then I got to wondering about how different search engines would rank the same set of sites. So I repeated the experiment with Google and AllTheWeb. Here's the top layer of results:
You can see the complete results here, in four columns: AltaVista (as of Jan 2001), plus AltaVista, Google, and AllTheWeb as of today. This massive report not only gives your browser's table formatter a good workout, it raises interesting questions. The results differ pretty dramatically, in both magnitudes and rankings. I won't even pretend I can analyze why, though maybe insiders at the search engine companies could. Meanwhile, the best strategy might be to throw all the results into a pot and stir them up into a merged ranking. Everything's a service that can support recombinant growth. We're so fixated on Google nowadays, we tend to overlook the possibility of yoking multiple services together. It was easy to do that four years ago, and it's just as easy today.
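One way to stir the pot, as a minimal sketch: rank the sites within each engine, then order them by their average rank, so that the engines' very different magnitudes don't let any one of them dominate. Mean rank is just one assumed merging scheme among many, and the function name merged_ranking and its input layout (engine name mapped to per-site link counts) are invented for this illustration.

```python
from typing import Dict, List, Tuple


def merged_ranking(results: Dict[str, Dict[str, int]]) -> List[Tuple[float, str]]:
    """Merge per-engine link counts into one list ordered by mean rank.

    results maps an engine name to {site URL: inbound-link count}.
    Returns (mean rank, site) pairs, best (lowest mean rank) first.
    """
    all_sites = set()
    for counts in results.values():
        all_sites.update(counts)

    rank_sums = {site: 0 for site in all_sites}
    for counts in results.values():
        # A site an engine does not report is treated as having zero links.
        # Equal counts are ordered alphabetically by URL so output is stable.
        ordered = sorted(all_sites, key=lambda s: (-counts.get(s, 0), s))
        for position, site in enumerate(ordered, start=1):
            rank_sums[site] += position

    engines = len(results)
    return sorted((total / engines, site) for site, total in rank_sums.items())
```

Because only the positions matter, an engine that reports counts ten times larger than another gets no extra weight. Whether that is the right trade-off depends on how much you trust the raw magnitudes.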
Former URL: http://weblog.infoworld.com/udell/2003/12/04.html#a859