Measuring web mindshare

My old web mindshare calculator has been updated for the Spidering Hacks book. The original version from 1999 used AltaVista to measure what I called the web mindshare -- that is, the number of indexed inbound links -- for a collection of sites in a Yahoo! category. The new version is updated to use Google. Cool! That project was one of the first things that really got me thinking about what Web services would inevitably become. Here's how I described it in my book:

In effect, every web site is a scriptable component, and the Web as a whole is a vast library of such components. You can invoke these individually from any scripting language that can issue HTTP requests and interpret the responses.

What's more, you can join components to achieve novel effects. For example, I've used Yahoo! and AltaVista in combination to measure the "mindshare" of web sites in specific categories. To do that, I wrote a Perl script that uses Yahoo!'s namespace API to unroll the subdirectories under a node of the Yahoo! directory tree, yielding a consolidated list of URLs belonging to some category, such as /Science/Nanotechnology/. Then the script feeds that list of URLs, one at a time, to AltaVista, using its CGI API to ask, for each site, how many other pages in the AltaVista index refer to that site. The ranked list of these citation counts measures what I call the web mindshare of the sites.

Yahoo! wasn't designed to produce an unrolled list of sites in a category, but its web API can be made to do it. Likewise, AltaVista wasn't designed to count references to each of the sites in such a list, but its web API can be made to do it. These two macrocomponents, driven remotely by a 100-line Perl script (see http://www.byte.com/features/1999/03/udellmindshare.html), can be joined to create a new application that measures web mindshare. [Practical Internet Groupware, Chapter 8, Organizing Search Results]
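The shape of that script is easy to sketch. What follows is not the original Perl and not the Spidering Hacks version; it is a minimal Python illustration of the two steps described above (unroll a category, then rank its sites by inbound-link count). The nested-dict tree, the subcategory name, and the count_inbound_links callable are all assumptions standing in for the Yahoo! and search-engine calls, and the three counts are borrowed from the Google figures quoted in this post.

```python
from typing import Callable, Iterator, List, Tuple


def unroll(node: dict) -> Iterator[str]:
    """Yield every site URL under a category node, descending into subcategories."""
    yield from node.get("sites", [])
    for child in node.get("subcategories", {}).values():
        yield from unroll(child)


def mindshare(node: dict,
              count_inbound_links: Callable[[str], int]) -> List[Tuple[int, str]]:
    """Return (count, url) pairs for the whole subtree, highest count first."""
    urls = sorted(set(unroll(node)))  # de-duplicate sites listed in several subcategories
    scored = [(count_inbound_links(url), url) for url in urls]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical toy tree: the structure and the subcategory name are invented.
tree = {
    "sites": ["http://www.byte.com/", "http://www.computerworld.com/"],
    "subcategories": {
        "Hypothetical Subcategory": {
            "sites": ["http://www.eweek.com/"],
            "subcategories": {},
        },
    },
}

# Stand-in for a live search-engine query: a lookup table of inbound-link counts.
counts = {
    "http://www.byte.com/": 22200,
    "http://www.computerworld.com/": 21500,
    "http://www.eweek.com/": 13700,
}

ranked = mindshare(tree, counts.get)
for count, url in ranked:
    print(count, url)
```

Swapping counts.get for a function that queries a real search engine is the only change needed to compare engines; dropping the recursion in unroll reproduces the behavior of a version that looks only at the top-level category.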

So I tried out the new version, and here are the top 15 of the 45 sites I got back:

28200 http://www.computeractive.co.uk/
22200 http://www.byte.com/
21500 http://www.computerworld.com/
13700 http://www.eweek.com/
12300 http://www.cbronline.com/
7110 http://www.ugeek.com
6480 http://www.chip-online.com/
5790 http://www.cmpnet.com
4610 http://www.fcw.com/
2410 http://www.digitmag.co.uk/
2040 http://www.currents.net/
1950 http://www.advisor.com/
1400 http://www.esj.com/
1370 http://www.onmagazine.com/
1100 http://www.techworthy.com/

Except that didn't seem right. It wasn't just that InfoWorld didn't show up anywhere in the top 45. The last time I ran my version, in Jan 2001, it found almost 500 sites in the category. Then I saw why. The new version doesn't unroll the subcategories like the old one did. So I dusted that one off, ran it, and sure enough there were all the usual suspects, plus some I don't remember from the last time I tried this. And what a parade of names! GUUUI, The Interaction Designer's Coffee Break. Phone Losers of America. Thin Planet, Serving the Thin Client Industry. Juiced.GS, The Magazine for Apple IIgs Users. Who knew?

Then I got to wondering about how different search engines would rank the same set of sites. So I repeated the experiment with Google and AllTheWeb. Here's the top layer of results:

AltaVista

 1   494316  vnunet.com
 2   416244  internet.com
 3   183093  CMPnet
 4   118343  TechRepublic
 5    96498  Information Week
 6    90923  InfoWorld
 7    79022  Network World Fusion
 8    74731  Apache Week
 9    73787  Network Computing
10    72833  InternetWeek
11    70949  Computeractive Online
12    68756  Computerworld
13    67778  PC World
14    56451  Network Magazine
15    53707  E-Commerce Times

Google

 1   257000  internet.com
 2   194000  TechRepublic
 3    36600  Datamation
 4    29000  vnunet.com
 5    28200  Computeractive Online
 6    25700  Information Week
 7    25600  Network Computing
 8    22200  Byte.com
 9    21500  Computerworld
10    20900  InternetWeek
11    19800  InfoWorld
12    19000  Computer Gaming World
13    18300  MSDN Magazine
14    17500  Software Development Online
15    16800  C/C++ Users Journal

AllTheWeb

 1  1252762  internet.com
 2   363708  CMPnet
 3   338478  PC World
 4   290698  Information Week
 5   289626  InfoWorld
 6   263746  Network Computing
 7   251954  PC Magazine
 8   244718  InternetWeek
 9   241808  Business 2.0
10   218071  eWeek
11   216081  Computerworld
12   194335  Electronic Gaming Monthly
13   188707  Computer Gaming World
14   185146  Official U.S. PlayStation Magazine
15   184610  vnunet.com

You can see the complete results here, in four columns: AltaVista (as of Jan 2001), plus AltaVista, Google, and AllTheWeb as of today. This massive report not only gives your browser's table formatter a good workout, it raises interesting questions. The results differ pretty dramatically, in both magnitudes and rankings. I won't even pretend I can analyze why, though maybe insiders at the search engine companies could. Meanwhile, the best strategy might be to throw all the results into a pot and stir them up into a merged ranking. Everything's a service that can support recombinant growth. We're so fixated on Google nowadays, we tend to overlook the possibility of yoking multiple services together. It was easy to do that four years ago, and it's just as easy today.
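One way to do the stirring, offered as a sketch rather than a recommendation, is a Borda-style count: each engine awards a site points according to its rank, and the points are summed. The Python below applies that to the three top-15 lists above. The choice of Borda scoring, the borda_merge name, and the decision to give zero points to a site missing from an engine's top 15 are all assumptions; since only the top layer of each list is used here, the totals say nothing about the full report.

```python
from collections import defaultdict

# Top-15 sites per engine, in rank order, copied from the lists in this post.
top15 = {
    "AltaVista": [
        "vnunet.com", "internet.com", "CMPnet", "TechRepublic",
        "Information Week", "InfoWorld", "Network World Fusion",
        "Apache Week", "Network Computing", "InternetWeek",
        "Computeractive Online", "Computerworld", "PC World",
        "Network Magazine", "E-Commerce Times",
    ],
    "Google": [
        "internet.com", "TechRepublic", "Datamation", "vnunet.com",
        "Computeractive Online", "Information Week", "Network Computing",
        "Byte.com", "Computerworld", "InternetWeek", "InfoWorld",
        "Computer Gaming World", "MSDN Magazine",
        "Software Development Online", "C/C++ Users Journal",
    ],
    "AllTheWeb": [
        "internet.com", "CMPnet", "PC World", "Information Week",
        "InfoWorld", "Network Computing", "PC Magazine", "InternetWeek",
        "Business 2.0", "eWeek", "Computerworld",
        "Electronic Gaming Monthly", "Computer Gaming World",
        "Official U.S. PlayStation Magazine", "vnunet.com",
    ],
}


def borda_merge(rankings):
    """Sum rank-based points per site: first place in a list of n earns n points,
    last place earns 1, and a site absent from a list earns 0 from that list."""
    scores = defaultdict(int)
    for ranked_sites in rankings.values():
        n = len(ranked_sites)
        for position, site in enumerate(ranked_sites):
            scores[site] += n - position
    # Highest total first; ties broken alphabetically for a stable result.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


merged = borda_merge(top15)
for site, score in merged[:5]:
    print(f"{score:3d}  {site}")
```

Rank-based scoring sidesteps the fact that the engines report counts on very different scales, at the cost of discarding the magnitudes entirely. On these lists internet.com, which places second, first, and first, comes out on top.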


Former URL: http://weblog.infoworld.com/udell/2003/12/04.html#a859