InfoWorld's online folks have long complained about the absence of InfoWorld news stories from Google News. Here is the most striking illustration of the problem:
|1. WWW.google.com/news?q=site:WWW.infoworld.com||Results 1 - 10 of about 2,040,000|
|2. NEWS.google.com/news?q=site:WWW.infoworld.com||No pages were found|
|3. NEWS.google.com/news?q=site:WEBLOG.infoworld.com||Results 1 - 10 of about 297|
From these observations I concluded that, for whatever reason, Google News must have decided that www.infoworld.com is not a news source (although weblog.infoworld.com is). Which raised the question: How exactly does Google News decide what qualifies as a news source?
When I asked that question of Google spokesperson Megan Lamb, she offered the following guidelines, which, though evidently not published anywhere, I am reporting here with her permission:
Google strives to be as inclusive as possible, without regard to political viewpoint or ideology, while also providing a high quality experience for our users. Some of the things we look for when evaluating news organizations include:
- The source offers information that is updated regularly.
- It is managed by an organization (not an individual) and includes organizational information on its site.
- The source does not include hate speech or pornography.
- The source does not allow open posting of content without editorial review.
- The source's website is technically conducive to inclusion.
Google News has assured InfoWorld that www.infoworld.com does, of course, meet these criteria, and does qualify as a news source. So apparently I was wrong to conclude that www.infoworld.com was deliberately excluded from the club. But what could the problem be, then?
According to Google News product manager Nathan Stoll, the omission is a technical problem rather than an editorial one. The Google News crawler, he says, is a very different beast from the regular Google crawler. And while the regular crawler happily includes our stuff, the news crawler -- for reasons as yet undetermined -- doesn't.
I was surprised to learn this because I've only ever been aware of three user-agent strings (i.e., crawler signatures) broadcast by Google bots.
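A site operator who wants to see which Google user agents actually reach a server could tally them from the access log. What follows is only a minimal sketch, not anything Google or InfoWorld described: it assumes the log uses the common "combined" format, in which the User-Agent header is the last quoted field on each line. The function name `tally_google_agents` and the sample log lines are made up for illustration, and the sample strings are not a claim about which three signatures I had in mind.

```python
import re
from collections import Counter

# Assumption: combined log format, where the last quoted field
# on each line is the User-Agent header.
USER_AGENT_FIELD = re.compile(r'"([^"]*)"\s*$')


def tally_google_agents(log_lines):
    """Count each distinct User-Agent string that mentions 'google'."""
    counts = Counter()
    for line in log_lines:
        match = USER_AGENT_FIELD.search(line)
        if match is None:
            continue  # no trailing quoted field; skip the line
        agent = match.group(1)
        if "google" in agent.lower():
            counts[agent] += 1
    return counts


# Hypothetical sample lines, invented for illustration only.
sample = [
    '192.0.2.1 - - [19/Jul/2006:10:00:00 -0700] "GET /a.html HTTP/1.1" '
    '200 512 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.2 - - [19/Jul/2006:10:00:05 -0700] "GET /b.html HTTP/1.1" '
    '200 734 "-" "ExampleBrowser/1.0"',
    '192.0.2.3 - - [19/Jul/2006:10:00:09 -0700] "GET /c.html HTTP/1.1" '
    '200 301 "-" "Mediapartners-Google/2.1"',
]

if __name__ == "__main__":
    for agent, hits in tally_google_agents(sample).most_common():
        print(hits, agent)
```

A tally like this shows only which signatures show up in the log. It cannot explain why a given crawler stays away, which is the part still under investigation.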
As of today, InfoWorld's problem remains unresolved and is still under investigation. Arguably it is a conflict of interest for me to write about this matter, given that its resolution is in InfoWorld's financial interest and therefore indirectly mine. But the facts that have emerged about the editorial policy and technical nature of Google News seem, well, newsworthy. And since I am evidently a news source I thought I should pass them along.
Former URL: http://weblog.infoworld.com/udell/2006/07/19.html#a1489