News about Google News

InfoWorld's online folks have long complained about the absence of InfoWorld news stories from Google News. Here is the most striking illustration of the problem:

  1. WWW.google.com/news?q=site:WWW.infoworld.com returns "Results 1 - 10 of about 2,040,000"

  2. NEWS.google.com/news?q=site:WWW.infoworld.com returns "No pages were found"

  3. NEWS.google.com/news?q=site:WEBLOG.infoworld.com returns "Results 1 - 10 of about 297"

To summarize:

  1. Lots of www.infoworld.com articles are in the main Google index.

  2. But no www.infoworld.com articles are in the Google News index.

  3. However some weblog.infoworld.com articles are in the Google News index.

From these observations I concluded that, for whatever reason, Google News must have decided that www.infoworld.com is not a news source (although weblog.infoworld.com is). Which raised the question: How exactly does Google News decide what qualifies as a news source?

When I asked that question of Google spokesperson Megan Lamb, she offered the following guidelines, which, though evidently not published anywhere, I am reporting here with her permission:

Google strives to be as inclusive as possible, without regard to political viewpoint or ideology, while also providing a high quality experience for our users. Some of the things we look for when evaluating news organizations include:

- The source offers information that is updated regularly.

- It is managed by an organization (not an individual) and includes organizational information on its site.

- The source does not include hate speech or pornography.

- The source does not allow open posting of content without editorial review.

- The source's website is technically conducive to inclusion.

Google News has assured InfoWorld that it does, of course, meet these criteria, and does qualify as a news source. So apparently I was wrong to conclude that www.infoworld.com was deliberately excluded from the club. But what could the problem be, then?

According to Google News product manager Nathan Stoll, the omission is a technical problem rather than an editorial one. The Google News crawler, he says, is a very different beast from the regular Google crawler. And while the regular crawler happily includes our stuff, the news crawler -- for reasons as yet undetermined -- doesn't.

I was surprised to learn this because I've only ever been aware of three user-agent strings (i.e., crawler signatures) broadcast by Google bots:

  1. GoogleBot (for the main index)

  2. GoogleBot-Image (for images)

  3. Feedfetcher-Google (for RSS feeds)

There's no separate signature for the news crawler; it identifies itself as GoogleBot too. Since Stoll says the main crawler and the news crawler use different algorithms for site traversal and page analysis, I'd expect them to identify themselves differently. But, perhaps for historical reasons, they don't.
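
To make the consequence concrete, here is a minimal Python sketch of how someone reading server logs might sort Google's bots by user-agent. The function name, the labels, and the sample string in the usage note are my own illustrations, not anything Google publishes; only the three signatures come from the list above. The point it shows is that a GoogleBot hit cannot be attributed to the main crawler or the news crawler from the user-agent alone.

```python
# Signatures taken from the list above. They are checked most-specific
# first so that "GoogleBot-Image" is not mistaken for plain "GoogleBot".
# The labels on the right are illustrative, not official names.
SIGNATURES = [
    ("googlebot-image", "images"),
    ("feedfetcher-google", "RSS feeds"),
    ("googlebot", "main index or news (indistinguishable)"),
]


def classify_google_bot(user_agent):
    """Return a label for a Google bot user-agent, or None if no match.

    Matching is a case-insensitive substring test, which is an
    assumption about how the signature appears in a real header.
    """
    ua = user_agent.lower()
    for token, label in SIGNATURES:
        if token in ua:
            return label
    return None
```

For example, `classify_google_bot("GoogleBot-Image/1.0")` returns `"images"`, while any string containing only `GoogleBot` lands in the combined "main index or news" bucket, which is exactly the ambiguity described above.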

As of today, InfoWorld's problem remains unresolved and is still under investigation. Arguably it is a conflict of interest for me to write about this matter, given that its resolution is in InfoWorld's financial interest and therefore indirectly mine. But the facts that have emerged about the editorial policy and technical nature of Google News seem, well, newsworthy. And since I am evidently a news source I thought I should pass them along.


Former URL: http://weblog.infoworld.com/udell/2006/07/19.html#a1489