Ned Batchelder alerted me this morning to Amazon's new search feature:
Now Amazon lets you search the full text of its books. This is astounding, not only because of the further differences it highlights between Amazon and traditional bookstores, but because of the effort it must have taken to accomplish. The text seems to be from scans of pages, subjected to an OCR process. And not just the bulk of popular books, either. They've got all sorts of wild and wooly volumes available this way. I don't know how truly useful it will be, since full text searching can be extremely noisy, even before the OCR noise is factored in. [Ned Batchelder: October 2003]
I wondered about the OCR strategy too. In this day and age, surely any publisher could provide electronic copy to an indexer. But then I drilled down and discovered something quite remarkable. I own a copy of Tesla: Man Out of Time. The other day, I was mentioning to someone that, according to that book, some of Nikola Tesla's writings are still classified. This query finds the passage I was remembering. Awesome! Now the physical book I bought from Amazon is more valuable to me. Its printed index has been augmented by a vastly more capable online index. This extremely useful capability is, by the way, also available to owners of books in the Safari Books Online service, though it correlates results only to chapter and section, not to page. Little-known fact: you need not be a Safari subscriber to use Safari as an augmented index to books you own.
Whether or not you own one of the books now searchable on Amazon, you can view a scan of any page in the book that matches a query. Clearly that could be abused. Searching for 'Tesla' in the Tesla book finds almost every page, for example. So Amazon requires you to log in to view those pages; presumably they'll monitor activity and shut down people who try to read whole books this way.
When I designed Safari, the notion of a fulltext-searchable book catalog was paramount. So was the notion of a browseable catalog that exposes introductory chunks of every section of every chapter to the public Web. This was a conscious strategy to create "Google surface area" -- and for a while, it worked. When you searched for a term found in an O'Reilly book, the Safari page for that book often showed up. But as time went on, Google seemed less willing to take the linkbait. Currently, it appears to be finding only a few hundred of the more than 1800 O'Reilly / Addison Wesley / New Riders / Prentice Hall / Que / Sams / Peachpit books in the Safari service, and then only the home pages of those books, not any of the tens of thousands of preview pages. Of course Google has no particular incentive to do an exhaustive job searching online catalogs. For businesses that are so incented, like Amazon, local search is the only way to guarantee coverage. I'll be fascinated to see whether and how such local search services federate -- with or without Google's cooperation.
Former URL: http://weblog.infoworld.com/udell/2003/10/24.html#a832