Practical natural language processing, circa 2005

If documentation and quality assurance are dull topics -- and let's face it, they are -- you'd think that combining them in a product for assuring the quality of documentation would be a crashing bore. But here at the Gilbane content management conference in Boston today, I saw a fascinating demo of just such a product: Acrolinx's acrocheck. The company has deep roots in computational linguistics, has spent thirty years developing an engine for analyzing natural languages, and is focused entirely on using that engine to solve a single problem: defining rules for consistent use of terminology and grammar in technical documentation, and measuring how consistently teams of tech writers apply them.

The Acrolinx website has screencasts that show the final result: interactive rule enforcement within popular publishing tools such as Word and FrameMaker. I wanted to see where the rules came from, and CEO Andrew Bredenkamp obligingly gave me a tour of the Eclipse development environment in which rules are developed. This is not something customers do. When you buy this product, you also hire Acrolinx consultants who are expert in the art of analyzing your texts, helping you define or integrate sets of terms, and -- most subtly -- tuning the grammatical rules for your domain.

The examples are mostly mundane: standardizing the term used to refer to a component, or the phrase used to express the idea "fill out the form," across massive sets of documents that exist in multiple translations. Most people never give this stuff a second thought. Only documentation managers can truly appreciate the value of controlling these details.
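
To make the terminology case concrete, here is a minimal sketch in Python of the simplest possible version of the idea: flag any phrase that should have been the preferred one. It is plain string matching, not linguistic analysis, and it says nothing about how acrocheck works internally. The term list, the function name `find_term_violations`, and the sample phrases are all hypothetical, made up for illustration.

```python
import re

# Hypothetical term list: each preferred phrase maps to the variants
# that should be flagged wherever they appear.
PREFERRED_TERMS = {
    "fill out the form": ["complete the form", "fill in the form"],
    "power supply": ["power unit", "PSU"],
}


def find_term_violations(text, preferred_terms=PREFERRED_TERMS):
    """Return (line_number, variant, preferred) for each flagged variant."""
    violations = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for preferred, variants in preferred_terms.items():
            for variant in variants:
                # Whole-phrase, case-insensitive match.
                pattern = r"\b" + re.escape(variant) + r"\b"
                if re.search(pattern, line, flags=re.IGNORECASE):
                    violations.append((line_number, variant, preferred))
    return violations


if __name__ == "__main__":
    sample = "Please complete the form.\nCheck the PSU before use."
    for line_number, variant, preferred in find_term_violations(sample):
        print(f"line {line_number}: use '{preferred}' instead of '{variant}'")
```

A matcher like this breaks down as soon as a phrase is inflected or reworded ("the form must be completed"), which is presumably where a real language engine and tuned grammatical rules earn their keep.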

Although Andrew Bredenkamp doesn't like to speculate about general-purpose uses of his engine, I'll take the liberty of doing so for him. I've wondered for a long time how natural language processing will enter the mainstream. My guess is that email will be the vector. Suppose you could mine email for the following patterns:




I've written about this kind of thing before and will again. As with voice recognition, natural language processing isn't likely to deliver major breakthroughs. It's a long slog, but over decades you can look back and see the progress that's been made. Categorizing email and other kinds of interpersonal messages according to the speech acts they express is an age-old challenge, and the goal still eludes us, but it's nice to be reminded from time to time that the enabling technology is slowly but surely maturing.
