How translucency could defuse the Turnitin/McLean High controversy

From a 9/22/2006 story in the Washington Post:

The for-profit service known as Turnitin checks student work against a database of more than 22 million papers written by students around the world, as well as online sources and electronic archives of journals. School administrators said the service, which they will start using next week, is meant to deter plagiarism at a time when the Internet makes it easy to copy someone else's words. But some McLean High students are rebelling. [Students Rebel Against Database Designed to Thwart Plagiarists]
Whether students' intellectual property rights are infringed by Turnitin's incorporation of their work into its database is an interesting question. But there should be no need to answer it in order to resolve the conflict. Here's why.

Turnitin's business is (or should be) only to detect plagiarism. To do that, it must build a database. But surprisingly and counterintuitively, the documents stored in that database need not be readable by human beings. To meet the business requirement, they need only be machine-readable versions derived from the human-readable originals.

Suppose the previous sentence appears in a student assignment. A cryptographic hash function can convert that sentence into a short, fixed-length sequence of seemingly random characters.
The operation is called a one-way hash because although it will reliably and repeatedly convert the same sentence into the same sequence of characters, you cannot reverse it. The sentence is not recoverable from its derived sequence. What's more, it's very unlikely that two different sentences will yield the same sequence.
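Both properties are easy to demonstrate. Here's a minimal sketch using Python's standard hashlib module; SHA-256 is an illustrative choice, not necessarily the function a service like Turnitin would use:

```python
import hashlib

sentence = ("To meet the business requirement, they need only be "
            "machine-readable versions derived from the human-readable originals.")

# The same input always produces the same digest...
digest = hashlib.sha256(sentence.encode("utf-8")).hexdigest()
assert digest == hashlib.sha256(sentence.encode("utf-8")).hexdigest()

# ...but even a one-word change yields a completely different digest.
altered = sentence.replace("originals", "original")
assert digest != hashlib.sha256(altered.encode("utf-8")).hexdigest()

print(digest)  # 64 hex characters; the sentence cannot be recovered from them
```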

So here's a strategy for Turnitin. Convert each sentence of each student document into its corresponding sequence of characters, store only that sequence in the database, and discard the original sentence. Now the database contains no intellectual property subject to misuse. Even if it wanted to, Turnitin couldn't improperly mine the database. Neither could anyone who bought or stole the data.

But Turnitin can use the database for its sole valid purpose: to detect plagiarism. How? By deriving one-way hashes from each sentence of each document that it checks for plagiarism, and then by searching its database for those derived sequences of characters.
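A minimal sketch of that lookup, assuming sentences are normalized identically on both sides (a real system would need robust sentence segmentation, not the crude split on periods used here):

```python
import hashlib

def sentence_hashes(document: str) -> set[str]:
    """Split a document into sentences (crudely, on periods) and hash each one."""
    sentences = (s.strip() for s in document.split("."))
    return {hashlib.sha256(s.encode("utf-8")).hexdigest()
            for s in sentences if s}

# Build the database from submitted papers, storing only hashes.
database = sentence_hashes(
    "The quick brown fox jumps over the lazy dog. Original prose here.")

# Check a new submission: any hash already present flags a copied sentence.
submission = "Some new writing. The quick brown fox jumps over the lazy dog."
copied = sentence_hashes(submission) & database
print(f"{len(copied)} sentence(s) match the database")
```

Note that the database side never needs the cleartext: the hashes alone support the comparison.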

The strategy at work here is explored in an important but underappreciated book by Peter Wayner called Translucent Databases. The difference between transparency and translucency is the difference between clear glass and frosted glass. Light shines through both, but frosted glass creates a controlled distortion of objects behind the glass.

Though as yet poorly understood and rarely applied, the principle of translucent data management can help us defuse the ticking time bombs that many Internet services are becoming. In the many ongoing debates about privacy of data, accountability for its proper use, and risk of misuse, two critical questions are almost never asked:

  1. What's the least quantity of customer data that a database operator must store in order to satisfy a business requirement?
  2. What's the least useful representation of the data that can satisfy the requirement?

From the customer's perspective the incentives are clear. You want a service to store as little of your data as is necessary. And you want it stored in ways that are as useful to you as you require, while at the same time being as useless to the database operator as is feasible.

From the perspective of the service, though, it's a mixed bag. On the one hand, there's a strong incentive to gather as much data as possible, in as useful a form as is possible, in order to mine it, trade it, or sell it. On the other hand, there's a countervailing incentive to avoid the PR (and perhaps legal) disaster that will ensue if misuse of the data -- either intentional or inadvertent -- is revealed.

Turnitin may never have thought about running its database translucently. How might customers influence its thinking? Here's a three-point plan.

  1. Appreciate that providers of services are subject to these conflicting incentives.
  2. Think about the least quantity and least useful form of data that online services must control in order to perform their business functions.
  3. Don't rely on wielding the stick of legal compulsion. Dangle the carrot of enlightened self-interest too. Storing more data than is necessary, in a more useful form than is necessary, can cause problems that service providers are incented to avoid.

Maybe the students will be found to have intellectual property rights in this case. If so, maybe the law will be inclined to defend those rights. But neither Turnitin nor the McLean High students are going to enjoy slogging through that swamp, and nobody is likely to reach solid ground on the other side anytime soon.

Translucency is a way to think outside the box and bypass the swamp. It's an idea whose time has come.

Update: Two readers, A. Henderson and Daniel Freudberg (who is a member of the McLean High School Committee for Students' Rights), wrote to ask essentially the same thing: Does cryptographic transformation create a derivative work also subject to copyright protection? It's a great question, and I don't know the answer, but I think I see a workaround. Turnitin could arrange for students to license these derivative works to it, for the sole purpose of comparison with other work.

Now, as Daniel Freudberg points out, the students are not merely objecting on intellectual property grounds. They also object to the preemptive nature of an anti-plagiarism regime that presumes everyone may be guilty until proven innocent. That's a reasonable concern, and it's outside the scope of the scheme I proposed. But now that I think of it, there are at least two reasons why students might wish to participate in a Turnitin-like scheme voluntarily:

  1. To protect themselves against inadvertent plagiarism of other student work or, more likely, of published sources. (Related: See this example of how services like Google Books are about to transform the nature of citation.)
  2. To protect their own work against intentional or inadvertent plagiarism by others.

While we're on the subject, I highly recommend Malcolm Gladwell's New Yorker piece from 2004, Something Borrowed, which explores how all so-called original creative work is also necessarily synthetic and derivative. The more that our technology can reveal the common DNA shared among different texts, the better we'll be able to judge what are proper and improper modes of sharing, and the more comprehensively we'll be able to visualize the flow of ideas.

Further update: Several folks -- Liz Lawley in email, and David S. in comments -- point out correctly that if texts are altered in the way I propose, instructors cannot evaluate the context in which non-original material appears.

In addition, Liz offers her take on Turnitin, from the perspective of a satisfied customer of the service:

Before I started using turnitin.com in my freshman classes, I generally had about a 10% rate of heavy plagiarism (25% or more of the paper taken word-for-word from another source). That meant I was failing 4-6 students a quarter. Since starting to use turnitin four years ago, I've had only *one* case of a student plagiarizing. They seem to believe that the computer will catch them, but that I won't. That's made the relationship between me and my students less adversarial rather than more.

I'm really skeptical about the motivation of these copyright challenges. There's no monetary harm being done to these students, and no public display of their works. So while there may end up being grounds for a challenge in the letter of the law, I don't think it succeeds if you're looking at the spirit of the law.

What I've found is that most papers show up with 10% or more "matching" material, because it flags quotes. And it's not uncommon for a paper with 25% or more flagged material to end up being not plagiarized, but poorly written (lots of quotes strung together with minimal original content in between).

I'm not claiming to have perfect solutions either in this case or in the example I've used elsewhere. (Kim Cameron, however, has been noodling usefully on that one.) I am, however, trying to spark some discussion about the costs and benefits of translucency, an option that designers of software and information systems rarely (if ever) consider.

In the Turnitin example, we can imagine a scenario in which the method I proposed is a first line of defense. Its job is to find the subset of documents requiring further analysis. We can further imagine that Turnitin orchestrates a protocol whereby it retrieves items in that subset, from student- or university-controlled archives, and presents them to teachers in cleartext in an analytical framework that's part of the value added by their service, while never retaining those cleartext documents in its database.
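That two-tier protocol might be sketched as follows. Everything here is hypothetical -- the hash-to-archive mapping, the archive itself (simulated by a dict), and the review step stand in for machinery Turnitin would have to build -- but it shows how flagging can happen translucently while cleartext stays under student or university control:

```python
import hashlib

def h(sentence: str) -> str:
    return hashlib.sha256(sentence.encode("utf-8")).hexdigest()

# Translucent database: each sentence hash maps only to the archive ID
# of the document it came from -- no cleartext is stored here.
hash_db = {h("The quick brown fox jumps over the lazy dog"): "archive:paper-17"}

# Stage 1: flag archive documents whose hashes overlap the submission.
submission = ["Some new writing", "The quick brown fox jumps over the lazy dog"]
flagged = sorted({hash_db[h(s)] for s in submission if h(s) in hash_db})

# Stage 2: fetch flagged documents in cleartext from the student- or
# university-controlled archive, present them to the teacher, retain nothing.
archive = {"archive:paper-17": "The quick brown fox jumps over the lazy dog. ..."}
for doc_id in flagged:
    cleartext = archive[doc_id]              # retrieved on demand
    print(f"{doc_id}: needs human review")   # shown to the teacher, not stored
```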

Why go to all this trouble? Because we may be able to make some architectural choices about the surveillance systems we're creating. The McLean High students worry that use of Turnitin will make adversaries of students and teachers. From Professor Lawley's perspective that relationship already was adversarial. Now that students "seem to believe that the computer will catch them", it has become less so. As we work through these kinds of issues, what values do we uphold? And which technical architectures can best embody them?

In The Transparent Society, David Brin argues that the only sustainable model will be one in which the cameras watch everybody all the time, but nobody has exclusive access to what they see and record. We can debate the pros and cons of this arrangement but, as an architectural style, it has the compelling virtue of extreme simplicity.

Translucency, on the other hand, is inherently complex, and maybe that alone will doom it. But if it can support an architecture of surveillance that embodies values we care about, we owe it to ourselves to explore that option.
