Podcast transcription

Nathan McFarland used his own service, CastingWords, to transcribe last Friday's podcast on the topic of harnessing collective intelligence. I published the transcript here, after a quick pass through the 5600-word document in which I made the following changes:

screen cast → screencast
Homes" → Holmes'
Suns → Sun's
peer two peer → peer to peer
computer farm → compute farm
but kicked → butt kicked
baring → barring
tasks → takes
librafox → LibriVox
swan.com → salon.com
acceptability → accessibility
here → hear
approvable → a provable
statistic → statistical
resource tool → resource pool
from rare → some rare
go over → go after
if → of
flash → Flash
same → same way
task it → task in

The most amusing substitution, as Ben Hill pointed out, was swan.com (weight loss surgery and breast enlargement) for salon.com.

I'd rate the quality of the transcription as very good. Reading through it raised several interesting issues. First, there's the question of domain expertise. A tech-savvy transcriber wouldn't have written peer two peer, for example. Although CastingWords could segment its workers by domain, Amazon's MTurk doesn't yet provide a market system for valuing different domains of expertise -- or, for that matter, different levels of skill within a domain. That possibility is, in fact, one of the most interesting points that came up in our discussion.

Then there's the question of whether, or how, to edit transcripts. An accurate transcript can be painful to read. Recently, for example, Robin Good published an interview with me on his MasterNewMedia site. It is scrupulously accurate, and as such it reproduces all of my verbal tics. The worst one is that, instead of "um" and "you know," I tend to punctuate speech with "right?" -- there are seventeen instances of that tic in the transcript.

The ear is more forgiving than the eye, so my own procedure, when transcribing, is to make people read better than they sound. If Quentin Clark were to compare this transcript of our interview with the raw audio, I'm sure he would find it to be somewhat different -- but pleasingly so.

Of course this kind of editing is much more time- and labor-intensive. I only allotted myself 15 minutes for a quick run-through of the Hill/McFarland podcast transcript. It would have taken a lot longer to produce a text that would read smoothly. Note also that the audio from which the podcast was transcribed was itself carefully edited by me.

At each level of the process -- audio post-production, transcription, editing -- there are elements that I'd want to outsource and elements that I'd want to do myself. In audio post-production, for example, I'd want to do all of the selection and rearrangement, but I'd be happy to outsource things like normalization, mixing, and MP3 production. Then I'd like to review and correct the transcript as I've done here. And if I could find competent editors at attractive rates, I'd like to see if it would be feasible to review and revise their work rather than to do all the editing myself.

All this, of course, presumes that someone is paying for the final product. In the case of my podcasts themselves, never mind transcriptions of them, nobody is, at least not yet. But somebody might, in one way or another, so it's possible I could justify 42 cents per minute to publish raw transcripts routinely. Is it worth doing? Let me know what you think.

Update: David Hochman writes:

Is there a schema to represent a transcript as structured data that identifies the timepoint of each translated phrase in the audio file? Is there a common URL format (like a target) to represent a precise spot or range in an audio file?

If that data was set free, a whole bunch of someones would use their superpowers to mashup an audio editor, MP3 player, transcription, RSS, OPML, Bayesian filter, tag cloud, TTS (text-to-speech), and more.

The results could improve translation, synchronize a transcript with the audio, allow editing the transcript alongside the audio, transform audio into a human-readable web page or document with audio excerpts, and who knows what other magic.
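As a sketch of what such structured transcript data might look like, here's a minimal example in Python. The record layout and the `#t=start,end` URL fragment convention are assumptions for illustration, not an existing standard -- the point is only that once each phrase carries time offsets, addressing a precise spot in the audio becomes a simple string operation.

```python
# A minimal sketch: a transcript as structured data, where each phrase
# records its start/end offsets (seconds) in the source audio file.
# The "#t=start,end" fragment syntax is a hypothetical convention.

def fragment_url(audio_url, start, end=None):
    """Build a URL addressing a point or range within an audio file."""
    if end is None:
        return f"{audio_url}#t={start:g}"
    return f"{audio_url}#t={start:g},{end:g}"

transcript = [
    {"start": 0.0, "end": 4.2, "text": "Welcome to the show."},
    {"start": 4.2, "end": 9.8, "text": "Today: collective intelligence."},
]

# Emit a clickable index: each phrase linked to its spot in the audio.
for phrase in transcript:
    url = fragment_url("http://example.com/podcast.mp3",
                       phrase["start"], phrase["end"])
    print(f"{url} -> {phrase['text']}")
```

With data in this shape, the mashups Hochman imagines -- synchronized players, side-by-side transcript editors, excerpted documents -- all reduce to walking the list and following the offsets.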
This reminds me of a couple of things. First, the W3C's Timed Text initiative, about which I know only that it exists; I should find out more.

Second, Doug Kaye's item about how the dynamic assembly of audio files affects things like the MP3 clipping service implemented by ITConversations and by me.

As is true everywhere else, this stuff needs to be an interplay between emergent applications and standards that coordinate them. The ideas have been around for a long time. Now routine use of such applications is within reach. But there aren't many people reaching for them yet, so it's still one of those chicken-and-egg deals.

James Andrewartha writes:

Annodex is the free software standard for annotating media. It even comes with a wiki that can be used by a community to do transcription, eg http://media.annodex.net/cmmlwiki/OSSForum-Trailer?start=8.
There's an associated IETF Internet draft: The Continuous Media Markup Language (CMML). I've read up a bit on Annodex, and this page about the Annodex authoring process crystallizes my concerns:
Annodex bitstreams are authored by interleaving CMML and audio-visual bitstreams into one multiplexed Ogg container.
For better or worse, it seems to me that popularizing Ogg containers will involve pushing a big rock up the proverbial Sisyphean hill. If possible, I would rather find ways to work with popular formats. That said, you should certainly take a look at this in Firefox, because the effects achieved are spot on.
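For concreteness, a CMML annotation looks roughly like the following -- this fragment reflects my reading of the Internet draft, so element and attribute details may not match the final specification:

```xml
<!-- A sketch of CMML clip markup, per the IETF draft: each clip marks a
     time range in the media stream and attaches a link and description. -->
<cmml>
  <head>
    <title>OSS Forum Trailer</title>
  </head>
  <clip id="intro" start="npt:8">
    <a href="http://media.annodex.net/cmmlwiki/OSSForum-Trailer">Intro segment</a>
    <desc>Opening remarks, starting eight seconds in.</desc>
  </clip>
</cmml>
```

Note how the `?start=8` in the wiki URL above maps onto a clip's `start` time -- the annotation layer and the addressing scheme are two views of the same offsets.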

Former URL: http://weblog.infoworld.com/udell/2006/05/08.html#a1444