Podcasting's transcription dilemma

I owe a huge thank-you to Eleanor Kruszewski, who has transcribed my audio interview with Intervoice's Ron Owens. And we should all thank her for raising the uncomfortable issue of podcast transcriptions, which, for the most part, are missing in action.

In this thoughtful post Eleanor offers an important critique. She says, in part:

Now, I've said that audio doesn't work for me personally, and that's my bias. I'm not the only one though - this has been discussed by Marc Canter and Tim Bray. You can see Udell's response here. Tim's phrase "four guys talking" captures my problem exactly.

I don't intend to be hypercritical here, but it's important we look at what this mode of interaction means - what it allows and requires both for the creator and the listener. The creator - as we saw in Jon's case with the IVR conversation - benefits by just recording the conversation, doing the necessary processing (which is work and requires special equipment - but it's also tech tinkering, which is fun more than tedious), and serving it. They can share the content directly, without needing to mentally pre-process it. Listeners benefit too, as Jon says himself, by having direct access to the full context of a conversation, rather than having it distilled through the views and the prejudices of the interviewer.

It's true that text is lossy, but in podcasting we often just think about the benefits. The cost for users is fairly high. Skimming is impossible. Searching is impossible. Pacing is out of control - if it's too fast, you must go back (which is very cumbersome given the poor interface of the web plugins I use here, but might be easier on, say, an iPod); if it's too slow, you're stuck. Take Eric Rice - podcaster extraordinaire - for example. Now I like Eric personally, but he is a showman. He loves podcasting because it puts him in control of the pacing and the delivery. Listening to his podcasts, you can tell he is a radio personality, and it is his personality that he's sharing in these 'casts. So Eric shows us that the line between content and entertainment blurs with podcasts. And that's great for all the people who tune in to talk radio. But wouldn't it be better if this media were indexable, searchable, and fungible... more like text? [EllementK: How can we tap into all this audio content?]

Eleanor, I violently agree. And I've been tackling this problem in a couple of different ways. But before I discuss them, let's hear from Russell Beattie who has recently experienced what, in podcasting circles, will probably come to be known as the drivetime revelation:

I drove down to work this morning and listened to Adam Curry on my new MuVo and was quite entertained for most of the ride down (he's actually a great morning DJ) and then coming back I listened to most of Jerry Fiddler's talk over at IT Conversations. WOW! What a difference it made to my hour-long commute! I almost didn't want to get out of the car tonight. No searching for a station, no frustration in the topics that NPR is talking about tonight and though I love her dearly, no Terry Gross. Awesome!

Now both of these audio clips are definitely something that if I was in my house I would've listened to about 10 minutes of before I found something else to do just because that's how things are. But in the car, I had these and other 'casts queued up and ready to go and was completely absorbed. [RussellBeattie: PodCommuter]

I'm lucky enough not to have to commute, but my drivetime equivalent is my daily run. (Sometimes it is, instead, a swim, which is why I'm seriously considering one of these.) In these situations I don't want to skim, I want to be absorbed in a compelling story. The challenge becomes finding a story to listen to that is truly compelling, and that doesn't end too soon.

That said, Eleanor is absolutely right to point out that a medium which is only useful to commuters or joggers or hikers or swimmers -- and in particular, the people in these categories who are not deaf -- has limited reach.

Automated transcription will help expand its reach someday. But despite my recent success with Dragon NaturallySpeaking, we're not there yet. I made an 11.025 kHz 16-bit mono PCM file out of part of my interview with Ron Owens and fed it to the program's auto-transcriber. The results were predictably awful. Even so, output like this might play a near-term role as search bait. A lot of juicy keywords show up in the transcript, and if the text were sprinkled with random-access links into the audio, it could provide some kind of searchable index. But I doubt the benefit would outweigh the cost of loading up the search engines with a lot of otherwise useless garbage.

Phonetic indexing, described below, is a very different search strategy with a greater chance of near-term success. But the best approach, it seems to me, is one that's more low-tech and Web-native. We need to make it easy for bloggers to quote from, and textually annotate, audio and video content. Here's the argument I laid out in the first installment of my Primetime Hypermedia column:

We normally assume that you can't search video and audio. In fact, there's technology in the pipeline that could turn that assumption on its head. In 2002, I reviewed (registration required) a revolutionary phonetic indexing system from Fast-Talk Communications (now Nexidia). Using a demo application based on this system, I was able to record phone interviews, index them in near-realtime, and then search them phonetically. For example, to find an occurrence of "MySQL" using this application, "my sequel" was an effective search term. I probably could have used "my seek well" instead. Rather than doing something really hard -- converting speech to text, then indexing the text -- the system takes a clever shortcut. It recognizes and indexes raw phonemes (the basic sounds of speech), translates your search term into phonemes, and searches accordingly.
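To make the shortcut concrete, here is a toy sketch of phoneme-level search. It is not Nexidia's method, whose internals aren't described here. The lexicon, the ARPAbet-like symbols, the hand-written phoneme stream (a stand-in for what a recognizer might emit), and the max_mismatches parameter are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical pronunciation lexicon (ARPAbet-like symbols, hand-written).
LEXICON: Dict[str, List[str]] = {
    "my": ["M", "AY"],
    "sequel": ["S", "IY", "K", "W", "AH", "L"],
    "seek": ["S", "IY", "K"],
    "well": ["W", "EH", "L"],
}

# Stand-in for recognizer output: (phoneme, offset in seconds) pairs
# for a speaker saying roughly "we use MySQL here". Invented data.
STREAM: List[Tuple[str, float]] = [
    ("W", 12.0), ("IY", 12.1),
    ("Y", 12.3), ("UW", 12.4), ("Z", 12.5),
    ("M", 12.8), ("AY", 12.9), ("S", 13.1), ("IY", 13.2),
    ("K", 13.3), ("W", 13.4), ("AH", 13.5), ("L", 13.6),
    ("HH", 14.0), ("IY", 14.1), ("R", 14.2),
]


def to_phonemes(term: str) -> List[str]:
    """Translate a search term into phonemes, word by word."""
    phonemes: List[str] = []
    for word in term.lower().split():
        phonemes.extend(LEXICON[word])
    return phonemes


def search(stream: List[Tuple[str, float]], term: str,
           max_mismatches: int = 0) -> List[float]:
    """Return start offsets where the term's phonemes occur in the stream,
    tolerating up to max_mismatches differing phonemes per window."""
    query = to_phonemes(term)
    hits: List[float] = []
    for start in range(len(stream) - len(query) + 1):
        window = stream[start:start + len(query)]
        mismatches = sum(1 for (p, _), q in zip(window, query) if p != q)
        if mismatches <= max_mismatches:
            hits.append(window[0][1])
    return hits


print(search(STREAM, "my sequel"))                       # [12.8]
print(search(STREAM, "my seek well"))                    # []
print(search(STREAM, "my seek well", max_mismatches=1))  # [12.8]
```

In this toy, "my seek well" differs from "my sequel" by a single vowel, so it only hits once one mismatch is allowed. A real system would presumably score acoustic similarity rather than count exact symbol mismatches.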

It worked well enough for me to become a practical tool, not just a novelty. I don't know when Nexidia's (or comparable) technology will find its way into Google and its competitors, but sooner or later I expect it to radically transform our use of media content. For example, a Nokia presentation at JavaOne this year included a segment on web services middleware, focusing on a developer framework that simplifies access to a Liberty-based identity service. The whole video is 19 minutes long; this particular segment runs from 11:45 to 13:05. Suppose a search of all the JavaOne videos for "web services Liberty" yielded, near the top of a relevance-ranked list, a pointer to that segment within that stream. That's revolutionary, feasible, and coming soon -- I hope.

Even when we can search audio and video this way, though, I've concluded that text wrapped around segments within AV streams will be a potent way of finding those segments -- maybe the most potent. For the foreseeable future, text will be much more efficiently searchable than AV content. What's more, blogs together with text search engines form the nexus within which interesting bits of content are drawn to our attention. Conferences produce many hours of AV content that no one has time to consume. Buried within those hours of content are highlights that people will want to discuss and share. Bloggers today refer to those highlights, but they rarely link to them. When we do that, the Google dynamic can kick in. A few days after this article is published, I won't need to remember where I stashed the rtsp: URL that I used in the previous paragraph. I'll just need to be able to find the article in which I used that URL. So this very article will contribute a small piece of what I hope will become a massive index into the hypermedia Web. Blogs, in aggregate, will provide the bulk of that index.
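The idea of text wrapped around segments can be sketched in a few lines: short annotations, each tied to a time range in a media file, feed an ordinary word index. Everything named here is an assumption for illustration. The example.com URL is a placeholder, the Annotation type and function names are hypothetical, and the #t=start,end suffix borrows the W3C Media Fragments temporal syntax, which postdates the column and is not what the original rtsp: link used.

```python
import re
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class Annotation(NamedTuple):
    text: str       # what a blogger wrote about the segment
    media_url: str  # placeholder URL of the whole recording
    start: str      # "MM:SS" or "HH:MM:SS"
    end: str


def to_seconds(timestamp: str) -> int:
    """Convert "11:45" or "0:11:45" to a number of seconds."""
    seconds = 0
    for part in timestamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def segment_link(note: Annotation) -> str:
    """Build a link that points at one segment of the recording."""
    return f"{note.media_url}#t={to_seconds(note.start)},{to_seconds(note.end)}"


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(notes: List[Annotation]) -> Dict[str, Set[str]]:
    """Map each word in an annotation to the segment links it describes."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for note in notes:
        link = segment_link(note)
        for word in tokenize(note.text):
            index[word].add(link)
    return index


def find_segments(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return links whose annotations contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


NOTES = [
    Annotation("Nokia on web services middleware and a Liberty-based identity service",
               "http://example.com/javaone/nokia.mp4", "11:45", "13:05"),
    Annotation("Placeholder note about an unrelated keynote",
               "http://example.com/javaone/keynote.mp4", "0:30", "2:00"),
]

index = build_index(NOTES)
print(find_segments(index, "web services Liberty"))
```

The search engine never touches the audio or video in this sketch. It only sees the surrounding text, which is the point of the argument.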

How did my prediction fare? I was a bit optimistic. The query "web services" liberty won't return my column, with its embedded video URL, in Google's top 10. The query "web services" liberty "javaone" nokia video, however, will.

It's true that commuters and joggers don't need or want podcast transcripts, but it's disingenuous to suggest that nobody does. In a perfect world every audio and video file posted online would be associated with a high-quality transcript. The reality is that, for the kinds of interactive and fast-moving conversations that we are now capturing thanks to these media, that's not going to happen anytime soon. Despite that, the decentralized context engine known as the blogosphere will tend to surface the most interesting segments in textual form. Those textual annotations will aid the discovery of other segments, which themselves may be annotated and partly transcribed.

For example, this one-minute clip contains a funny bit that doesn't show up in the transcript:

Udell: Is part of what you do helping people figure out how to do the development of personalities?

Owens: Yes.

Udell: So you have psychologists and theatrical people on staff?

Owens: We have no psychologists. We do have some people that qualify as pretty theatrical.

It was a nice moment that humanized both of us and, as Eleanor points out in her commentary, marked a turning point in the conversation:

Udell really didn't get the value of the company at this point, and Owens was not doing a good job pitching it; as you will see, the conversation takes a much different turn in the next exchange.

I'd forgotten about that moment. But when Eleanor reminded me of it, I was prompted to do a bit of transcription. Of course, anyone else could do the same. My hunch is that these small acts of linking and annotation will add up to something pretty powerful, and that this is how the blogosphere's natural tendency to filter and contextualize will gradually enfold audio and video content.

Former URL: http://weblog.infoworld.com/udell/2005/01/05.html#a1144