Nicki Gotsis ngotsis at CS.Princeton.EDU
Mon May 11 10:45:32 EDT 2015

Zeyu Jin will be presenting his Generals on May 19, 2015 at 2pm in CS 402.

The members of his committee are Adam Finkelstein (adviser), Tom Funkhouser, and Barbara Engelhardt.

Everyone is invited to attend his talk, and those faculty wishing to remain for the oral exam following are welcome to do so.  His abstract and reading list follow below.

Text-Based Editing for Recorded Narration

Recorded audio narration plays a crucial role in many contexts including online lectures, documentaries, podcasts, and radio. Recording narration is easy. Moreover, the web and a proliferation of online media resources make distribution increasingly easy as well. However, editing the audio remains relatively arduous, especially for non-experts. A professional recording studio staffed by a sound engineer can make a good narration for an online lecture sound great, for example by making content-level edits such as changing the timing, correcting words, and modifying the prosody of the narration. But such tasks will not scale as we increasingly move lecture content online; the editing needs to be done by the person who makes the recording, who is rarely an expert audio engineer. Therefore, we propose a text-based editing system where a non-expert user can edit the audio data of a narration by manipulating the text of a transcript (as done in a text editor). Our solution starts with an audio recording and its transcript. Our first goal is to precisely align the words in the transcript to their corresponding regions in the audio waveform, using a variant of dynamic time warping to match vocal features. Next, in order to accelerate a section that is too slow, we devise a differential time-compression algorithm that treats different kinds of sounds differently and preserves clarity better than prior methods. For word insertion and modification, we devise a data-driven voice conversion method that concatenates small audio segments selected from the rest of the recording to approximate the target words, with the segments selected to match a computer-synthesized voice. A user can also use his own recording of the word to control the prosody of the concatenated words.
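
For readers unfamiliar with the alignment step, the following is a minimal sketch of textbook dynamic time warping, not the variant described in the abstract. It assumes each sequence is a list of scalar feature values and uses absolute difference as the local distance; the function name dtw_path and its dist parameter are hypothetical, chosen only for illustration. A real aligner would likely compare per-frame vocal feature vectors instead.

```python
def dtw_path(a, b, dist=lambda x, y: abs(x - y)):
    """Classic dynamic time warping between two sequences.

    Returns (total_cost, path), where path is a list of (i, j) index
    pairs matching a[i] to b[j]. Scalar features and absolute-difference
    distance are assumptions made for this sketch.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j - 1],  # advance both sequences
                cost[i - 1][j],      # advance a only
                cost[i][j - 1],      # advance b only
            )
    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Toy example: the second sequence holds the value 1 for two frames.
total, path = dtw_path([0, 1, 2], [0, 1, 1, 2])
```

In this toy example the path maps index 1 of the first sequence to both indices 1 and 2 of the second. That is the kind of many-to-one stretching that lets a word in a reference signal cover a longer or shorter region of the recorded audio.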


