Zeyu Jin will present his Pre FPO, "Speech Synthesis for Text-Based Editing of Audio Narrations," on Tuesday, December 5th, 2017 at 2pm in CS 302.

The members of his committee are as follows:

Advisor: Adam Finkelstein
Readers: Thomas Funkhouser, Gautham Mysore (Stanford)
Non-readers: Marshini Chetty, Szymon Rusinkiewicz

Everyone is invited to attend his talk. The talk title and abstract follow below.

Title: Speech Synthesis for Text-Based Editing of Audio Narrations

Abstract: Editing audio narration using conventional software typically involves many painstaking low-level manipulations. Some state-of-the-art systems allow the editor to work in a text transcript of the narration and to perform select, cut, copy, and paste operations directly in the transcript; these operations are then automatically applied to the waveform in a straightforward manner. However, an obvious gap in the text-based interface is the ability to type new words not appearing in the transcript, for example to insert a new word for emphasis or to replace a misspoken word. While high-quality voice synthesizers exist today, the challenge is to synthesize the new word in a voice that matches the rest of the narration. This work focuses on synthesizing a new word or short phrase such that it blends seamlessly into the context of the existing narration.

My thesis explores two strategies for voice synthesis in context. Our initial approach is to use a text-to-speech synthesizer to say the word in a generic voice, and then use data-driven voice conversion to convert it into a voice that matches the narration. Offering a range of degrees of control to the editor, our interface supports fully automatic synthesis, selection among a candidate set of alternative pronunciations, fine control over edit placement and pitch profiles, and even guidance by the editor's own voice. Our experiments show that 60% of the time, the synthesized words are indistinguishable from real recordings. The goal of our second approach is to improve that success rate by using a deep learning model to synthesize the waveform directly from low-level acoustic features.