[talks] Zeyu Jin will present his Pre FPO, "Speech Synthesis for Text-Based Editing of Audio Narrations" on Tuesday, December 5th, 2017 at 2pm in CS 302.

Nicki Gotsis ngotsis at CS.Princeton.EDU
Thu Nov 30 13:40:34 EST 2017


Zeyu Jin will present his Pre FPO, "Speech Synthesis for Text-Based Editing of Audio Narrations" on Tuesday, December 5th, 2017 at 2pm in CS 302.  

The members of his committee are as follows:
Advisor: Adam Finkelstein
Readers: Thomas Funkhouser, Gautham Mysore (Stanford)
Non-readers: Marshini Chetty, Szymon Rusinkiewicz

Everyone is invited to attend his talk. The talk title and abstract follow.


Title: Speech Synthesis for Text-Based Editing of Audio Narrations

Abstract:
Editing audio narration using conventional software typically involves
many painstaking low-level manipulations. Some state-of-the-art
systems allow the editor to work in a text transcript of the
narration and perform select, cut, copy, and paste operations directly
in the transcript; these operations are then automatically applied to
the waveform in a straightforward manner. However, an obvious gap in
the text-based interface is the ability to type new words not
appearing in the transcript, for example inserting a new word for
emphasis or replacing a misspoken word. While high-quality voice
synthesizers exist today, the challenge is to synthesize the new word
in a voice that matches the rest of the narration. This work focuses
on synthesizing a new word or short phrase such that it blends
seamlessly in the context of the existing narration.
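
For readers unfamiliar with how transcript edits translate to waveform
edits, here is a minimal illustrative sketch, not the system described
in the talk: assuming word-level timestamps from a forced aligner, a
cut in the transcript becomes a cut of the corresponding sample range.
All names and the Word structure below are hypothetical.

    # Illustrative sketch only: assumes word-level timestamps from a
    # forced aligner; names and structure are hypothetical, not the
    # system described in the talk.
    from dataclasses import dataclass

    import numpy as np


    @dataclass
    class Word:
        text: str
        start: float  # seconds
        end: float    # seconds


    def cut_words(audio: np.ndarray, sr: int, words: list[Word],
                  i: int, j: int) -> tuple[np.ndarray, list[Word]]:
        """Remove words[i:j] from the transcript and the matching samples."""
        s = int(words[i].start * sr)
        e = int(words[j - 1].end * sr)
        new_audio = np.concatenate([audio[:s], audio[e:]])
        removed = words[j - 1].end - words[i].start
        # Words after the cut keep their text but shift earlier in time.
        kept = words[:i] + [Word(w.text, w.start - removed, w.end - removed)
                            for w in words[j:]]
        return new_audio, kept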

My thesis explores two strategies for voice synthesis in context. Our
initial approach is to use a text to speech synthesizer to say the
word in a generic voice, and then use data-driven voice conversion to
convert it into a voice that matches the narration. Offering a range
of degrees of control to the editor, our interface supports fully
automatic synthesis, selection among a candidate set of alternative
pronunciations, fine control over edit placements and pitch profiles,
and even guidance by the editor's own voice. Our experiments show that
60% of the time, the synthesized words are indistinguishable from real
recordings. The goal of our second approach is to improve that success
rate using a deep learning model to synthesize the waveform directly
based on low-level acoustic features.
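
As a rough illustration of the voice-conversion idea behind the first
strategy (not the method presented in the talk), one can learn a
frame-level mapping from the generic TTS voice's spectral features to
the narrator's, given time-aligned parallel data. The sketch below uses
a simple least-squares linear map; practical systems use nonlinear
(e.g. neural) models and a vocoder to reconstruct the waveform.

    # Illustrative sketch of frame-level voice conversion; hypothetical,
    # not the talk's method. Assumes src_frames and tgt_frames are
    # spectral features already time-aligned (e.g. by dynamic time
    # warping), shape (n_frames, n_features).
    import numpy as np


    def fit_frame_mapping(src_frames: np.ndarray,
                          tgt_frames: np.ndarray) -> np.ndarray:
        """Least-squares map W such that src_frames @ W ~= tgt_frames."""
        W, *_ = np.linalg.lstsq(src_frames, tgt_frames, rcond=None)
        return W


    def convert(frames: np.ndarray, W: np.ndarray) -> np.ndarray:
        """Map spectral frames of the generic TTS voice toward the narrator."""
        return frames @ W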

