Jiaqi Su will present her FPO "Studio-quality Speech Enhancement" on Thursday, May 5, 2022 at 3:30 PM in Friend 125 and on Zoom.
Location: Friend 125 and Zoom (https://princeton.zoom.us/j/6073737365)
The members of Jiaqi’s committee are as follows:
Examiners: Adam Finkelstein (Adviser), Olga Russakovsky, Karthik Narasimhan
Readers: Szymon Rusinkiewicz, Zeyu Jin (Adobe Research)
A copy of her thesis is available upon request. Please email gradinfo@cs.princeton.edu if you would like a copy of the thesis.
Everyone is invited to attend her talk.
Abstract follows below:
Modern speech content such as podcasts, video narrations, and audiobooks typically requires high-quality audio to support a strong sense of presence and a pleasant listening experience. However, real-world recordings captured with consumer-grade equipment often suffer from quality degradations including noise, reverberation, equalization distortion, and loss of bandwidth. Conventional speech enhancement methods typically focus on removing a subset of these degradations with the goal of improving signal clarity and intelligibility, but they fall short of studio quality. This dissertation addresses speech enhancement with a focus on improving the perceptual quality and aesthetics of recorded speech. It describes how to improve single-channel, real-world, consumer-grade recordings so that they sound like professional studio recordings, a task we refer to as studio-quality speech enhancement. In addressing this problem, we identify three challenges: objective functions misaligned with human perception, the shortcomings of commonly used audio representations (i.e., spectrogram and waveform), and the lack of available high-quality speech data for training.
This dissertation presents a waveform-to-waveform deep neural network solution that consists of two steps: (1) enhancement by removing all quality degradations at limited bandwidth (i.e., 16 kHz sample rate), and (2) bandwidth extension from 16 kHz to 48 kHz to produce a high-fidelity signal. The enhancement stage relies on a perceptually motivated GAN framework that combines both waveform and spectrogram representations, and learns from simulated data covering a broad range of realistic recording scenarios. The bandwidth extension stage shares a similar design with the enhancement method, but focuses on filling in missing high-frequency details at 48 kHz. Finally, we extend the studio-quality speech enhancement problem to a more general problem called acoustic matching, which converts recordings to an arbitrary acoustic environment.
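The abstract does not say how the simulated training data is generated. As a rough illustration only, the sketch below degrades a clean 16 kHz signal with two of the degradations named above, reverberation and noise, to form a (degraded, clean) training pair. Every function name, the toy exponentially decaying impulse response, and the 20 dB default signal-to-noise ratio are assumptions made for this example, not details taken from the thesis.

```python
import math
import random

SAMPLE_RATE = 16000  # limited-bandwidth rate mentioned in the abstract


def rms(x):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def convolve(signal, impulse_response):
    """Direct-form convolution, truncated to the input length."""
    out = [0.0] * len(signal)
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(impulse_response):
            if n - k < 0:
                break
            acc += h * signal[n - k]
        out[n] = acc
    return out


def synthetic_impulse_response(length, decay, rng):
    """Hypothetical toy room response: direct path plus a decaying noise tail."""
    ir = [rng.uniform(-1.0, 1.0) * math.exp(-decay * i) for i in range(length)]
    ir[0] = 1.0  # unit-gain direct path
    return ir


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise scaled to the requested signal-to-noise ratio."""
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    gain = rms(signal) / (rms(noise) * 10 ** (snr_db / 20))
    return [s + gain * n for s, n in zip(signal, noise)]


def degrade(clean, snr_db=20.0, seed=0):
    """Simulate a consumer-grade recording: reverberation, then additive noise."""
    rng = random.Random(seed)
    ir = synthetic_impulse_response(64, 0.1, rng)
    reverberant = convolve(clean, ir)
    return add_noise(reverberant, snr_db, rng)


# A 0.1 s, 220 Hz tone stands in for clean studio speech.
clean = [math.sin(2 * math.pi * 220 * n / SAMPLE_RATE) for n in range(1600)]
degraded = degrade(clean)
```

A real simulation pipeline would presumably draw on recorded room responses and noise, and would also cover equalization distortion and bandwidth loss; this sketch only shows the general shape of pairing each clean signal with a synthetically degraded copy.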