Jiaqi Su will present his Pre-FPO, "Studio Quality Speech Enhancement," on Thursday, December 16, 2021 at 3 pm via Zoom.


Zoom Link

https://princeton.zoom.us/j/6073737365

Committee:

Examiners: Adam Finkelstein (advisor), Olga Russakovsky, Karthik Narasimhan

Readers: Szymon Rusinkiewicz, Zeyu Jin


Abstract:

Modern speech content creation tasks such as podcasts, video voice-overs, and audiobooks require studio-quality audio with full bandwidth and balanced equalization. However, real-world recordings captured in natural spaces with consumer-grade devices suffer from noise, reverberation, equalization distortion, and bandwidth limitations. Cleaning up such recordings poses a challenge for conventional speech enhancement methods, which typically focus on removing strong acoustic degradations to improve speech clarity and intelligibility, but fall short of studio quality. In this talk, I will present an end-to-end deep learning approach for transforming recorded speech to sound as though it had been recorded in a studio. The approach relies on a multi-domain, multi-scale GAN, coupled with deep feature matching in the discriminators, to model the perception of sound quality. It also incorporates a separate pre-trained recurrent neural network that predicts the acoustic features of the clean target from those of the noisy input; these predicted features guide the generation of the clean audio. This combined approach draws on the effectiveness of acoustic features in modeling human perception of speech, while retaining the benefits of waveform-to-waveform conversion. I will first show the model in the context of speech enhancement at 16 kHz, and then discuss how to adapt the approach to bandwidth extension, which fills in missing high-frequency details at 48 kHz. Together, these two components form a full studio-quality speech enhancement pipeline for real-world recordings. Finally, with access to a studio-quality version of the recording, we can optionally repurpose components of this pipeline to adapt the recording so that it matches the acoustic qualities of a specific target environment.