Yunyun Wang will present her General Exam, "Controllable Speech Representation Learning via Voice Conversion," on Wednesday, May 11, 2022, at 1:30 pm via Zoom.


Zoom Link: https://princeton.zoom.us/j/9195627075?pwd=a1N5a2VzYy92cEFMVjFTZUZnOGpiUT09


Committee Members: Adam Finkelstein (advisor), Karthik Narasimhan, Olga Russakovsky


Abstract:

Speech representation learning transforms speech into features suitable for downstream tasks such as speech recognition, phoneme classification, or speaker identification. For such recognition tasks, a representation can be lossy (non-invertible), as is typical of BERT-like self-supervised models. When used for synthesis tasks, however, we find that these lossy representations are insufficient for plausibly reconstructing the input signal. This paper introduces a method for invertible and controllable speech representation learning based on disentanglement. The representation can be decoded into a signal perceptually identical to the original. Moreover, its disentangled components (speech content, pitch, speaker identity, and energy) can be controlled independently to alter the synthesis result. Our model builds upon AutoVC-F0, a zero-shot model trained for voice conversion, in which the goal is to modify an audio recording of one speaker so that the identity sounds like that of another speaker without altering the speech content. On top of this model, we introduce an alteration-invariant content loss (AIC loss) as well as adversarial training (GAN). Through objective measures and subjective tests, we show that our formulation offers significant improvements in voice conversion sound quality as well as more precise control over the disentangled features.
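For attendees unfamiliar with the setup, the following is a minimal, hypothetical sketch (in PyTorch) of the disentanglement idea the abstract describes; it is not the authors' implementation, and all module choices and dimensions are illustrative assumptions. The point is only that separate codes for content, speaker identity, pitch, and energy are recombined by a decoder, so each component can be swapped independently at synthesis time.

    # Illustrative sketch only -- not the AutoVC-F0 code.
    import torch
    import torch.nn as nn

    class DisentangledVC(nn.Module):
        def __init__(self, n_mels=80, content_dim=32, spk_dim=64):
            super().__init__()
            # A narrow bottleneck discourages the content code from
            # leaking speaker identity, pitch, or energy.
            self.content_enc = nn.GRU(n_mels, content_dim, batch_first=True)
            # Decoder consumes content plus the conditioning signals.
            self.decoder = nn.GRU(content_dim + spk_dim + 2, n_mels,
                                  batch_first=True)

        def forward(self, mel, spk_emb, f0, energy):
            # mel: (B, T, n_mels); spk_emb: (B, spk_dim); f0, energy: (B, T)
            content, _ = self.content_enc(mel)
            t = mel.size(1)
            cond = torch.cat([content,
                              spk_emb.unsqueeze(1).expand(-1, t, -1),
                              f0.unsqueeze(-1),
                              energy.unsqueeze(-1)], dim=-1)
            out, _ = self.decoder(cond)
            # Reconstruction; at inference, swapping spk_emb (or shifting
            # f0/energy) alters only that component of the output.
            return out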


Reading List:

https://docs.google.com/document/d/1YTGrKke9HYHa7bxtNiyTyHpArrwSEXulPmrb_DwISos/edit?usp=sharing


Everyone is invited to attend the talk, and faculty wishing to remain for the oral exam that follows are welcome to do so.