The announcement has been updated to include a Zoom link for virtual attendees.
From: lriehl@cs.princeton.edu on behalf of gradinfo--- via talks
Sent: Wednesday, May 28, 2025 2:10 PM
To: 'talks'
Subject: [talks] Yunyun Wang will present her FPO "Generative Universal
Models for Speech" on Friday, May 30, 2025 at 2:00 PM in CS 302.
Yunyun Wang will present her FPO "Generative Universal Models for Speech" on
Friday, May 30, 2025 at 2:00 PM in CS 302 & Zoom.
Zoom link:
https://princeton.zoom.us/j/9195627075?pwd=a1N5a2VzYy92cEFMVjFTZUZnOGpiUT09
The members of Yunyun's committee are as follows:
Examiners: Adam Finkelstein (Adviser), Szymon Rusinkiewicz, Felix Heide
Readers: Danqi Chen, Zeyu Jin (Adobe Research)
A copy of her thesis is available upon request. Please email
gradinfo@cs.princeton.edu if you would like a copy.
Everyone is invited to attend her talk.
Abstract follows below:
This thesis presents a comprehensive framework for controllable speech
synthesis through self-supervised generative modeling. We propose Generative
Universal Models for Speech (GUMS), a system that decomposes speech into
disentangled representations (speaker embeddings, acoustic embeddings, and
content representations) and reconstructs it using a synthesis model. Our
approach enables detailed control over speaker voice, environmental
acoustics, speech content, and speaking rate.
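As a rough illustration of the decompose-then-resynthesize framing described above, here is a minimal sketch with hypothetical names and placeholder encoders; none of these interfaces come from the thesis itself:

    # Hypothetical sketch of factoring speech into speaker / acoustics / content
    # and resynthesizing; all names and shapes are illustrative placeholders.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class SpeechFactors:
        speaker: np.ndarray    # global speaker embedding (time-invariant)
        acoustics: np.ndarray  # environmental / acoustic embedding
        content: np.ndarray    # frame-level content representation

    def decompose(wave: np.ndarray) -> SpeechFactors:
        # placeholder encoders; a real system would run trained networks here
        return SpeechFactors(speaker=np.zeros(192),
                             acoustics=np.zeros(64),
                             content=np.zeros((len(wave) // 320, 128)))

    def resynthesize(f: SpeechFactors) -> np.ndarray:
        # placeholder synthesizer; a real system would run a generative model here
        return np.zeros(f.content.shape[0] * 320)

    # Controllable synthesis then amounts to editing one factor and resynthesizing,
    # e.g. voice conversion by swapping in another utterance's speaker factor:
    src = decompose(np.zeros(16000))
    tgt = decompose(np.ones(16000))
    converted = resynthesize(SpeechFactors(tgt.speaker, src.acoustics, src.content))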
We introduce three key representation models. First, GR0 learns global
speaker embeddings by disentangling them from time-varying local content
without requiring speaker labels. Second, we develop content representation
models AIC and GUMS Codec that capture speech content in continuous and
quantized forms, respectively. The AIC model enforces speaker and pitch
invariance through the alteration-invariant content loss. GUMS Codec builds
on the speech codec model DAC, incorporating residual vector quantization
along with speaker and pitch conditioning. The result is a highly compact,
discrete, and language-independent representation that is well-suited for
manipulation, control, and efficient transmission.
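For readers unfamiliar with residual vector quantization, the mechanism that DAC-style codecs use to produce compact discrete codes, here is a toy numpy sketch of the general idea (generic illustration only, not the thesis implementation; codebook sizes and dimensions are arbitrary):

    # Toy residual vector quantization (RVQ): each stage quantizes the residual
    # left by the previous stage, so a few small codebooks yield a compact code.
    import numpy as np

    rng = np.random.default_rng(0)
    num_stages, codebook_size, dim = 4, 256, 64
    codebooks = rng.normal(size=(num_stages, codebook_size, dim))

    def rvq_encode(x):
        """Quantize vectors x of shape (frames, dim) into per-stage code indices."""
        residual = x.copy()
        codes = []
        for cb in codebooks:
            # nearest codebook entry for each frame
            d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
            idx = d.argmin(axis=1)
            codes.append(idx)
            residual = residual - cb[idx]   # pass the residual to the next stage
        return np.stack(codes)              # shape (num_stages, frames)

    def rvq_decode(codes):
        """Sum the selected codebook entries across stages to reconstruct."""
        return sum(cb[idx] for cb, idx in zip(codebooks, codes))

    x = rng.normal(size=(10, dim))
    x_hat = rvq_decode(rvq_encode(x))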
We then integrate these representations into a high-fidelity speech
synthesis model, DiTVC, based on a Diffusion Transformer architecture. DiTVC
enables direct prompting using target speaker audio instead of relying on
fixed embeddings, allowing for more expressive voice conversion and robust
prosody control. By combining these models, we achieve controllable,
high-quality speech synthesis using unlabeled, in-the-wild data. The unified
framework advances both representation learning and generation, offering an
interpretable and editable approach to speech synthesis.
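The synthesis step can be pictured, very loosely, as iterative refinement from noise conditioned on the content codes and raw target-speaker prompt features. The toy sampler below uses a generic flow-style Euler update and an assumed denoiser callable; it is not DiTVC's actual formulation:

    # Generic prompt-conditioned iterative refinement sketch; the denoiser
    # interface and the Euler update rule are assumptions, not DiTVC's design.
    import numpy as np

    def sample(denoiser, content_codes, prompt_feats, steps=50, shape=(200, 80)):
        """Integrate from Gaussian noise toward an output (e.g. a spectrogram),
        conditioning every step on content codes and speaker prompt features."""
        x = np.random.randn(*shape)
        dt = 1.0 / steps
        for i in range(steps):
            t = i * dt
            v = denoiser(x, t, content=content_codes, prompt=prompt_feats)
            x = x + dt * v   # Euler step along the predicted direction
        return x

    # stand-in denoiser so the sketch runs; a real model would be a trained network
    dummy_denoiser = lambda x, t, content, prompt: -x
    mel = sample(dummy_denoiser, content_codes=None, prompt_feats=None)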