CS Colloquium Speaker
Speaker: Bryan Pardo, Northwestern University
Date: Monday, October 7
Time: 12:30pm EDT
Location: CS 105
Host: Adam Finkelstein
Event page: https://www.cs.princeton.edu/events/26714
Register for live-stream online here: https://princeton.zoom.us/webinar/register/WN_TZ_YrTZyRu-Amzcg5uFp-Q

Title: The Future is Hear: Innovations from the Interactive Audio Lab

Abstract: The Interactive Audio Lab, headed by Bryan Pardo, works at the intersection of machine learning, signal processing and human-computer interaction. The lab invents new tools to generate, modify, find, separate, and label sound. In this talk, Prof. Pardo will discuss three projects illustrative of the work in the lab: 

Text2FX: Audio effects (e.g., equalization, reverberation, compression) are a cornerstone of modern audio production. However, their complex and unintuitive controls (e.g., decay, cutoff frequency) make them challenging for non-technical musicians, podcasters and sound artists. As people naturally describe sound in terms like “bright” or “warm,” natural language can serve as a more intuitive and accessible way to navigate the complex parameter spaces of audio effects. Text2FX leverages a shared audio-text embedding space (CLAP) and differentiable digital signal processing (DDSP) to control audio effects, such as equalization and reverberation, using open-vocabulary natural language prompts (e.g., “make it sound in-your-face and bold”).

VampNet: In recent years, advances in discrete acoustic token modeling have resulted in significant leaps in autoregressive generation of speech and music. Meanwhile, approaches that use non-autoregressive parallel iterative decoding have been developed for efficient image synthesis. In this work, we combine parallel iterative decoding with acoustic token modeling and apply them to music audio synthesis. The resulting model, VampNet, is fast enough for interactive performance and can be prompted with music audio, making it well suited for creating loops and variational accompaniment in artistic contexts.

VoiceBlock: Deep-learning-based speaker recognition systems can facilitate mass surveillance, allowing search for a target speaker through thousands of concurrent voice communications. In this work, we propose a highly effective approach to anonymizing speech against automated speaker recognition systems while leaving the voice perceptually unaltered to a human listener. Because our method does not conceal speaker identity from human listeners, it still allows high-effort targeted surveillance (e.g., authorized human-attended wiretaps of criminal enterprises), while making mass automated surveillance significantly less reliable. In this way, we hope to return to the status quo of the 20th and early 21st centuries, in which the need for human listeners provided an important check on mass surveillance.

Bio: Bryan Pardo studies fundamental problems in computer audition, content-based audio search, and generative modeling of audio, and also develops inclusive interfaces for audio production. He is head of Northwestern University’s Interactive Audio Lab and co-director of the Northwestern University Center for HCI+Design. Prof. Pardo has appointments in the Department of Computer Science and the Department of Radio, Television and Film. He received an M. Mus. in Jazz Studies in 2001 and a Ph.D. in Computer Science in 2005, both from the University of Michigan. He has authored over 140 peer-reviewed publications. He has developed speech analysis software for the Speech and Hearing department of the Ohio State University and statistical software for SPSS, and has worked as a machine learning researcher for General Dynamics. His patented technologies have been productized by companies including Bose, Adobe, Lexi, and Ear Machine. While finishing his doctorate, he taught in the Music Department of Madonna University. When he is not teaching or researching, he performs on saxophone and clarinet with the bands Son Monarcas and The East Loop.