[talks] Zeyu Jin will be presenting his Generals on May 19, 2015 at 2pm in CS 402.

Nicki Gotsis ngotsis at CS.Princeton.EDU
Mon May 11 10:45:32 EDT 2015


The members of his committee are Adam Finkelstein (adviser), Tom Funkhouser, and Barbara Engelhardt.

Everyone is invited to attend his talk, and faculty wishing to remain for the oral exam that follows are welcome to do so.  His abstract and reading list follow below.

Title
Text-Based Editing for Recorded Narration

Abstract
Recorded audio narration plays a crucial role in many contexts including online lectures, documentaries, podcasts, and radio. Recording narration is easy, and the web and the proliferation of online media resources make distribution increasingly easy as well. However, editing the audio remains relatively arduous, especially for non-experts. A professional recording studio staffed by a sound engineer can make a good narration for an online lecture sound great, for example by making content-level edits such as changing the timing, correcting words, and modifying the prosody of the narration. But such tasks will not scale as we increasingly move lecture content online; the editing needs to be done by the person who makes the recording, who is rarely an expert audio engineer.

Therefore, we propose a text-based editing system in which a non-expert user can edit the audio of a narration by manipulating the text of its transcript, as in a text editor. Our solution starts with an audio recording and its transcript. Our first goal is to precisely align the words in the transcript to their corresponding regions in the audio waveform, using a variant of dynamic time warping to match vocal features. Next, to accelerate a section that is too slow, we devise a differential time-compression algorithm that treats different kinds of sounds differently and preserves clarity relative to prior methods. For word insertion and modification, we devise a data-driven voice conversion method that concatenates small audio segments drawn from the rest of the recording, selected to match a computer-synthesized rendering of the target words. A user can also supply his own recording of a word to control the prosody of the concatenated result.
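The alignment step in the abstract builds on dynamic time warping. As a rough illustration of that core idea (not the system's actual method: the function name, the toy one-dimensional features, and the Euclidean frame distance are all illustrative stand-ins for real vocal features), a minimal DTW sketch looks like this:

```python
import numpy as np

def dtw_align(a, b):
    """Align two feature sequences (n x d and m x d arrays).
    Returns the total alignment cost and the warping path as (i, j) pairs."""
    n, m = len(a), len(b)
    # Pairwise Euclidean distance between every frame of a and b.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    # Accumulated-cost matrix with the standard three-way recurrence.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # match
                acc[i - 1, j],      # skip a frame of b
                acc[i, j - 1],      # skip a frame of a
            )
    # Backtrack from the corner to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return acc[n, m], path[::-1]
```

Mapping word boundaries from the transcript onto the warping path is then a matter of reading off which audio frames each word's feature frames align to.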


Textbooks

Lawrence Rabiner and Ronald Schafer, Theory and Applications of Digital Speech Processing, Prentice Hall, 2010.

P. Taylor, Text-to-Speech Synthesis, Cambridge Univ. Press, 2009. Chapters 9-12 and 14-16.

Papers

A. F. Machado and M. Queiroz. Voice conversion: A critical survey. In SMC, 2010.
https://www.ime.usp.br/~mqz/SMC2010_Voice.pdf

Y. Stylianou, “Voice transformation: A survey,” in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2009), Taipei, Taiwan, April 2009, pp. 3585-3588.
http://www.cs.cmu.edu/~pmuthuku/mlsp_page/lectures/Stylianou_VC.pdf

K. Fujii, J. Okawa, and K. Suigetsu, High-individuality voice conversion based on concatenative speech synthesis. Proceedings of World Academy of Science: Engineering & Technology, 36 (2007).
http://waset.org/publications/1272/high-individuality-voice-conversion-based-on-concatenative-speech-synthesis

J. Lu, F. Yu, A. Finkelstein, and S. DiVerdi. HelpingHand: Example-based Stroke Stylization. ACM Transactions on Graphics (Proc. SIGGRAPH) 31(4):46:1-46:10, August 2012.
http://gfx.cs.princeton.edu/gfx/pubs/Lu_2012_HES/Lu_2012_HES.pdf

A. de Cheveigné and H. Kawahara. YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111:1917, 2002.
http://audition.ens.fr/adc/pdf/2002_JASA_YIN.pdf

B. Kulis, Metric Learning: A Survey, Foundations and Trends in Machine Learning, Vol. 5, No. 4, 2012, pp. 287-364.
http://web.cse.ohio-state.edu/~kulis/pubs/ftml_metric_learning.pdf

K. Lent, An Efficient Method for Pitch Shifting Digitally Sampled Sounds, Computer Music Journal, Vol. 13, No. 4 (Winter 1989), pp. 65-71.
http://www.jstor.org/discover/10.2307/3679554?uid=3739560&uid=2129&uid=2&uid=70&uid=4&uid=3739256&sid=21103178755863

S. Hoffmann and B. Pfister, Text-to-Speech Alignment of Long Recordings Using Universal Phone Models, Interspeech 2013.
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=612866

C.J. Leggetter, P.C. Woodland, Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, Computer Speech & Language, Volume 9, Issue 2, April 1995, Pages 171-185
http://www.sciencedirect.com/science/article/pii/S0885230885700101

S. Rubin, F. Berthouzoz, G. J. Mysore, W. Li, and M. Agrawala. 2013. Content-based tools for editing audio stories. In Proceedings of the 26th annual ACM symposium on User interface software and technology (UIST '13). ACM, New York, NY, USA, pp. 113-122.
https://ccrma.stanford.edu/~gautham/Site/Publications_files/rubin-uist2013.pdf
