Speech Technology and Research (STAR) Laboratory Seminar Series
Past talks: 2010
Abstract: In this talk, we present results on applying a personality assessment paradigm to speech input, and compare human and automatic performance on this task. We cue a professional speaker to produce speech using different personality profiles and encode the resulting vocal personality impressions in terms of the Big Five NEO-FFI personality traits. We then have human raters, who do not know the speaker, estimate the five factors. We analyze the recordings using signal-based acoustic and prosodic methods and observe high consistency between the acted personalities, the raters' assessments, and initial automatic classification results. We further validate the application of our paradigm to speech input, and extend it towards text-independent speech. We show that human labelers can consistently label speech data generated across multiple recording sessions with respect to personality, and investigate further which of the five scales in the NEO-FFI scheme can be assessed from speech, and how a manipulation of one scale influences the perception of another. Finally, we present a top-down clustering of human labels of personality traits derived from speech, which will be useful in future experiments on automatic classification of personality traits. This presents a first step towards being able to handle personality traits in speech, which we envision will be used in future voice-based communication between humans and machines.
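The top-down clustering of NEO-FFI labels mentioned in the abstract can be pictured with a small divisive-clustering sketch. The code below is illustrative only: the rating data are randomly generated stand-ins for per-utterance Big Five scores, and the simple bisecting 2-means procedure is a generic top-down scheme, not necessarily the one used in the work.

    # Toy sketch: top-down (divisive) clustering of per-utterance Big Five rating
    # vectors, by recursively splitting the largest cluster with 2-means.
    # Each row stands for one utterance's mean NEO-FFI ratings (Openness,
    # Conscientiousness, Extraversion, Agreeableness, Neuroticism), on a 1-5 scale.
    import numpy as np

    def two_means(X, n_iter=50, seed=0):
        """Split the rows of X into two groups with a plain 2-means loop."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), 2, replace=False)]
        for _ in range(n_iter):
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            for k in (0, 1):
                if np.any(labels == k):
                    centers[k] = X[labels == k].mean(axis=0)
        return labels

    def divisive_clustering(X, n_clusters=4):
        """Top-down clustering: start with one cluster, repeatedly bisect the largest."""
        clusters = [np.arange(len(X))]
        while len(clusters) < n_clusters:
            i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
            idx = clusters.pop(i)
            labels = two_means(X[idx])
            a, b = idx[labels == 0], idx[labels == 1]
            if len(a) == 0 or len(b) == 0:    # degenerate split; stop early
                clusters.append(idx)
                break
            clusters += [a, b]
        return clusters

    # Hypothetical data: 200 utterances with random ratings on the five scales.
    ratings = np.random.default_rng(1).uniform(1, 5, (200, 5))
    for c, idx in enumerate(divisive_clustering(ratings)):
        centroid = ratings[idx].mean(axis=0).round(2)
        print(f"cluster {c}: {len(idx)} utterances, centroid {centroid}")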
Florian Metze is research faculty at Carnegie Mellon University's Language Technologies Institute, working on fundamental problems in speech recognition and understanding, user interfaces, and related areas. He is also associate director of the InterACT center at Carnegie Mellon. He holds a PhD from Universität Karlsruhe (TH), for a thesis on "Articulatory features for conversational speech recognition". His main research focus is on techniques for acoustic modeling for speech recognition and multimedia analysis. Currently, he is working on discriminative techniques for sub-phonetic and multi-lingual modeling, as well as the extraction of paralinguistic information (emotions, personality, etc.) from speech, with a goal of using this information in user interfaces, or for data mining. He is one of the authors and maintainers of the Ibis decoder and the Janus speech recognition toolkit.
Abstract: In this talk, I present recent case studies that highlight the potential for multimedia retrieval of online data to support real-world attacks. Multimedia Retrieval, i.e., the task of matching and comparing multimedia content across databases, has rapidly emerged as a field with highly useful applications in many different domains. Serious efforts in this area can be traced back to the early 1990s, when devices such as digital cameras and camera phones, combined with progress in compression technology and availability of Internet connectivity, significantly changed people's lives. This rapid technological progress created a strong demand for organizing and accessing multimedia data automatically. Consequently, researchers from different areas of computer science, including computer vision, speech processing, natural language processing, Semantic Web, and databases, invested significant effort into the development of convenient and efficient retrieval mechanisms that target different types of audio and video data from large, and potentially remote, databases. Technologies developed include speech recognition, face recognition, speaker identification, visual object retrieval, and many others. While retrieval speed, flexibility, and accuracy are still research problems, this talk will demonstrate that they are not the only ones. This talk aims to raise awareness of a rapidly emerging privacy threat that we termed "cybercasing": leveraging information available online to mount real-world attacks. Based on the initial example of geo-tagging, I will show that while users typically realize that sharing information, e.g., on social networks, has some implications for their privacy, many users 1) are unaware of the full scope of the threat they face when doing so, and 2) often do not even realize when they publish such information. The threat is elevated by recent developments that make systematic search for information and inference from multiple sources easier than ever before. Moreover, even with relatively high error rates, multimedia retrieval techniques can be used effectively for different real-world attacks by using "lop-sided" tuning, for example by favoring low false alarm rates over high hit rates when scanning for potential victims to attack. This talk presents a set of scenarios demonstrating how easy it is to correlate data, especially location information, with corresponding publicly available information to compromise a victim's privacy.
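The "lop-sided" tuning mentioned above can be illustrated with a short sketch: an attacker scanning many candidates picks a decision threshold that keeps false alarms very rare and accepts a low hit rate in return. The scores below are simulated toy values, not results from the talk.

    # Toy sketch of threshold tuning that favors a low false alarm rate over a
    # high hit rate. Scores are simulated, not from any real retrieval system.
    import numpy as np

    rng = np.random.default_rng(0)
    neg = rng.normal(0.0, 1.0, 100_000)   # retrieval scores for non-matching items
    pos = rng.normal(2.0, 1.0, 1_000)     # retrieval scores for true matches

    target_far = 0.001                     # tolerate only 0.1% false alarms
    threshold = np.quantile(neg, 1.0 - target_far)
    hit_rate = np.mean(pos >= threshold)
    false_alarm_rate = np.mean(neg >= threshold)

    print(f"threshold={threshold:.2f}  hit rate={hit_rate:.2%}  "
          f"false alarms={false_alarm_rate:.3%}")
    # Even if most true matches are missed, the few confident hits are enough
    # to flag promising victims while wasting little attacker effort.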
Gerald Friedland is a senior research scientist at the International Computer Science Institute, a private lab affiliated with the University of California, Berkeley, where he leads multimedia content analysis research, mostly focusing on acoustic techniques such as speaker diarization and acoustic event detection. He is currently ICSI's PI on the IARPA ALADDIN project on large-scale video event detection, together with CMU, Sarnoff, Cycorp Inc., UMass, and the University of Central Florida; Co-PI of an NSF-funded project, "Understanding and Managing the Impact of Global Inference on Online Privacy"; and ICSI's PI on a project on multimodal location detection (funded by NGA) together with UC Berkeley. He is also a member of the Executive Advisory Board of UC Berkeley's Opencast project. Until 2009 he was site manager in the EU-funded project AMIDA and the Swiss-funded IM2 project, both of which explored multimedia meeting analysis. He co-founded the IEEE International Conference on Semantic Computing and is a proud founder and program director of the IEEE International Summer School on Semantic Computing at UC Berkeley. Dr. Friedland has published more than 100 peer-reviewed articles in conferences, journals, and books and is currently authoring a new textbook on multimedia computing together with Dr. Ramesh Jain. He is an associate editor for ACM Transactions on Multimedia Computing, Communications, and Applications and serves on the organizing committee of ACM Multimedia 2011. He is the recipient of several research and industry recognitions, among them the European Academic Software Award and the Multimedia Entrepreneur Award by the German Federal Department of Economics. Most recently, he led the team that won the ACM Multimedia Grand Challenge in 2009. Dr. Friedland received his master's degree and doctorate (summa cum laude) in computer science from Freie Universitaet Berlin, Germany, in 2002 and 2006, respectively.
Abstract: Mimicking the efficiency and robustness with which the human brain represents information remains a core challenge in artificial intelligence research. Recent neuroscience findings have provided insight into the principles governing information representation in the mammalian brain. These findings motivated the emergence of the subfield of deep machine learning (DML), which focuses on computational models for information representation that exhibit characteristics similar to those of the neocortex. DML offers the ability to effectively process high-dimensional data that may exhibit broad temporal dependencies. This is achieved by employing hierarchical architectures that learn to capture salient spatiotemporal features based on regularities in the observations. In the context of speech processing, DML has particular relevance in delivering rich features that can enhance applications such as speech recognition and speaker biometrics. In this talk, I will review recent results and chart the future of DML as a field that is bound to have great impact on many areas pertaining to machine learning and intelligent control.
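As a rough illustration of the hierarchical feature learning described above, the sketch below greedily stacks two small tied-weight autoencoders so that the features learned by one layer become the input of the next. It is a generic toy example on random data, not the specific architecture discussed in the talk.

    # Toy sketch of greedy layer-wise deep feature learning: each layer is a small
    # tied-weight sigmoid autoencoder trained to reconstruct its own input.
    import numpy as np

    def train_autoencoder(X, n_hidden, lr=0.1, epochs=200, seed=0):
        """Train one tied-weight sigmoid autoencoder with batch gradient descent."""
        rng = np.random.default_rng(seed)
        n_in = X.shape[1]
        W = rng.normal(0, 0.1, (n_in, n_hidden))
        sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
        for _ in range(epochs):
            H = sigmoid(X @ W)              # encode
            R = sigmoid(H @ W.T)            # decode with the same (tied) weights
            err = R - X                      # reconstruction error
            dR = err * R * (1 - R)           # backprop through decoder sigmoid
            dH = (dR @ W) * H * (1 - H)      # backprop through encoder sigmoid
            grad = X.T @ dH + dR.T @ H       # gradient w.r.t. the shared W
            W -= lr * grad / X.shape[0]
        return W, sigmoid

    # Stack two layers: the features of layer 1 become the input of layer 2.
    X = np.random.default_rng(1).random((500, 64))   # hypothetical observations
    W1, act = train_autoencoder(X, 32)
    H1 = act(X @ W1)
    W2, _ = train_autoencoder(H1, 16)
    H2 = act(H1 @ W2)                                # hierarchical features
    print(H2.shape)                                  # (500, 16)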
Itamar Arel is an Associate Professor in the Department of Electrical Engineering and Computer Science at The University of Tennessee, where he directs the Machine Intelligence Lab. He is a member of the Center for Intelligent Systems and Machine Learning at the University of Tennessee, a co-founder of the Biologically-inspired Cognitive Architectures (BICA) Society, and a senior member of the IEEE. His research focuses on high-performance machine learning architectures and algorithms, with emphasis on deep learning architectures, reinforcement learning, and decision making under uncertainty. Dr. Arel received the US Department of Energy Early Career Principal Investigator (CAREER) award in 2004. He holds B.S., M.S., and Ph.D. degrees in Electrical and Computer Engineering and an M.B.A. degree, all from Ben-Gurion University in Israel.
Abstract: The magnitude spectrum of any audio signal may be viewed as a density function or (in the case of discrete frequency spectra) a histogram with the frequency axis as its support. In this talk I will describe how this perspective allows us to perform spectral decompositions through a latent-variable model that enables us to extract underlying, or "latent", spectral structures that additively compose the speech spectrum. I show how such decompositions can be used for varied purposes such as bandwidth expansion of narrow-band speech, component separation from mixed monaural signals, and denoising. I then explain how the basic latent-variable model may be extended to derive sparse overcomplete decompositions of speech spectra, and describe how the model extends simply to example-based representations and representations that conform to other compositional constraints. I demonstrate through examples that such decompositions can be utilized not only for improved speaker separation from mixed monaural recordings, but also to extract the building blocks of other data such as images and text. Finally, I present shift- and transform-independent extensions of the model, through which it becomes possible to automatically extract repeating themes within sounds, and show how this can be applied to problems of dereverberation and pitch tracking. If time permits, I will also talk about newer developments addressing relationships between non-negativity and independence, as well as the use of attractive and repulsive priors that can be used to guide decompositions, with simulations and results.
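The additive, histogram-like view of magnitude spectra described above can be illustrated with a minimal decomposition sketch. The code below uses multiplicative updates for the KL divergence (the non-negative matrix factorization form that corresponds, up to normalization, to the latent-variable/PLCA view); the spectrogram is a random stand-in, and the routine is a generic sketch rather than the speaker's exact model.

    # Minimal sketch: decompose a non-negative spectrogram V (freq x time) into
    # K additive latent spectral components, V ~ W @ H, via KL-divergence
    # multiplicative updates. Not the speaker's exact algorithm.
    import numpy as np

    def latent_spectral_decomposition(V, K=20, n_iter=200, eps=1e-9, seed=0):
        rng = np.random.default_rng(seed)
        F, T = V.shape
        W = rng.random((F, K)) + eps   # K latent spectral "building blocks"
        H = rng.random((K, T)) + eps   # their time-varying activations
        ones = np.ones_like(V)
        for _ in range(n_iter):
            R = W @ H + eps                            # current reconstruction
            W *= ((V / R) @ H.T) / (ones @ H.T + eps)  # update spectral components
            R = W @ H + eps
            H *= (W.T @ (V / R)) / (W.T @ ones + eps)  # update activations
        return W, H

    # Toy usage: a random non-negative matrix stands in for |STFT| of a signal.
    V = np.abs(np.random.default_rng(1).standard_normal((257, 400)))
    W, H = latent_spectral_decomposition(V, K=10)
    print("relative reconstruction error:",
          np.linalg.norm(V - W @ H) / np.linalg.norm(V))

In this kind of model, tasks such as denoising or monaural separation amount to keeping, grouping, or resynthesizing subsets of the learned components and their activations.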
Dr. Bhiksha Raj is an Associate Professor in the Language Technologies Institute at Carnegie Mellon University. Dr. Raj received his PhD, also from Carnegie Mellon University, in 2000. Prior to joining CMU in 2008, he led the effort on speech and audio processing at Mitsubishi Electric Research Labs in Cambridge, MA. Dr. Raj's research spans the areas of robust automatic speech recognition, microphone array processing, audio processing, computer audition, and machine learning.