Speech Technology and Research (STAR) Laboratory Seminar Series
Past talks: 2007
Abstract:
Nasalization refers to the process of speech production in which significant amounts of airflow and sound energy are transmitted through the nasal cavity. Phonetically, nasalization is essential for producing certain phonemes in most languages, and it can also be a normal consequence of coarticulation. In some disordered speech, however, inappropriate nasalization can be one of the causes of reduced intelligibility. Instrumental measurement and analysis techniques are needed to better understand the relationship between the physiological status and the acoustic effects of nasalization during speech. Automatic detection of different oral-nasal articulatory configurations during speech is useful both for understanding normal nasalization and for assessing certain speech disorders. We propose an approach to extract nasalization features from dual-channel acoustic signals acquired with a simple two-microphone setup. The features are derived from a dual-channel acoustic model and the associated analysis method. In the talk, I will explain the dual-channel acoustic model, the derivation of the analysis method, the simulation experiment with an articulatory synthesizer, and how features are extracted from dual-channel acoustic data. Comparative experimental results on classification tasks will be presented to show the advantage of the dual-channel analysis method. Potential applications will also be discussed.
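As a rough illustration of the kind of measurement a two-microphone setup makes possible (this is not the speaker's actual dual-channel model or feature), a frame-level nasal-to-total energy ratio, similar in spirit to acoustic nasalance, could be computed as follows; the frame sizes assume 16 kHz audio and NumPy arrays for both channels.

import numpy as np

def frame_energy(x, frame_len=400, hop=160):
    # Short-time energy per frame (25 ms frames, 10 ms hop at 16 kHz).
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.array([np.sum(x[i * hop:i * hop + frame_len] ** 2) for i in range(n_frames)])

def nasal_energy_ratio(oral, nasal, eps=1e-10):
    # Frame-wise nasal / (oral + nasal) energy ratio from a dual-channel recording;
    # higher values suggest stronger nasal coupling in that frame.
    e_oral, e_nasal = frame_energy(oral), frame_energy(nasal)
    n = min(len(e_oral), len(e_nasal))
    return e_nasal[:n] / (e_oral[:n] + e_nasal[:n] + eps)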
Abstract:
I will describe an approach to improve Statistical Machine Translation performance using multi-lingual, parallel, sentence-aligned corpora in several bridge languages. I will start with an overview of the statistical translation system at Google. I will then present our method to utilize a bridge language to create a word alignment system and outline a procedure for combining word alignment systems from multiple bridge languages. The final translation is obtained by consensus decoding that combines hypotheses obtained using all bridge language word alignments. I will present experiments showing that multilingual, parallel text in Spanish, French, Russian, and Chinese can be utilized in this framework to improve translation performance on an Arabic-to-English task. This is joint work with Franz Och and Wolfgang Macherey in the language translation team at Google.
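For readers unfamiliar with consensus decoding, the sketch below illustrates the general idea under simplifying assumptions: each bridge-language alignment system yields a translation hypothesis, and the output is the hypothesis that agrees most, on average, with the others. The n-gram overlap used here is a stand-in similarity, not the actual loss used in the system described in the talk.

from collections import Counter

def ngram_overlap(h, r, n=2):
    # Symmetric n-gram overlap between two tokenized hypotheses (illustrative metric).
    hc = Counter(zip(*[h[i:] for i in range(n)]))
    rc = Counter(zip(*[r[i:] for i in range(n)]))
    common = sum((hc & rc).values())
    return 2.0 * common / max(1, sum(hc.values()) + sum(rc.values()))

def consensus_pick(hypotheses):
    # Pick the hypothesis most similar on average to all others (MBR-style consensus).
    return max(hypotheses, key=lambda h: sum(ngram_overlap(h, r) for r in hypotheses if r is not h))

# One tokenized hypothesis per bridge-language word alignment system (toy data).
hyps = [["the", "treaty", "was", "signed"],
        ["the", "accord", "was", "signed"],
        ["the", "treaty", "was", "ratified"]]
print(consensus_pick(hyps))   # -> ['the', 'treaty', 'was', 'signed']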
Abstract:
We propose two new approaches for measuring adaptation between dialogs. These approaches permit measurement of adaptation both to the conversational partner (partner adaptation) and to the local dialog context (recency adaptation), and can be used with different types of features. We used these measures to study adaptation in the Maptask corpus of spoken dialogs. We show that for syntactic features, recency adaptation is stronger than partner adaptation; however, we find no significant differences for lexical adaptation using these measures.
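A toy version of such a measure (much simpler than the measures in the talk, but illustrating the framing) counts how often a turn re-uses features that occurred in the preceding turns, restricted either to turns by the partner or to turns by the same speaker, within a recency window; the turn representation and window size are assumptions.

def repetition_rate(dialog, source="partner", window=10):
    # dialog: list of (speaker, set_of_features) per turn, in order.
    # Fraction of a turn's features already used in the preceding `window` turns,
    # counting prior turns by the partner ("partner"), the same speaker ("self"),
    # or anyone ("any").
    repeats = total = 0
    for i, (spk, feats) in enumerate(dialog):
        prior = set()
        for pspk, pfeats in dialog[max(0, i - window):i]:
            if source == "any" or (source == "partner") == (pspk != spk):
                prior |= pfeats
        total += len(feats)
        repeats += len(feats & prior)
    return repeats / total if total else 0.0

# e.g. syntactic-rule features per turn in a two-party dialog (toy data)
dlg = [("A", {"NP->DT NN"}), ("B", {"NP->DT NN", "VP->VB NP"}), ("A", {"VP->VB NP"})]
print(repetition_rate(dlg, source="partner"))   # -> 0.5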
Demo:
Interactive Calendar Management Tool. We will present a calendar management tool using a natural language (optionally spoken) interaction. This system allows the user to create, modify and remove calendar events, query for events, and hear descriptions of events. In our demonstration we will focus on two aspects of the RavenCalendar platform: its flexible approach to language understanding and dialog management, and its multimodal interface. RavenCalendar is a multimodal dialog system built around the Google Calendar and Google Maps Web applications.
Abstract:
Speech recognition for agglutinative and highly inflectional languages is a challenging task. In agglutinative languages, many new words can be derived from a single stem by concatenating several suffixes. This suffixation process causes the word vocabulary to expand significantly. Therefore, moderate-size recognition vocabularies result in a large number of Out-of-Vocabulary (OOV) words. The easiest way to handle the OOV problem is to increase the vocabulary size; however, very large vocabularies suffer from non-robust language model estimates. For that reason, sub-words are investigated as language modeling units. Although sub-words alleviate those problems, they introduce new challenges such as over-generated and ungrammatical sub-word sequences. This research is an attempt to solve the main problems of both the word and sub-word approaches for agglutinative languages. The proposed techniques are explored on Turkish newspaper content and broadcast news (BN) transcription tasks. Our novel approaches, lexical-form sub-word units and correction of sub-word sequences, improve the baseline performance. At the end of the talk, demonstration videos of our BN transcription and retrieval systems will be shown.
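To make the sub-word idea concrete, here is a toy segmentation of Turkish-like words into a stem followed by marked suffix units, as they might appear to a sub-word language model. The suffix inventory and the '+' marker convention are purely illustrative; the lexical-form units studied in this work come from proper morphological analysis rather than a hand-written list.

SUFFIXES = ["ler", "lar", "de", "da"]   # tiny illustrative suffix inventory

def segment(word):
    # Greedily strip suffixes from the right; '+' marks a word-internal unit so the
    # original word can be reassembled after recognition.
    units = []
    while True:
        for suf in SUFFIXES:
            if word.endswith(suf) and len(word) - len(suf) >= 2:
                units.insert(0, "+" + suf)
                word = word[:-len(suf)]
                break
        else:
            break
    return [word] + units

print(segment("evlerde"))   # -> ['ev', '+ler', '+de']  ("in the houses")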
Abstract:
We present a new approach for keyword spotting which is not based on HMMs. Unlike previous approaches, the proposed method employs a discriminative learning procedure, in which the learning phase aims at maximizing the area under the ROC curve, as this quantity is the most common measure for evaluating keyword spotters. The keyword spotter we devise is based on mapping the input acoustic representation of the speech utterance, along with the target keyword, into a vector space. Building on techniques used for large margin and kernel methods for predicting whole sequences, our keyword spotter distills to a classifier in this vector space which separates speech utterances in which the keyword is uttered from speech utterances in which it is not. We describe a simple iterative algorithm for training a keyword spotter and discuss its formal properties. Experiments with the TIMIT corpus show that our method outperforms the conventional HMM-based approach. Further experiments using the TIMIT-trained model but testing on the WSJ dataset show that, without further training, our method again outperforms the conventional HMM-based approach.
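A minimal sketch of the pairwise large-margin idea behind AUC maximization follows; it assumes each utterance-keyword pair has already been mapped to a fixed-length feature vector and uses a simple passive-aggressive update, so it is only an approximation of the algorithm discussed in the talk. Requiring every positive utterance to outscore every negative one by a margin corresponds to driving the area under the ROC curve towards 1.

import numpy as np

def train_keyword_spotter(pos_feats, neg_feats, epochs=10, C=1.0):
    # pos_feats / neg_feats: feature vectors phi(utterance, keyword) for utterances
    # that do / do not contain the keyword. For every (positive, negative) pair the
    # positive score should beat the negative score by a margin of 1; violations
    # trigger a passive-aggressive update of the weight vector.
    w = np.zeros(len(pos_feats[0]))
    for _ in range(epochs):
        for xp in pos_feats:
            for xn in neg_feats:
                diff = xp - xn
                loss = max(0.0, 1.0 - float(w @ diff))
                if loss > 0.0:
                    w += min(C, loss / (float(diff @ diff) + 1e-12)) * diff
    return w

def keyword_score(w, feats):
    # Higher score = stronger evidence that the keyword was uttered.
    return float(w @ feats)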
Abstract:
Speech databases lack efficient interfaces for exploring information along time. We introduce an interactive timeline that helps the user browse an audio stream on a large time scale and recontextualize targeted information. Time can be explored at different granularities using synchronized scales. We try to take advantage of automatic transcription to generate a conceptual structure of the database. The timeline is annotated with two elements to reflect the distribution of information relevant to a user need. Information density is computed using an information retrieval model and displayed as a continuous shade on the timeline, whereas anchorage points are expected to provide a stronger structure and to guide the user through the exploration. These points are generated using an extractive summarization algorithm. We present a prototype implementing the interactive timeline to browse broadcast news recordings.
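As a simple illustration of the information-density annotation (the information retrieval model in the prototype may differ), each time window of the automatic transcript can be scored against the user's query with a tf.idf-style weight, treating windows as documents, and the scores rendered as shades on the timeline.

import math
from collections import Counter

def density_timeline(transcript, query, window=30.0, step=5.0):
    # transcript: list of (start_time_in_seconds, word) from automatic transcription.
    # Returns (window_start, score) pairs giving a density curve along the timeline.
    if not transcript:
        return []
    starts, t = [], 0.0
    while t <= transcript[-1][0]:
        starts.append(t)
        t += step
    windows = [Counter(w for s, w in transcript if t0 <= s < t0 + window) for t0 in starts]
    df = Counter()
    for tf in windows:
        df.update(tf.keys())
    idf = lambda term: math.log((1 + len(windows)) / (1 + df[term]))
    return [(t0, sum(tf[q] * idf(q) for q in query)) for t0, tf in zip(starts, windows)]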
Abstract:
Charisma, the ability to lead by virtue of personality alone, is difficult to define but relatively easy to identify. However, cultural factors clearly affect perceptions of charisma. In this talk we compare results from parallel perception studies investigating native speaker judgments of the charismatic status of speech in Palestinian Arabic and Standard American English. We examine acoustic/prosodic and lexical correlates of charisma ratings to determine how the two cultures differ with respect to their views of charisma. (This is joint work with Fadi Biadsy and Andrew Rosenberg, with special thanks also to Wisam Dakka.)
Abstract:
Hierarchical phrase-based models for statistical machine translation were introduced by David Chiang in 2005, at the University of Maryland. They represent a middle ground between translation models based on surface strings, which ignore linguistic structure but successfully capture many statistical regularities, and syntactic models of translation, which rely on treebank-style structures provided by monolingual parsers. In this talk, I will review the basics of the hierarchical phrase-based approach and describe several new developments leading to improvements in translation of spoken language input and morphologically rich source languages. (Grad student contributors to the work in this talk include Chris Dyer, Nitin Madnani, and Adam Lopez.)
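To fix ideas, a hierarchical phrase pair is a synchronous grammar rule whose phrases may contain gaps, such as Chiang's example rule X -> < X_1 de X_2 , the X_2 of X_1 > learned from Chinese-English text. The data layout below is illustrative, not a fragment of an actual decoder.

# A hierarchical rule with two gaps; source and target sides share the gap labels.
rule = {
    "lhs": "X",
    "src": ["X_1", "de", "X_2"],
    "tgt": ["the", "X_2", "of", "X_1"],
}

def apply_rule(rule, fillers):
    # Substitute sub-translations for the gaps on the target side.
    return [fillers.get(tok, tok) for tok in rule["tgt"]]

# e.g. X_1 already translated as "China", X_2 as "economic development"
print(" ".join(apply_rule(rule, {"X_1": "China", "X_2": "economic development"})))
# -> the economic development of China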
Abstract:
Statistical machine translation (SMT) systems learn probabilistic translation patterns from word-aligned bilingual text. If the bilingual text is parsed, we can learn interesting translation rules that are sensitive to syntactic structure. These rules can then be used to translate previously-unseen sentences; in the talk, we give some results from the ISI syntax-based translation system. We further find that we can improve translation accuracy if we manipulate the training trees and alignments. We describe those techniques and demonstrate better translation results.
Abstract:
Some of our current efforts are focused on extending traditional phrase-based statistical machine translation models to include additional annotation, be it linguistic markup or automatically generated word classes. We integrate this information in an approach called "factored translation models". One problem that we face with these richer statistical translation models is effective parameter estimation. This is part of the motivation for another focus of our current work: the discriminative training of translation models using millions of features.
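A small illustration of the factored representation follows, using the common Moses-style convention of '|'-separated factors (surface form, lemma, part of speech, morphology); the exact factors used in the talk may differ. Translation and generation steps in a factored model can then be defined over any combination of these factors.

# Each token carries parallel factors: surface|lemma|POS|morphology (illustrative).
sentence = [
    "houses|house|NN|plural",
    "were|be|VBD|past",
    "built|build|VBN|participle",
]

def factor(token, index):
    # Return one factor of a factored token (0 = surface, 1 = lemma, 2 = POS, 3 = morphology).
    return token.split("|")[index]

print([factor(tok, 1) for tok in sentence])   # -> ['house', 'be', 'build']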
Abstract:
This talk presents a way to perform speaker adaptation for automatic speech recognition using the stream weights in a multi-stream setup, which included acoustic models for "Articulatory Features" (AFs) such as "Rounded" or "Voiced". We present supervised speaker adaptation experiments on a spontaneous speech task and compare the above stream-based approach to conventional approaches, in which the models, rather than the stream combination weights, are adapted. In the approach we present, stream weights model the importance of features such as "Voiced" for word discrimination, which offers a descriptive interpretation of the adaptation parameters. We also present results on ASR using AFs on the RT-04S "Meeting" task.
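As a minimal sketch of the multi-stream combination whose weights are being adapted (the weighting scheme and normalization here are assumptions), the acoustic score can be taken as a weighted sum of per-stream log-likelihoods, with one stream for the main acoustic model and one per articulatory-feature detector; adaptation then re-estimates the weights rather than the models.

import numpy as np

def combined_score(stream_log_likelihoods, stream_weights):
    # Weighted sum of per-stream log-likelihoods for one frame/state.
    return float(np.asarray(stream_weights) @ np.asarray(stream_log_likelihoods))

# e.g. main acoustic model plus "Voiced" and "Rounded" detectors (toy numbers);
# speaker adaptation would update the three weights, leaving the models untouched.
print(combined_score([-42.0, -3.5, -5.1], [0.8, 0.15, 0.05]))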
Florian Metze first became interested in speech recognition while living abroad (in Scotland and France), realizing that understanding speech is not something that happens automatically, as he had previously assumed. After receiving a Diploma in Theoretical Physics from the Ludwig-Maximilians-Universität (LMU) in München, his hometown, in December 1998, he spent a few months with Mercer Management Consulting in München, then moved to Karlsruhe to pursue doctoral studies with Prof. Waibel at the Interactive Systems Laboratories (ISL). During that time, he worked on several German and European research projects (Verbmobil, Nespole!, FAME, TC-Star, CHIL) and participated in NIST evaluations (RT-03 CTS, RT-04 Meeting). His doctoral thesis, defended in December 2005, was on "Articulatory Features for Conversational Speech Recognition". He is now with Deutsche Telekom Laboratories in Berlin, acting as a link between Research and Innovation teams in the area of "Usability", with a focus on speech recognition and related topics such as translation, multi-modal interfaces, and meta-data extraction.
Abstract:
In commercial applications based on speech recognition, the quality of the human-computer interaction is still far from effective, from both the user and the service provider perspectives. To improve the effectiveness and user acceptance of automated dialogue systems, it is necessary to advance the state of the art of Spoken Language Understanding (SLU) research along two directions. First, SLU models have to be tightly coupled with the upstream (Automatic Speech Recognition: ASR) and downstream (Dialog Management: DM) processes. Second, SLU models have to be part of an adaptive component whose parameters are updated on-line based on the outcome of dialog strategies and on a-priori or a-posteriori knowledge. In the framework of the European project LUNA, this talk presents the SLU strategies developed at the University of Avignon in these two directions. These strategies are evaluated on two kinds of corpora: an "academic" dialog corpus collected through the French evaluation program Technolangue/MEDIA, and a large "realistic" dialog corpus made of system logs from a widely deployed Spoken Dialog System by France Telecom R&D (the France Telecom 3000 Voice Agency service). This talk will discuss the differences between these two kinds of corpora and the new opportunities for academic research offered by the very large dialog corpora obtained through deployed Spoken Dialog Systems.
Frederic Bechet obtained his PhD in computer science in 1994 and has been an Assistant Professor at the University of Avignon (France) since 1995. He was an invited professor for one year at the AT&T Research Lab in Florham Park, New Jersey, USA, from August 2001 until September 2002, working within the How May I Help You? research project. Frederic Bechet is the author or coauthor of over 40 refereed papers in journals and international conferences. His main research interests are:
- Spoken Language Understanding
- Language Models
- Shallow parsing (Chunking, POS tagging, Named Entity tagging)
- Linguistic aspects of Text-to-Speech synthesis
Abstract:
Speech summarization technology, which extracts important information and removes irrelevant information from speech, is expected to play an important role in building speech archives and improving the efficiency of spoken document retrieval. However, speech summarization has a number of significant challenges that distinguish it from general text summarization. Fundamental problems with speech summarization include speech recognition errors, disfluencies, and difficulties of sentence segmentation. Typical speech summarization systems consist of speech recognition, sentence segmentation, sentence extraction, and sentence compaction components. Most research up to now has focused on sentence extraction, using LSA (Latent Semantic Analysis), MMR (Maximal Marginal Relevance), or feature-based approaches, among which no decisive method has yet been found. Proper sentence segmentation is also essential to achieve good summarization performance. How to objectively evaluate speech summarization results is also an important issue. Several measures, including families of SumACCY and ROUGE measures, have been proposed, and correlation analyses between subjective and objective evaluation scores have been performed. Although these measures are useful for ranking various summarization methods, they do not correlate well with human evaluations, especially when spontaneous speech is targeted.
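For reference, here is a generic sketch of MMR-style sentence extraction, one of the approaches mentioned above: repeatedly select the sentence most relevant to the query (or document centroid) while penalizing redundancy with already-selected sentences. Sentence and query vectors are assumed to be L2-normalized so that cosine similarity reduces to a dot product; this is not a description of any specific system discussed in the talk.

import numpy as np

def mmr_extract(sent_vecs, query_vec, k=5, lam=0.7):
    # sent_vecs: list of L2-normalized sentence vectors; query_vec: L2-normalized query.
    # Returns indices of the k extracted sentences in selection order.
    selected, candidates = [], list(range(len(sent_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = float(sent_vecs[i] @ query_vec)
            redundancy = max((float(sent_vecs[i] @ sent_vecs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected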