Speaker Recognition and TalkPrinting
Project Summary
Standard approaches to automatic speaker recognition use
spectrum-related features based on very short time slices of speech.
Models based on such information suffer from a lack of robustness to
channel mismatches, and fail to capture longer-range characteristics of
how a person talks, including the speaker's word patterns, and patterns in
speech prosody (the timing, pausing, and intonation of speech). The
goal of our project is to discover "TalkPrint" features -- features
that capture these habitual variations in speaking style, and to model
them in conjunction with standard features to improve automatic speaker
recognition.
One core technical challenge in this work is to design long-range
features (which by definition occur less frequently than very
short-range features) that provide robust additional information even
for short (e.g., 30 seconds) training and test spurts of speech. A
second crucial challenge area is to develop methods for feature
selection and model combination at the feature level, that can cope
with large numbers of interrelated features, odd feature space
distributions, inherent missing features (such as pitch when a person
is not voicing), and heterogeneous feature types. A third issue is how
to employ TalkPrint features successfully for a new language and
across languages, since traditional speech recognition and derived
features are inherently language-dependent. We are also investigating
how to make TalkPrinting systems more robust to channel
Recently we developed a series of novel techniques for speaker
modeling, both in the stylistic and the acoustic realm, as well as a
new method for model combination. Many of these techniques leverage a
tight integration with our prosody
modeling and large-vocabulary speech
recognition efforts. We have evaluated our techniques against
state-of-the-art speaker recognition systems in the annual
NIST Speaker Recognition Evaluations,
with excellent results.
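One challenge named in the summary above is that some features are inherently missing, such as pitch when a person is not voicing. The following is a minimal illustrative sketch, not the project's actual feature extraction: it assumes a hypothetical frame-level pitch track in which unvoiced frames are marked with None, and the function name and output keys are invented for the example. It computes log-pitch statistics over voiced frames only and reports the statistics as missing when too little voiced speech is available, rather than substituting a placeholder value.

```python
import math
from statistics import mean, stdev


def pitch_stats(f0_track):
    """Summarize a frame-level pitch track, skipping unvoiced frames.

    f0_track: list of F0 values in Hz, with None for frames where pitch is
    undefined (unvoiced speech or silence). This input format is an
    assumption made for illustration.
    """
    n_frames = len(f0_track)
    # Keep only voiced frames and move to the log domain.
    log_f0 = [math.log(f0) for f0 in f0_track if f0 is not None]
    voiced_fraction = len(log_f0) / n_frames if n_frames else 0.0

    if len(log_f0) < 2:
        # Too little voiced speech: mark the statistics as missing instead
        # of filling in a value such as zero.
        return {"voiced_fraction": voiced_fraction,
                "log_f0_mean": None,
                "log_f0_std": None}

    return {"voiced_fraction": voiced_fraction,
            "log_f0_mean": mean(log_f0),
            "log_f0_std": stdev(log_f0)}


if __name__ == "__main__":
    track = [None, 110.0, 118.0, None, None, 125.0, 121.0, None]
    print(pitch_stats(track))
```

Keeping an explicit missing marker leaves it to the downstream model to decide how to treat the gap, which matters most for the short training and test samples discussed above.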
Here is a web-based overview of SRI's unique
speaker verification process modeled after the NIST task.
Murat Akbacak
Harry Bratt
Lukas Burget
Luciana Ferrer
Yun Lei
Martin Graciarena
Nicolas Scheffer
Elizabeth Shriberg
Andreas Stolcke
A. Stolcke, M. Graciarena, & L. Ferrer (2012),
Effects of Audio and ASR Quality on Cepstral and High-level Speaker
Verification Systems,
Proc. Odyssey Speaker and Language Recognition Workshop,
pp. 298-303, Singapore.
A. Stolcke, A. Mandal, & E. Shriberg (2012),
Speaker Recognition With Region-Constrained MLLR Transforms,
Proc. IEEE ICASSP, pp. 4397-4400, Kyoto.
N. Scheffer, Y. Lei, & L. Ferrer (2011),
Factor analysis back ends for MLLR transforms in speaker recognition,
Proc. Interspeech, pp. 257-260, Florence.
M. Kockmann, L. Ferrer, L. Burget, & J. H. Černocký (2011),
iVector fusion of prosodic and cepstral features for speaker verification,
Proc. Interspeech, pp. 265-268, Florence.
M. H. Sanchez, L. Ferrer, E. Shriberg, & A. Stolcke (2011),
Constrained Cepstral Speaker Recognition Using Matched UBM and JFA Training,
Proc. Interspeech, pp. 737-740, Florence.
M. Akbacak, D. Vergyri, A. Stolcke, N. Scheffer, & A. Mandal (2011),
Effective Arabic Dialect Classification Using Diverse Phonotactic Models,
Proc. Interspeech, pp. 737-740, Florence.
N. Scheffer, L. Ferrer, M. Graciarena, S. Kajarekar, E. Shriberg & A. Stolcke (2011),
The SRI NIST 2010 Speaker Recognition Evaluation System,
Proc. IEEE ICASSP, pp. 5292-5295, Prague.
E. Shriberg & A. Stolcke (2011),
Language-independent constrained cepstral features for speaker recognition,
Proc. IEEE ICASSP, pp. 5296-5299, Prague.
M. Kockmann, L. Ferrer, L. Burget, E. Shriberg, & J. Cernocky (2011),
Recent Progress in Prosodic Speaker Verification,
Proc. IEEE ICASSP, pp. 4556-4559, Prague.
M. Graciarena, M. Delplanche, E. Shriberg, & A. Stolcke (2011),
Bird Species Recognition Combining Acoustic and Sequence Modeling,
Proc. IEEE ICASSP, pp. 341-344, Prague.
A. Stolcke, M. Akbacak, L. Ferrer, S. Kajarekar, C. Richey, N. Scheffer, & E. Shriberg (2010),
Improving Language Recognition with Multilingual Phone Recognition and
Speaker Adaptation Transforms,
Proc. Odyssey Speaker and Language Recognition Workshop,
Brno, Czech Republic, pp. 256-262.
L. Ferrer, N. Scheffer, & E. Shriberg (2010),
A Comparison of Approaches for Modeling Prosodic Features in Speaker Recognition,
Proc. IEEE ICASSP, Dallas, Texas, pp. 4414-4417.
S. S. Kajarekar (2010),
Across-phone Variability and Diagonal Term in Joint Factor Analysis,
Proc. IEEE ICASSP, Dallas, Texas, pp. 4406-4409.
N. Scheffer & R. Vogt (2010),
On the Use of Speaker Superfactors for Speaker Recognition,
Proc. IEEE ICASSP, Dallas, Texas, pp. 4410-4413.
L. Ferrer, K. Sonmez, & E. Shriberg (2009).
An anticorrelation kernel for subsystem training in multiple classifier systems,
Journal of Machine Learning Research, Vol. 10, pp. 2079-2114.
E. Shriberg, S. Kajarekar, N. Scheffer (2009).
Does Session Variability Compensation in Speaker Recognition
Model Intrinsic Variation Under Mismatched Conditions?,
Proc. Interspeech, Brighton, UK, pp. 1551-1554.
M. Graciarena, T. Bocklet, E. Shriberg, A. Stolcke, S. Kajarekar (2009).
Feature-Based and Channel-Based Analyses of Intrinsic Variability in Speaker
Verification,
Proc. Interspeech, Brighton, UK, pp. 2015-2018.
S. S. Kajarekar, N. Scheffer, M. Graciarena, E. Shriberg, A. Stolcke,
L. Ferrer, & T. Bocklet (2009),
The SRI NIST 2008 Speaker Recognition Evaluation System,
Proc. IEEE ICASSP, Taipei, pp. 4205-4208.
T. Bocklet and E. Shriberg (2009),
Speaker Recognition Using Syllable-Based Constraints for Cepstral Frame Selection,
Proc. ICASSP, Taipei, Taiwan, pp. 4525-4528.
S. S. Kajarekar, L. Ferrer, A. Stolcke, & E. Shriberg (2008),
Voice-Based Speaker Recognition Combining Acoustic and Stylistic Features,
in N. K. Ratha & V. Govindaraju (eds.),
Advances in Biometrics: Sensors, Algorithms and Systems,
pp. 183-201, Springer, London.
E. Shriberg, M. Graciarena, H. Bratt, A. Kathol, S. Kajarekar, H. Jameel,
C. Richey, & F. Goodman (2008),
Effects of Vocal Effort and Speaking Style on Text-Independent Speaker
Verification,
Proc. Interspeech, pp. 609-612, Brisbane, Australia.
L. Ferrer (2008),
Modeling Prior Belief for Speaker Verification SVM Systems,
Proc. Interspeech, pp. 1385-1388, Brisbane, Australia.
E. Shriberg & A. Stolcke (2008),
The Case for Automatic Higher-Level Features in Forensic Speaker
Recognition,
Proc. Interspeech, pp. 1509-1512, Brisbane, Australia.
S. S. Kajarekar (2008),
Phone-based Cepstral Polynomial SVM System for Speaker Recognition,
Proc. Interspeech, pp. 845-848, Brisbane, Australia.
L. Ferrer, M. Graciarena, A. Zymnis, & E. Shriberg (2008),
System Combination Using Auxiliary Information for Speaker Verification,
Proc. IEEE ICASSP, pp. 4853-4857, Las Vegas.
A. Stolcke, S. Kajarekar, & L. Ferrer (2008),
Nonparametric Feature Normalization for SVM-based Speaker Verification,
Proc. IEEE ICASSP, pp. 1577-1580, Las Vegas.
E. Shriberg, L. Ferrer, S. Kajarekar, N. Scheffer, A. Stolcke,
& M. Akbacak (2008),
Detecting Nonnative Speech Using Speaker Recognition Approaches.
Proc. Odyssey Speaker and Language Recognition Workshop,
Stellenbosch, South Africa.
L. Ferrer, K. Sonmez, & E. Shriberg (2008),
An Anticorrelation Kernel for Improved System Combination in
Speaker Verification.
Proc. Odyssey Speaker and Language Recognition Workshop,
Stellenbosch, South Africa.
A. Stolcke & S. Kajarekar (2008),
Recognizing Arabic Speakers with English Phones.
Proc. Odyssey Speaker and Language Recognition Workshop,
Stellenbosch, South Africa.
A. Stolcke, S. Kajarekar, L. Ferrer, & E. Shriberg (2007),
Speaker Recognition with Session Variability Normalization Based on
MLLR Adaptation Transforms,
IEEE Transactions on Audio, Speech, and Language Processing,
15(7), 1987-1998.
Special issue on speaker and language recognition.
E. Shriberg & L. Ferrer (2007),
A Text-Constrained Prosodic System for Speaker Verification,
Proc. Interspeech/Eurospeech, pp. 1226-1229, Antwerp.
L. Ferrer, K. Sonmez, and E. Shriberg (2007),
A Smoothing Kernel for Spatially Related Features and Its Application to
Speaker Verification,
Proc. Interspeech/Eurospeech, pp. 738-741, Antwerp.
G. Tur, E. Shriberg, A. Stolcke, & S. Kajarekar (2007),
Duration and Pronunciation Conditioned Lexical Modeling for Speaker Verification,
Proc. Interspeech/Eurospeech, pp. 2049-2052, Antwerp.
S. Kajarekar & A. Stolcke (2007),
NAP and WCCN: Comparison of Approaches Using MLLR-SVM Speaker Verification
System,
Proc. IEEE ICASSP, vol. 4, pp. 249-252, Honolulu, Hawaii.
L. Ferrer, E. Shriberg, S. Kajarekar, & K. Sonmez (2007),
Parameterization of Prosodic Feature Distributions for SVM Modeling
in Speaker Recognition,
Proc. IEEE ICASSP, vol. 4, pp. 233-236, Honolulu, Hawaii.
M. Graciarena, S. Kajarekar, A. Stolcke, E. Shriberg (2007),
Noise Robust Speaker Identification for Spontaneous Arabic Speech,
Proc. IEEE ICASSP, vol. 4, pp. 245-248, Honolulu, Hawaii.
A. Stolcke, E. Shriberg, L. Ferrer, S. Kajarekar, K. Sonmez, & G. Tur (2007),
Speech Recognition as Feature Extraction for Speaker Recognition,
Proc. SAFE 2007: Workshop on Signal Processing Applications for Public
Security and Forensics,
pp. 39-43, Washington, D.C.
A. O. Hatch, S. Kajarekar, & A. Stolcke (2006),
Within-Class Covariance Normalization for SVM-based Speaker Recognition.
Proc. ICSLP, pp. 1471-1474, Pittsburgh.
S. S. Kajarekar, H. Bratt, E. Shriberg, & R. de Leon (2006),
A Study of Intentional Voice Modifications for
Evading Automatic Speaker Recognition.
Proc. IEEE Odyssey 2006 Speaker and Language Recognition Workshop,
San Juan, Puerto Rico.
A. Stolcke, L. Ferrer, & S. Kajarekar (2006),
Improvements in MLLR-Transform-based Speaker Recognition.
Proc. IEEE Odyssey 2006 Speaker and Language Recognition Workshop,
pp. 1-6, San Juan, Puerto Rico.
L. Ferrer, E. Shriberg, S. S. Kajarekar, A. Stolcke, K. Sonmez,
A. Venkataraman, & H. Bratt (2006),
The Contribution of Cepstral and Stylistic Features to SRI's 2005 NIST
Speaker Recognition Evaluation System.
Proc. IEEE ICASSP, vol. 1, pp. 101-104, Toulouse.
A. O. Hatch and A. Stolcke (2006),
Generalized Linear Kernels for One-Versus-All Classification:
Application to Speaker Recognition.
Proc. IEEE ICASSP, vol. 5, pp. 585-588, Toulouse.
S. S. Kajarekar (2005),
Four Weightings and a Fusion: A Cepstral-SVM System for Speaker Recognition.
Proc. IEEE Speech Recognition and Understanding Workshop,
pp. 17-22, San Juan, Puerto Rico.
A. O. Hatch, A. Stolcke, & B. Peskin (2005),
Combining Feature Sets with Support Vector Machines:
Application to Speaker Recognition.
Proc. IEEE Speech Recognition and Understanding Workshop,
pp. 75-79, San Juan, Puerto Rico.
E. Shriberg, L. Ferrer, S. Kajarekar, A. Venkataraman, & A. Stolcke (2005),
Modeling Prosodic Feature Sequences for Speaker Recognition.
Speech Communication 46(3-4), 455-472.
Special Issues on Quantitative Prosody Modelling for Natural Speech
Description and Generation.
A. Stolcke, L. Ferrer, S. Kajarekar, E. Shriberg, & A. Venkataraman (2005),
MLLR Transforms as Features in Speaker Recognition.
Proc. Eurospeech, Lisbon, pp. 2425-2428.
L. Ferrer, K. Sonmez, & S. Kajarekar (2005),
Class-dependent Score Combination for Speaker Recognition.
Proc. Eurospeech, Lisbon, pp. 2173-2176.
S. S. Kajarekar, L. Ferrer, E. Shriberg, K. Sonmez, A. Stolcke,
A. Venkataraman, & J. Zheng (2005),
SRI's 2004 NIST Speaker Recognition Evaluation System,
Proc. IEEE ICASSP, Philadelphia, vol. 1, pp. 173-176.
A. O. Hatch, B. Peskin, & A. Stolcke (2005),
Improved Phonetic Speaker Recognition Using Lattice Decoding,
Proc. IEEE ICASSP, Philadelphia, vol. 1, pp. 169-172.
E. Shriberg, L. Ferrer, A. Venkataraman, & S. Kajarekar (2004),
SVM Modeling of ``SNERF-Grams'' for Speaker Recognition.
Proc. Intl. Conf. on Spoken Language Processing,
pp. 1409-1412, Jeju, Korea.
S. Kajarekar, L. Ferrer, K. Sonmez, J. Zheng, E. Shriberg,
& A. Stolcke (2004),
Modeling NERFs for Speaker Recognition.
Proc. Odyssey 04 Speaker and Language Recognition Workshop,
pp. 51-56, Toledo, Spain.
S. Kajarekar, L. Ferrer, A. Venkataraman, K. Sonmez, E. Shriberg, A. Stolcke,
& R. R. Gadde (2003),
Speaker Recognition using Prosodic and Lexical Features.
Proc. IEEE Speech Recognition and Understanding Workshop,
pp. 19-24, St. Thomas, U.S. Virgin Islands.
L. Ferrer, H. Bratt, V. R. R. Gadde, S. Kajarekar, E. Shriberg, K. Sonmez,
A. Stolcke, & A. Venkataraman (2003),
Modeling Duration Patterns for Speaker Recognition.
Proc. Eurospeech,
pp. 2017-2020, Geneva.
S. Kajarekar, K. Sonmez, L. Ferrer, V. Gadde, A. Venkataraman, E. Shriberg,
A. Stolcke, & H. Bratt (2003),
"TalkPrinting": Improving Speaker Recognition by Modeling Stylistic Features,
Intelligence and Security Informatics.
First NSF/NIJ Symposium, ISI 2003,
Springer Lecture Notes in Computer Science Series,
Volume 2665,
H. Chen, R. Miranda, D.D. Zeng, C. Demchak, J. Schroeder, T. Madhusudan,
editors, pp. 350-354.
© 2003 Springer-Verlag.
K. Sonmez, L. Heck, & M. Weintraub (2000),
Multiple Speaker Tracking and Detection: Handset Normalization and Duration
Scoring,
Digital Signal Processing, 10(1/2/3), 133-143.
L. P. Heck, Y. Konig, M. K. Sonmez, & M. Weintraub (2000),
Robustness to Telephone Handset Distortion in Speaker Recognition by
Discriminative Feature Design,
Speech Communication, 31(2-3), 181-192.
K. Sonmez, L. Heck & M. Weintraub (1999),
Speaker Tracking and Detection with Multiple Speakers,
Proc. Eurospeech, vol. 5, pp. 2219-2222, Budapest.
H. Murthy, F. Beaufays, L. P. Heck, & M. Weintraub (1999),
Robust Text-Independent Speaker Identification over Telephone Channels,
IEEE Trans. on Speech and Audio Processing 7(5), 554-568.
K. Sonmez, E. Shriberg, L. Heck & M. Weintraub (1998),
Dynamic Prosodic Variation for Speaker Verification,
Proc. Intl. Conf. on Spoken Language Processing,
vol. 7, pp. 3189-3192, Sydney.
Y. Konig, L. Heck, M. Weintraub, & K. Sonmez, (1998),
Discriminant Feature Extraction for Robust Text-Independent Speaker
Recognition,
Proc. RLA2C-ESCA Speaker Recognition and its Commercial and Forensic
Applications,
pp. 72-75, Avignon, France.
L. Heck & Y. Konig (1998),
Training of Minimum Cost Speaker Verification Systems,
Proc. RLA2C-ESCA Speaker Recognition and its Commercial and Forensic
Applications,
pp. 93-96, Avignon, France.
K. Sonmez, L. Heck, M. Weintraub & E. Shriberg (1997),
A lognormal tied mixture model of pitch for prosody-based speaker recognition,
Proc. Eurospeech, vol. 3, pp. 1391-1394, Rhodes, Greece.
L. Julia, L. P. Heck, & A. Cheyer (1997),
A Speaker Identification Agent,
Proc. AVBPA'97, Crans Montana, Switzerland.
L. P. Heck & M. Weintraub (1997),
Handset-Dependent Background Models for Robust Text-Independent Speaker
Recognition,
Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing,
vol. 2, pp. 1071-1074, Munich.
F. Beaufays & M. Weintraub (1997),
Model Transformation for Robust Speaker Recognition from Telephone Data,
Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing,
vol. 2, pp. 1063-1066, Munich.
L. P. Heck & J. H. McClellan (1993),
Subspace Techniques for Large-Scale Feature Selection,
Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing,
vol. 4, pp. 17-20, Minneapolis.
D. A. Reynolds & L. P. Heck (1991),
Integration of Speaker and Speech Recognition Systems,
Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing,
pp. 869-872, Toronto.
L. Ferrer, M. Graciarena, S. Kajarekar, N. Scheffer, E. Shriberg, &
A. Stolcke,
The SRI NIST SRE10 Speaker Verification System,
NIST Speaker Recognition Evaluation Workshop, June 24, 2010,
Brno, Czech Republic.
A. Stolcke, lectures given at the
Winter School in Speech and Audio Processing (WiSSAP'09),
IIT Kanpur, India, January 9-12, 2009:
Higher-Level Features for Speaker Recognition
Phonetic Speaker Recognition
MLLR Transform and Constrained Cepstral Modeling
A. Stolcke,
Machine Learning for Speaker Recognition,
NIPS Workshop on
Speech and Language: Learning-based Methods and Systems,
Dec. 12, 2008, Whistler, B.C.
M. Graciarena, S. Kajarekar, N. Scheffer, E. Shriberg, A. Stolcke,
L. Ferrer, & T. Bocklet,
The SRI NIST SRE08 Speaker Verification System,
NIST Speaker Recognition Evaluation Workshop, June 16, 2008, Montreal.
S. Kajarekar, L. Ferrer, M. Graciarena, E. Shriberg, K. Sönmez,
A. Stolcke, G. Tur, & A. Venkataraman,
SRI’s NIST 2006 Speaker Recognition Evaluation System,
NIST Speaker Recognition Evaluation Workshop, June 2006, San Juan, PR.