The Kestrel TTS text normalization system

PETER EBDEN; RICHARD SPROAT

doi:10.1017/S1351324914000175

Published online by Cambridge University Press: 12 December 2014

PETER EBDEN: Google, Inc (now at Thought Machine), London, UK. Email: pebden@google.com
RICHARD SPROAT: Google, Inc, New York, USA. Email: rws@google.com

Abstract

This paper describes the Kestrel text normalization system, a component of the Google text-to-speech synthesis (TTS) system. At the core of Kestrel are text-normalization grammars that are compiled into libraries of weighted finite-state transducers (WFSTs). While the use of WFSTs for text normalization is itself not new, Kestrel differs from previous systems in its separation of the initial tokenization and classification phase of analysis from verbalization. Input text is first tokenized, and the tokens are classified, using WFSTs. As part of the classification, detected semiotic classes (expressions such as currency amounts, dates, times, and measure phrases) are parsed into protocol buffers (https://code.google.com/p/protobuf/). The protocol buffers are then verbalized, with possible reordering of the elements, again using WFSTs. This paper describes the architecture of Kestrel and the protocol buffer representations of semiotic classes, and presents some examples of grammars for various languages. We also discuss applications and deployments of Kestrel as part of the Google TTS system, which runs on both server and client side on multiple devices, and is used daily by millions of people in nineteen languages and counting.
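The two-phase separation described above can be illustrated with a minimal sketch. It is not Kestrel's implementation: a regular expression stands in for the WFST classifier, a Python dataclass stands in for the protocol buffer message, and every name in it (`Money`, `Word`, `tokenize_and_classify`, `verbalize`, `normalize`) is hypothetical. It handles only whole-dollar amounts below one hundred.

```python
import re
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Money:
    """Hypothetical stand-in for a protocol buffer describing a currency amount."""
    currency: str
    whole: int


@dataclass
class Word:
    """An ordinary token that is passed through unchanged."""
    text: str


Token = Union[Money, Word]

# Assumption: a regex replaces the compiled WFST classification grammar.
MONEY_RE = re.compile(r"^\$(\d+)$")

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_name(n: int) -> str:
    """Spell out an integer in the range 0..99 in English."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def tokenize_and_classify(text: str) -> List[Token]:
    """Phase 1: split on whitespace and parse semiotic classes into records."""
    tokens: List[Token] = []
    for raw in text.split():
        match = MONEY_RE.match(raw)
        if match:
            tokens.append(Money(currency="usd", whole=int(match.group(1))))
        else:
            tokens.append(Word(raw))
    return tokens


def verbalize(token: Token) -> str:
    """Phase 2: turn a structured record into words."""
    if isinstance(token, Money):
        unit = "dollar" if token.whole == 1 else "dollars"
        # Reordering: "$" is written before the number but spoken after it.
        return f"{number_name(token.whole)} {unit}"
    return token.text


def normalize(text: str) -> str:
    """Run both phases and join the verbalized tokens."""
    return " ".join(verbalize(t) for t in tokenize_and_classify(text))


print(normalize("It costs $45 today"))  # prints "It costs forty five dollars today"
```

Because the intermediate record carries only the parsed fields, the verbalization step in this sketch could be swapped for another language's rules without touching the classifier, which is the kind of decoupling the abstract attributes to the separation of the two phases.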

Type: Articles
Information: Natural Language Engineering, Volume 21, Issue 3, May 2015, pp. 333–353

DOI: https://doi.org/10.1017/S1351324914000175
Copyright: Copyright © Cambridge University Press 2014
