A fast and flexible architecture for very large word n-gram datasets

MICHAEL FLOR

doi:10.1017/S1351324911000349

A fast and flexible architecture for very large word n-gram datasets

Published online by Cambridge University Press: 10 January 2012

MICHAEL FLOR

Show author details

MICHAEL FLOR*: Affiliation:
NLP and Speech Group, Educational Testing Service, Princeton, NJ 08541, USA e-mail: mflor@ets.org

Article contents

Abstract
References

Get access

Rights & Permissions

Abstract

This paper presents TrendStream, a versatile architecture for very large word n-gram datasets. Designed for speed, flexibility, and portability, TrendStream uses a novel trie-based architecture, features lossless compression, and provides optimization for both speed and memory use. In addition to literal queries, it also supports fast pattern matching searches (with wildcards or regular expressions), on the same data structure, without any additional indexing. Language models are updateable directly in the compiled binary format, allowing rapid encoding of existing tabulated collections, incremental generation of n-gram models from streaming text, and merging of encoded compiled files. This architecture offers flexible choices for loading and memory utilization: fast memory-mapping of a multi-gigabyte model, or on-demand partial data loading with very modest memory requirements. The implemented system runs successfully on several different platforms, under different operating systems, even when the n-gram model file is much larger than available memory. Experimental evaluation results are presented with the Google Web1T collection and the Gigaword corpus.

Type: Articles
Information: Natural Language Engineering , Volume 19 , Issue 1 , January 2013 , pp. 61 - 93

DOI: https://doi.org/10.1017/S1351324911000349 [Opens in a new window]
Copyright: Copyright © Cambridge University Press 2012

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

References

Bentley, J. L. and Sedgewick, R. 1997. Fast algorithms for sorting and searching strings. In Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'97), New Orleans, LA, USA, pp. 360–9. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics.Google Scholar

Bentley, J. L. and Sedgewick, R. 1998. Ternary search trees. Dr Dobb's Journal, April 01, http://drdobbs.com/windows/184410528. Accessed 8 April 2011.Google Scholar

Bergsma, S., Lin, D. and Goebel, R. 2009. Web-scale n-gram models for lexical disambiguation. In Proceedings of 2009 International Joint Conference on Artificial Intelligence (IJCAI 2009), Pasadena, CA, USA, pp. 1507–12. Palo Alto, CA: AAAI Press.Google Scholar

Bloom, B. H. 1970. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 13 (7): 422–6.CrossRef Google Scholar

Brants, T. and Franz, A. 2006. Web 1T 5-gram Version 1. Philadelphia, PA, USA: Linguistic Data Consortium. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13 Google Scholar

Brants, T., Popat, A. C., Xu, P., Och, F. J. and Dean, J. 2007. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing (EMNLP-CoNLL 2007), Prague, Czech Republic, pp. 858–67. Stroudsburg, PA: Association for Computational Linguistics.Google Scholar

Carlson, A. and Fette, I. 2007. Memory-based context-sensitive spelling correction at web scale. In Proceedings of the Sixth International Conference on Machine Learning and Applications (ICMLA '07), Cincinnati, OH, USA, pp. 166–71. Los Alamitos, CA, USA: IEEE Computer Society Press.Google Scholar

Ceylan, H. and Mihalcea, R. 2011. An efficient indexer for large n-gram corpora. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), System Demonstrations, Portland, OR, USA, pp. 103–8. Stroudsburg, PA: Association for Computational Linguistics.Google Scholar

Chazelle, B., Kilian, J., Rubinfeld, R., and Tal, A. 2004. The Bloomier filter: an efficient data structure for static support lookup tables. Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2004), New Orleans, LA, USA, pp. 30–9. Philadelphia, PA: Society for Industrial and Applied Mathematics.Google Scholar

Church, K., Hart, T. and Jianfeng, G. 2007, Compressing trigram language models with Golomb coding. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), Prague, Czech Republic, pp. 199–207. Stroudsburg, PA: Association for Computational Linguistics.Google Scholar

Emami, A., Papineni, K. and Sorensen, J. 2007. Large-scale distributed language modeling. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP-2007, Honolulu, HI, USA, vol. 4, pp. 37–40. Piscataway, NJ: Institute of Electrical and Electronics Engineers.Google Scholar

Evert, S. 2010. Google Web 1T 5-grams made easy (but not for the computer). In Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop (WAC-6), Los Angeles, CA, USA, pp. 32–40. Stroudsburg, PA: Association for Computational Linguistics.Google Scholar

Federico, M. and Cettolo, M. 2007. Efficient handling of n-gram language models for statistical machine translation. In Proceedings of the ACL 2007 Workshop on Statistical Machine Translation, Prague, Czech Republic, pp. 88–95. Stroudsburg, PA: Association for Computational Linguistics.Google Scholar

Fredkin, E. 1960. Trie memory. Communications of the ACM 3 (9): 490–9.CrossRef Google Scholar

Futagi, Y. 2010. The effects of learner errors on the development of a collocation detection tool. In Proceedings of the Fourth Workshop on Analytics for Noisy Unstructured Text Data (AND '10), Toronto, Canada, pp. 27–34. New York, NY: Association for Computing Machinery.CrossRef Google Scholar

Germann, U., Joanis, E. and Larkin, S. 2009. Tightly packed tries: how to fit large models into memory, and make them load fast, too. In Proceedings of the NAACL HLT Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing (SETQA-NLP 2009), Boulder, CO, USA, pp. 31–9. Stroudsburg, PA: Association for Computational Linguistics.Google Scholar

Giuliano, C. 2007. jWeb1T: a library for searching the Web 1T 5-gram corpus. Software available at http://hlt.fbk.eu/en/technology/jWeb1t. Accessed 8 April 2011.Google Scholar

Graff, D. and Cieri, C. 2003. English Gigaword. Philadelphia, PA, USA: Linguistic Data Consortium. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05 Google Scholar

Guthrie, D. and Hepple, M. 2010a. Minimal perfect hash rank: compact storage of large n-gram language models. In Proceedings of SIGIR 2010 Web N-gram Workshop, Geneva, Switzerland, pp. 21–9. New York, NY: Association for Computing Machinery.Google Scholar

Guthrie, D. and Hepple, M. 2010b. Storing the Web in memory: space efficient language models with constant time retrieval. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP-2010), Boston, MA, USA, pp. 262–72. Stroudsburg, PA: Association for Computational Linguistics.Google Scholar

Harb, B., Chelba, C., Dean, J. and Ghemawat, S. 2009. Back-off language model compression. In Proceedings of 10th Annual Conference of the International Speech Communication Association (Interspeech 2009), Brighton, UK, pp. 325–55. International Speech Communication Association, ISCA archive http://www.isca-speech.org/archive/interspeech_2009. Accessed 8 April 2011.Google Scholar

Hawker, T., Gardiner, M. and Bennetts, A. 2007. Practical queries of a massive n-gram database. In Proceedings of the Australasian Language Technology Workshop 2007 (ALTW 2007), Melbourne, Australia, pp. 40–8. Australasian Language Technology Association.Google Scholar

Heafield, K. 2011. KenLM: faster and smaller language model queries. In Proceedings of the 6th Workshop on Statistical Machine Translation, pp. 187–97, Edinburgh, Scotland, UK. Stroudsburg, PA: Association for Computational Linguistics.Google Scholar

Islam, A. and Inkpen, D. 2009a. Managing the Google Web 1T 5-gram data set. In Proceedings of International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE 2009), Dalian, China, pp. 1–5. Piscataway, NJ: Institute of Electrical and Electronics Engineers.Google Scholar

Islam, A. and Inkpen, D. 2009b. Real-word spelling correction using Google Web 1T n-gram data set. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM 2009), Hong Kong, pp. 1689–92. New York, NY: Association for Computing Machinery.Google Scholar

Islam, A. and Inkpen, D. 2009c. Real-word spelling correction using Google Web 1T n-gram with backoff. In Proceedings of the IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE'09), Dalian, China, pp. 1–8. Piscataway, NJ: Institute of Electrical and Electronics Engineers.Google Scholar

Lapata, M. and Keller, F. 2005. Web-based models for natural language processing. ACM Transactions on Speech and Language Processing 2 (1): 1–31.CrossRef Google Scholar

Levenberg, A. and Osborne, M. 2009. Stream-based randomised language models for SMT. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP-2009), Singapore, pp. 756–64. Stroudsburg, PA: Association for Computational Linguistics.Google Scholar

Manning, C. and Schütze, H. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.Google Scholar

Pauls, A. and Klein, D. 2011. Faster and smaller n-gram language models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011), Portland, OR, USA, pp. 258–67. Stroudsburg, PA: Association for Computational Linguistics.Google Scholar

Raj, B. and Whittaker, E. W. D. 2003. Lossless compression of language model structure and word identifiers. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'03), Hong Kong. USA, vol. 1, pp. 388–99. Piscataway, NJ: Institute of Electrical and Electronics Engineers.Google Scholar

Ravishankar, M. 1996. Efficient Algorithms for Speech Recognition. PhD thesis, Technical Report. CMU-CS-96-143, Carnegie Mellon University, Pittsburgh, PA, USA.Google Scholar

Sekine, S. 2008. Linguistic knowledge discovery tool: very large n-gram database search with arbitrary wildcards. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING-08), Manchester, UK, pp. 181–4. Stroudsburg, PA: Association for Computational Linguistics.Google Scholar

Sekine, S. and Dalwani, K., 2010. N-gram search engine with patterns combining token, POS, chunk and NE information. In Proceedings of Language Resource and Evaluation Conference (LREC-2010), Valletta, Malta, pp. 2682–6. Paris, France: European Language Resources Association.Google Scholar

Stolcke, A. 2002. SRILM – An extensible language modeling toolkit. In Proceedings of 7th International Conference on Spoken Language Processing (ICSLP2002 – INTERSPEECH 2002), Denver, CO, USA, pp. 901–4. International Speech Communication Association, ISCA archive, http://www.isca-speech.org/archive/icslp02 Google Scholar

Talbot, D. 2009. Succinct approximate counting of skewed data. In Proceedings of International Joint Conference on Artificial Intelligence (IJCAI 2009), Pasadena, CA, USA, pp. 1243–8. Palo Alto, CA: AAAI Press.Google Scholar

Talbot, D. and Brants, T. 2008. Randomized language models via perfect hash functions. In Proceedings of 46th Annual Meeting of the Association for Computational Linguistics and Human Language Technology Conference (ACL-08: HLT), Columbus, OH, USA, pp. 505–13. Stroudsburg, PA: Association for Computational Linguistics.Google Scholar

Talbot, D. and Osborne, M. 2007a. Randomised language modelling for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL 2007), Prague, Czech Republic, pp. 512–19. Stroudsburg, PA: Association for Computational Linguistics.Google Scholar

Talbot, D. and Osborne, M. 2007b. Smoothed Bloom filter language models: tera-scale LMs on the cheap. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), Prague, Czech Republic, pp. 468–76. Stroudsburg, PA: Association for Computational Linguistics.Google Scholar

Van Durme, B., and Lall, A. (2009). Probabilistic counting with randomized storage. In Proceedings of International Joint Conference on Artificial Intelligence (IJCAI 2009), Pasadena, CA, USA, pp. 1574–9. Palo Alto, CA: AAAI Press.Google Scholar

Wang, K. and Li, X. 2009. Efficacy of a constantly adaptive language modeling technique for web-scale applications. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'2009), Taipei, Taiwan, pp. 4733–6. Piscataway, NJ: Institute of Electrical and Electronics Engineers.Google Scholar

Watanabe, T., Tsukada, H. and Isozaki, H. 2009. A succinct n-gram language model. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2009), Short Papers, Suntec, Singapore, pp. 341–4. Stroudsburg, PA: Association for Computational Linguistics.Google Scholar

Whittaker, E. W. D. and Raj, B. 2001. Quantization-based language model compression. In Proceedings of 7th European Conference on Speech Communication and Technology (EUROSPEECH'01), Aalborg, Denmark, pp. 33–6. International Speech Communication Association, ISCA archive, http://www.isca-speech.org/archive/eurospeech_2001 CrossRef Google Scholar

Yuret, D. 2008. Smoothing a tera-word language model. In Proceedings of 46th Annual Meeting of the Association for Computational Linguistics and Human Language Technology Conference (ACL-08: HLT), Columbus, OH, USA, pp. 141–4. Stroudsburg, PA: Association for Computational Linguistics. Software available at http://denizyuret.blogspot.com/2008/06/smoothing-tera-word-language-model.html Google Scholar

Zens, R. and Ney, H. 2007. Efficient phrase-table representation for machine translation with applications to online MT and speech translation. In Proceedings of The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT-2007), Rochester, NY, USA, pp. 492–9. Stroudsburg, PA: Association for Computational Linguistics.Google Scholar

Article contents

A fast and flexible architecture for very large word n-gram datasets

Abstract

Access options

References

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests