Extraction of multi-word expressions from small parallel corpora

YULIA TSVETKOV; SHULY WINTNER

doi:10.1017/S1351324912000101

Published online by Cambridge University Press: 21 March 2012

YULIA TSVETKOV: Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA. E-mail: yulia.tsvetkov@gmail.com
SHULY WINTNER: Department of Computer Science, University of Haifa, Haifa, Israel. E-mail: shuly@cs.haifa.ac.il

Abstract

We present a general, novel methodology for extracting multi-word expressions (MWEs) of various types, along with their translations, from small, word-aligned parallel corpora. Unlike existing approaches, we focus on misalignments; these typically indicate expressions in the source language that are translated to the target in a non-compositional way. We introduce a simple algorithm that proposes MWE candidates based on such misalignments, relying on 1:1 alignments as anchors that delimit the search space. We use a large monolingual corpus to rank and filter these candidates. Evaluation of the quality of the extraction algorithm reveals significant improvements over naïve alignment-based methods. The extracted MWEs, with their translations, are used in the training of a statistical machine translation system, showing a small but significant improvement in its performance.
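
The candidate-proposal step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `misaligned_spans`, the representation of an alignment as a set of (source index, target index) pairs, the minimum span length of two tokens, and the toy sentence pair are all assumptions made for illustration. Source tokens that take part in a 1:1 alignment are treated as anchors, and each maximal run of non-anchor tokens between anchors is proposed as an MWE candidate. Ranking and filtering against a monolingual corpus are not shown.

```python
from collections import Counter
from typing import List, Set, Tuple


def misaligned_spans(
    src_tokens: List[str],
    alignment: Set[Tuple[int, int]],
    min_len: int = 2,
) -> List[Tuple[str, ...]]:
    """Propose MWE candidates from one word-aligned sentence pair.

    `alignment` holds (source_index, target_index) links. A source token is
    an anchor when its only link goes to a target token that has no other
    link (a 1:1 alignment). Maximal runs of non-anchor source tokens with at
    least `min_len` tokens are returned as candidates.
    """
    src_links = Counter(i for i, _ in alignment)
    tgt_links = Counter(j for _, j in alignment)
    anchors = {
        i for i, j in alignment
        if src_links[i] == 1 and tgt_links[j] == 1
    }

    candidates: List[Tuple[str, ...]] = []
    run: List[str] = []
    for i, token in enumerate(src_tokens):
        if i in anchors:
            # An anchor closes the current run of misaligned tokens.
            if len(run) >= min_len:
                candidates.append(tuple(run))
            run = []
        else:
            run.append(token)
    if len(run) >= min_len:
        candidates.append(tuple(run))
    return candidates


# Hypothetical toy example: "kicked the bucket" is linked many-to-many
# (or not at all), while "he" and "yesterday" are aligned 1:1.
src = ["he", "kicked", "the", "bucket", "yesterday"]
links = {(0, 0), (1, 1), (1, 2), (3, 2), (4, 3)}
print(misaligned_spans(src, links))
```

In this sketch an unaligned token and a token with one-to-many or many-to-one links are treated alike, as evidence of a misalignment; a fuller version would also collect the target-side tokens between the same anchors as the candidate's translation.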

Type: Articles
Information: Natural Language Engineering, Volume 18, Issue 4, October 2012, pp. 549–573

DOI: https://doi.org/10.1017/S1351324912000101
Copyright: © Cambridge University Press 2012
