
Extraction of multi-word expressions from small parallel corpora

Published online by Cambridge University Press: 21 March 2012

YULIA TSVETKOV
Affiliation:
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA; e-mail: yulia.tsvetkov@gmail.com
SHULY WINTNER
Affiliation:
Department of Computer Science, University of Haifa, Haifa, Israel; e-mail: shuly@cs.haifa.ac.il

Abstract

We present a general, novel methodology for extracting multi-word expressions (MWEs) of various types, along with their translations, from small, word-aligned parallel corpora. Unlike existing approaches, we focus on misalignments; these typically indicate expressions in the source language that are translated to the target in a non-compositional way. We introduce a simple algorithm that proposes MWE candidates based on such misalignments, relying on 1:1 alignments as anchors that delimit the search space. We use a large monolingual corpus to rank and filter these candidates. Evaluation of the quality of the extraction algorithm reveals significant improvements over naïve alignment-based methods. The extracted MWEs, with their translations, are used in the training of a statistical machine translation system, showing a small but significant improvement in its performance.
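The idea of using 1:1 alignments as anchors and treating the material between them as MWE candidates can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `(source_index, target_index)` link format, the minimum span length, and the choice of pointwise mutual information as the monolingual ranking score are all assumptions made for illustration, and the example sentence and counts are invented.

```python
import math
from collections import Counter


def one_to_one_anchors(links):
    """Return source positions that take part in a strictly 1:1 link."""
    src_degree = Counter(i for i, _ in links)
    tgt_degree = Counter(j for _, j in links)
    return {i for i, j in links if src_degree[i] == 1 and tgt_degree[j] == 1}


def misaligned_spans(src_tokens, links, min_len=2):
    """Collect maximal runs of source tokens delimited by 1:1 anchors.

    Tokens that are unaligned or aligned many-to-one are treated as
    misaligned; runs shorter than min_len (an assumed cutoff) are dropped.
    """
    anchors = one_to_one_anchors(links)
    spans, current = [], []
    for i, token in enumerate(src_tokens):
        if i in anchors:
            if len(current) >= min_len:
                spans.append(tuple(current))
            current = []
        else:
            current.append(token)
    if len(current) >= min_len:
        spans.append(tuple(current))
    return spans


def pmi(bigram, unigram_counts, bigram_counts, total):
    """Pointwise mutual information of a two-word candidate.

    Counts are assumed to come from a large monolingual corpus with
    `total` tokens; unseen candidates score negative infinity.
    """
    w1, w2 = bigram
    joint = bigram_counts.get(bigram, 0)
    c1, c2 = unigram_counts.get(w1, 0), unigram_counts.get(w2, 0)
    if joint == 0 or c1 == 0 or c2 == 0:
        return float("-inf")
    return math.log2(joint * total / (c1 * c2))


# Hypothetical example: three source words all map to one target word.
src = ["he", "kicked", "the", "bucket", "yesterday"]
links = {(0, 0), (1, 1), (2, 1), (3, 1), (4, 2)}
candidates = misaligned_spans(src, links)
```

In this toy sentence only "he" and "yesterday" are 1:1 anchors, so the span between them is proposed as a single candidate. A real pipeline would then score each candidate against monolingual counts and discard those below a threshold.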

Type
Articles
Copyright
Copyright © Cambridge University Press 2012

