A unified alignment algorithm for bilingual data

CHRISTOPH TILLMANN; SANJIKA HEWAVITHARANA

doi:10.1017/S135132491100026X


Published online by Cambridge University Press: 13 September 2011


CHRISTOPH TILLMANN: IBM T.J. Watson Research Center, Yorktown Heights, New York, NY 10598, USA. Email: ctill@us.ibm.com
SANJIKA HEWAVITHARANA: Carnegie Mellon University, Pittsburgh, PA 15213, USA. Email: sanjika@cs.cmu.edu


Abstract

The paper presents a novel unified algorithm for aligning sentences with their translations in bilingual data. With the help of ideas from a stack-based dynamic programming decoder for speech recognition (Ney 1984), the search is parametrized in a novel way such that the unified algorithm can be used on various types of data that have been previously handled by separate implementations: the extracted text chunk pairs can be either sub-sentential pairs, one-to-one, or many-to-many sentence-level pairs. The one-stage search algorithm is carried out in a single run over the data. Its memory requirements are independent of the length of the source document, and it is applicable to sentence-level parallel as well as comparable data. With the help of a unified beam-search candidate pruning, the algorithm is very efficient: it avoids any document-level pre-filtering and uses less restrictive sentence-level filtering. Results are presented on a Russian–English, a Spanish–English, and an Arabic–English extraction task. Based on simple word-based scoring features, text chunk pairs are extracted out of several trillion candidates, where the search is carried out on 300 processors in parallel.
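The pruning idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' algorithm, which uses a one-stage dynamic programming search; it only shows threshold-style beam pruning over candidate sentence pairs scored with a simple word-based feature. The lexicon, the `coverage_score` feature, and the parameters `beam`, `max_keep`, and `min_score` are all hypothetical choices made for illustration.

```python
from typing import Dict, Iterator, List, Set, Tuple

# Hypothetical toy lexicon: source word -> set of acceptable target words.
Lexicon = Dict[str, Set[str]]


def coverage_score(src: List[str], tgt: List[str], lexicon: Lexicon) -> float:
    """Fraction of source words with at least one lexicon translation in tgt."""
    if not src:
        return 0.0
    tgt_words = set(tgt)
    covered = sum(1 for w in src if lexicon.get(w, set()) & tgt_words)
    return covered / len(src)


def beam_prune(
    scored: List[Tuple[float, int]], beam: float, max_keep: int
) -> List[Tuple[float, int]]:
    """Keep candidates within `beam` of the best score, at most `max_keep`."""
    if not scored:
        return []
    # Sort by descending score, breaking ties by candidate index.
    ranked = sorted(scored, key=lambda c: (-c[0], c[1]))
    best = ranked[0][0]
    return [c for c in ranked if c[0] >= best - beam][:max_keep]


def extract_pairs(
    sources: List[List[str]],
    targets: List[List[str]],
    lexicon: Lexicon,
    beam: float = 0.2,
    max_keep: int = 2,
    min_score: float = 0.5,
) -> Iterator[Tuple[int, int, float]]:
    """Yield (source_index, target_index, score) in one pass over the sources.

    Only the candidates for the current source sentence are held in memory,
    so the working set does not grow with the number of source sentences.
    """
    for i, src in enumerate(sources):
        scored = [
            (coverage_score(src, tgt, lexicon), j) for j, tgt in enumerate(targets)
        ]
        for score, j in beam_prune(scored, beam, max_keep):
            if score >= min_score:
                yield i, j, score


# Usage with made-up Spanish-English toy data.
lexicon = {"casa": {"house"}, "roja": {"red"}, "gato": {"cat"}}
sources = [["casa", "roja"], ["gato"]]
targets = [["the", "red", "house"], ["a", "cat"], ["red", "car"]]
pairs = list(extract_pairs(sources, targets, lexicon))
```

In this toy run each source sentence keeps only its best-scoring target, because the weaker candidates fall outside the beam. A real system would replace the exhaustive inner loop with a filtered candidate set and a richer scoring model.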

Type: Articles
Information: Natural Language Engineering, Volume 19, Issue 1, January 2013, pp. 33–60

DOI: https://doi.org/10.1017/S135132491100026X
Copyright: © Cambridge University Press 2011


References

Brown, P., Spohrer, J., Hochschild, P. and Baker, J. 1982. Partial traceback and dynamic programming. In Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP 82), Paris, France, pp. 1629–32.

Brown, P. F., Lai, J. C. and Mercer, R. L. 1991. Aligning sentences in parallel corpora. In Proceedings of ACL 91, Berkeley, CA, pp. 169–76.

Brown, P. F., Della Pietra, V. J., Della Pietra, S. A. and Mercer, R. L. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics 19 (2): 263–311.

Chen, S. F. 1993. Aligning sentences in bilingual corpora using lexical information. In Proceedings of ACL 93, June 16–17, Columbus, OH, pp. 9–16.

Deng, Y., Kumar, S. and Byrne, W. 2006. Segmentation and alignment of parallel text for statistical machine translation. Natural Language Engineering 12 (4): 1–26.

Fung, P. and Cheung, P. 2004. Mining very-non-parallel corpora: parallel sentence and lexicon extraction via bootstrapping and EM. In Proceedings of EMNLP 04, July 25–26, Barcelona, Spain, pp. 57–63.

Hewavitharana, S. and Vogel, S. 2011. Extracting parallel phrases from comparable data. In Proceedings of ACL Workshop on Building and Using Comparable Corpora, June 24, Portland, OR, pp. 61–8.

Gale, W. A. and Church, K. W. 1991. A program for aligning sentences in bilingual corpora. In Proceedings of ACL 91, June 18–21, Berkeley, CA, pp. 177–84.

Koehn, P., Och, F. J. and Marcu, D. 2003. Statistical phrase-based translation. In Proceedings of HLT-NAACL 03, May 27–June 1, Edmonton, Alberta, Canada, pp. 127–33.

Koehn, P. 2004. Pharaoh: a beam search decoder for phrase-based SMT models. In Proceedings of AMTA 04, September 28–October 2, Washington DC.

Ma, X. 2006. Champollion: a robust parallel text sentence aligner. In Proceedings of LREC 06, May 22–28, Genova, Italy, pp. 489–92.

Melamed, I. D. 1999. Bitext maps and alignment via pattern recognition. Computational Linguistics 25 (1): 107–30.

Mendonca, A., Graff, D. and DiPersio, D. 2009. Spanish Gigaword Corpus, 2nd ed., LDC catalog no. 2009T21. Philadelphia, PA: LDC.

Moore, R. C. 2002. Fast and accurate sentence alignment of bilingual data. In Proceedings of AMTA 05, Tiburon, CA, pp. 135–44.

Munteanu, D. S. and Marcu, D. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics 31 (4): 477–504.

Munteanu, D. S. and Marcu, D. 2006. Extracting parallel sub-sentential fragments from non-parallel corpora. In Proceedings of COLING/ACL 06, July 17–21, Sydney, Australia, pp. 81–8.

Ney, H. 1984. The use of a one-stage dynamic programming algorithm for connected word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 32 (2): 263–71.

Och, F.-J. and Ney, H. 2004. The alignment template approach to statistical machine translation. Computational Linguistics 30 (4): 417–50.

Och, F. J. et al. 2004. A smorgasbord of features for statistical machine translation. In Proceedings of the Joint HLT and NAACL Conference (HLT 04), May 2–7, Boston, MA, pp. 161–8.

Olive, J., Christianson, C. and McCary, J. (Editors). 2011. Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation. New York: Springer.

Ortmanns, S., Ney, H. and Eiden, A. 1996. Language-model look-ahead for large vocabulary speech recognition. In Proceedings of ICASSP 96, May 7–9, Atlanta, GA, pp. 2095–8.

Papineni, K., Roukos, S., Ward, T. and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL 02, July 7–12, Philadelphia, PA, pp. 311–18.

Parker, R., Graff, D., Kong, J., Chen, K. and Maeda, K. 2009. English Gigaword Corpus, 4th ed., LDC catalog no. 2009T13. Philadelphia, PA: LDC.

Pike, C. and Melamed, I. D. 2004. An automatic filter for non-parallel texts. In The Comp. Volume of the Proceedings of ACL 04, July 21–26, Barcelona, Spain, pp. 114–17.

Quirk, C., Udupa, R. and Menezes, A. 2007. Generative models of noisy translations with applications to parallel fragment extraction. In Proceedings of the MT Summit XI, September 10–14, Copenhagen, Denmark, pp. 321–7.

Resnik, P. and Smith, N. 2003. The web as parallel corpus. Computational Linguistics 29 (3): 349–80.

Snover, M., Dorr, B. and Schwartz, R. 2008. Language and translation model adaptation using comparable corpora. In Proceedings of EMNLP 08, October 25–27, Honolulu, HI, pp. 856–5.

Tillmann, C. and Xu, J.-M. 2009. A simple sentence-level extraction algorithm for comparable data. In Proceedings of HLT/NAACL 09, May 31–June 5, Boulder, CO, pp. 93–6.

Tillmann, C. and Zhang, T. 2007. A block bigram prediction model for statistical machine translation. ACM-TSLP 4 (6): 1–31 (July).

Tillmann, C. 2006. Efficient dynamic programming search algorithms for phrase-based SMT. In Proceedings of the Workshop CHPSLP at HLT 06, June 4–9, New York City, NY, pp. 9–16.

Tillmann, C. 2009. A beam-search extraction algorithm for comparable data. In Proceedings of the ACL-IJCNLP 2009 Conference, August 2–7, Suntec, Singapore, pp. 225–8.

Utiyama, M. and Isahara, H. 2003. Reliable measures for aligning Japanese–English news articles and sentences. In Proceedings of ACL 03, July 7–12, Sapporo, Japan, pp. 72–9.

Zhao, B. and Vogel, S. 2002. Adaptive parallel sentences mining from WebBilingualNewsCollection. In IEEE International Conference on Data Mining (ICDM 2002), December 2–12, Maebashi City, Japan, pp. 745–8.
