A classification approach for detecting cross-lingual biomedical term translations

H. HAKAMI; D. BOLLEGALA

doi:10.1017/S1351324915000431

Published online by Cambridge University Press: 14 December 2015

H. HAKAMI: Computer Science Department, Taif University, Saudi Arabia. E-mail: hoda.h@tu.edu.sa
D. BOLLEGALA: Department of Computer Science, The University of Liverpool, UK. E-mail: danushka.bollegala@liverpool.ac.uk

Abstract

Finding translations for technical terms is an important problem in machine translation. In highly specialized domains such as biology or medicine in particular, it is difficult to find bilingual experts to annotate enough cross-lingual text to train machine translation systems. Moreover, new terms are constantly being coined in the biomedical community, which makes it difficult to keep translation dictionaries up to date for all language pairs of interest. Given a biomedical term in one language (the source language), we propose a method for detecting its translations in a different language (the target language). Specifically, we train a binary classifier to determine whether two biomedical terms written in two languages are translations of each other. Training such a classifier is often complicated by the lack of common features between the source and target languages. We propose several feature space concatenation methods to overcome this problem. Moreover, we study the effectiveness of contextual and character n-gram features for detecting term translations. Experiments conducted on a standard dataset for biomedical term translation show that the proposed method outperforms several competitive baseline methods in terms of mean average precision and top-k translation accuracy.
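To make the ingredients named in the abstract concrete, the following Python sketch illustrates character n-gram features, one possible way of concatenating source and target feature spaces, and the two evaluation measures (mean average precision and top-k accuracy). The abstract does not specify the concatenation methods, so the `src:`/`tgt:`/`both:` prefixing scheme, all function names, and the n-gram range are assumptions made for illustration only, not the authors' implementation.

```python
from collections import Counter


def char_ngrams(term, n_min=2, n_max=3):
    """Count character n-grams of a term, padded with boundary markers."""
    padded = f"^{term.lower()}$"
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def pair_features(source_term, target_term):
    """Build one feature dict for a (source, target) term pair.

    Assumption (hypothetical scheme): each side's n-grams get a side
    prefix so the two languages occupy disjoint dimensions; n-grams
    that occur on both sides additionally fire a shared 'both:' feature.
    """
    src = char_ngrams(source_term)
    tgt = char_ngrams(target_term)
    features = {}
    for gram, count in src.items():
        features[f"src:{gram}"] = count
    for gram, count in tgt.items():
        features[f"tgt:{gram}"] = count
    for gram in src.keys() & tgt.keys():
        features[f"both:{gram}"] = min(src[gram], tgt[gram])
    return features


def average_precision(ranked, gold):
    """Average precision of a ranked candidate list against a gold set."""
    hits, total = 0, 0.0
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            hits += 1
            total += hits / rank
    return total / len(gold) if gold else 0.0


def mean_average_precision(rankings):
    """Mean of average precision over (ranked_candidates, gold_set) pairs."""
    if not rankings:
        return 0.0
    return sum(average_precision(r, g) for r, g in rankings) / len(rankings)


def top_k_accuracy(rankings, k):
    """Fraction of source terms with a gold translation in the top k."""
    if not rankings:
        return 0.0
    correct = sum(1 for ranked, gold in rankings
                  if any(c in gold for c in ranked[:k]))
    return correct / len(rankings)
```

The feature dictionaries returned by `pair_features` could be fed to any binary classifier that accepts sparse features; ranking the target candidates for each source term by the classifier's score would then yield the ranked lists that the two metric functions expect.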

Type: Articles
Information: Natural Language Engineering, Volume 23, Issue 1, January 2017, pp. 31–51

DOI: https://doi.org/10.1017/S1351324915000431
Copyright: Copyright © Cambridge University Press 2015
