Exploiting parallel texts in the creation of multilingual semantically annotated resources: the MultiSemCor Corpus

L. BENTIVOGLI; E. PIANTA

doi:10.1017/S1351324905003839

Abstract

In this article we illustrate and evaluate an approach to creating high-quality linguistically annotated resources by exploiting aligned parallel corpora. The approach rests on the assumption that if a text in one language has been annotated and its translation has not, the annotations can be transferred from the source text to the target text using word alignment as a bridge. The transfer approach has been tested and extensively applied in the creation of the MultiSemCor corpus, an English/Italian parallel corpus built on the English SemCor corpus. In MultiSemCor the texts are aligned at the word level and annotated with word senses drawn from a shared sense inventory. A number of experiments have been carried out to evaluate the different steps of the methodology, and the results suggest that the transfer approach is a promising solution to the resource bottleneck. First, it leads to the creation of a parallel corpus, which is a crucial resource in its own right. Second, it allows existing (mostly English) annotated resources to be used to bootstrap the creation of annotated corpora in new (resource-poor) languages with greatly reduced human effort.
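
The transfer idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Token` class, the `transfer_annotations` function, the representation of the alignment as index pairs, and the sense labels are all hypothetical, chosen only to show how a word alignment can act as a bridge between an annotated source text and its unannotated translation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Token:
    form: str
    # Sense identifier from the shared inventory; None if unannotated.
    # The labels used below are made-up placeholders, not real sense keys.
    sense: Optional[str] = None


def transfer_annotations(
    source: List[Token],
    target: List[Token],
    alignment: List[Tuple[int, int]],
) -> List[Token]:
    """Return a copy of `target` with senses copied from aligned source tokens.

    `alignment` is assumed to be a list of (source_index, target_index) pairs.
    Target tokens with no aligned, annotated source token stay unannotated.
    """
    result = [Token(t.form, t.sense) for t in target]
    for src_i, tgt_i in alignment:
        sense = source[src_i].sense
        # Only fill in a sense if the source has one and the target has none yet.
        if sense is not None and result[tgt_i].sense is None:
            result[tgt_i].sense = sense
    return result


# Toy English/Italian sentence pair with a one-to-one alignment (illustrative only).
english = [Token("the"), Token("dog", "sense:dog-1"), Token("barks", "sense:bark-1")]
italian = [Token("il"), Token("cane"), Token("abbaia")]
alignment = [(0, 0), (1, 1), (2, 2)]

annotated = transfer_annotations(english, italian, alignment)
for token in annotated:
    print(token.form, token.sense)
```

In this sketch, function words without a sense label are simply left unannotated on the target side. A real pipeline would also have to handle unaligned words, one-to-many alignments, and alignment errors, which the sketch does not attempt.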
