
Automatic bilingual lexicon acquisition using random indexing of parallel corpora

Published online by Cambridge University Press: 21 September 2005

M. SAHLGREN and J. KARLGREN
Affiliation:
Swedish Institute of Computer Science, SICS, Box 1263, SE-164 29 Kista, Sweden. E-mail: mange@sics.se, jussi@sics.se

Abstract

This paper presents a simple and effective approach to using parallel corpora for automatic bilingual lexicon acquisition. The approach uses the Random Indexing vector space methodology and finds correlations between terms from their distributional characteristics. It requires a minimum of preprocessing and linguistic knowledge, and is efficient, fast and scalable. We explain how our approach differs from traditional cooccurrence-based word alignment algorithms, and we demonstrate how to extract bilingual lexica by applying Random Indexing to aligned parallel data. The acquired lexica are evaluated by comparing them to manually compiled gold standards, and we report an overlap of around 60%. We also discuss methodological problems with evaluating lexical resources of this kind.
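
The abstract does not spell out the procedure, so the following is only a minimal sketch of how Random Indexing might be applied to aligned parallel data, under stated assumptions: each aligned sentence pair receives one sparse random index vector, every word in either language accumulates the index vectors of the pairs it occurs in, and a translation candidate is the target-language word whose accumulated vector has the highest cosine similarity to the source word's vector. The function names, the toy English-Swedish sentence pairs, and the settings `dim=512` and `nonzeros=8` are all illustrative assumptions, not the paper's implementation or parameters.

```python
import math
import random
from collections import defaultdict


def index_vector(dim, nonzeros, rng):
    """Sparse ternary random vector: a few +1/-1 entries, zeros elsewhere."""
    vec = [0] * dim
    for i, pos in enumerate(rng.sample(range(dim), nonzeros)):
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec


def build_context_vectors(aligned_pairs, dim=512, nonzeros=8, seed=0):
    """Accumulate one context vector per word in each language.

    Assumption: the 'context' of a word is the set of aligned sentence
    pairs it occurs in, each represented by its own random index vector.
    """
    rng = random.Random(seed)
    src = defaultdict(lambda: [0] * dim)
    tgt = defaultdict(lambda: [0] * dim)
    for src_sent, tgt_sent in aligned_pairs:
        iv = index_vector(dim, nonzeros, rng)
        for vectors, sentence in ((src, src_sent), (tgt, tgt_sent)):
            # Count each word once per aligned pair.
            for word in set(sentence.lower().split()):
                cv = vectors[word]
                for k, value in enumerate(iv):
                    cv[k] += value
    return src, tgt


def cosine(a, b):
    """Cosine similarity between two dense vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_translation(word, src, tgt):
    """Return the target word most similar to `word`, or None if unseen."""
    if word not in src:
        return None
    return max(tgt, key=lambda w: cosine(src[word], tgt[w]))


# Hypothetical toy data, invented for illustration only.
toy_pairs = [
    ("the house is red", "huset är rött"),
    ("the car is red", "bilen är röd"),
    ("the house is big", "huset är stort"),
    ("the car is fast", "bilen är snabb"),
]

src_vectors, tgt_vectors = build_context_vectors(toy_pairs)
for query in ("house", "car"):
    print(query, "->", best_translation(query, src_vectors, tgt_vectors))
```

In this toy setting, a source word and a target word that occur in exactly the same aligned pairs end up with identical vectors, which is why distributional similarity alone can pair them without any linguistic preprocessing. On real corpora the match is never this clean, so a ranked list of candidates per word would likely be more useful than a single best match.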

Type
Papers
Copyright
2005 Cambridge University Press
