a1 Department of Information and Computer Science, Aalto University School of Science, P.O. Box 15400, FI-00076 Aalto, Finland. E-mails: sami.virpioja@tkk.fi, mari-sanna.paukkeri@tkk.fi, tiina.lindh-knuutila@tkk.fi, krista.lagus@tkk.fi
a2 Department of Computer Science, University of Helsinki, Finland, and Xerox Research Centre Europe (XRCE), 6, Chemin de Maupertuis, 38240 Meylan, France. E-mail: abhishektripathi.at@gmail.com
Abstract
Vector space models are used in language processing applications for calculating semantic similarities of words or documents. The vector spaces are generated with feature extraction methods for text data. However, evaluating feature extraction methods can be difficult. Indirect evaluation in an application is often time-consuming, and the results may not generalize to other applications, whereas direct evaluations that measure the amount of captured semantic information usually require human evaluators or annotated data sets. We propose a novel direct evaluation method based on canonical correlation analysis (CCA), the classical method for finding linear relationships between two data sets. In our setting, the two sets are parallel text documents in two languages. A good feature extraction method should provide representations that reflect the semantic content of the documents. Assuming that the underlying semantic content is independent of the language, we can identify the feature extraction methods that capture the content best by measuring the dependence between the representations of a document and its translation. In the case of CCA, the applied measure of dependence is correlation. The evaluation method is based on unsupervised learning, is language- and domain-independent, and requires no resources besides a parallel corpus. In this paper, we demonstrate the evaluation method on a sentence-aligned parallel corpus. The method is validated by showing that the results obtained with bag-of-words representations are intuitive and agree well with previous findings. Moreover, we compare the proposed evaluation method with indirect evaluation methods in simple sentence matching tasks and with a quantitative manual evaluation of word translations. The results of the proposed method correlate well with the results of the indirect and manual evaluations.
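The core idea of the abstract, scoring a feature extraction method by the correlation CCA finds between the representations of documents and their translations, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`canonical_correlations`, `evaluation_score`), the ridge term `reg`, the use of the mean canonical correlation as the summary score, and the synthetic data are all assumptions made for illustration. For brevity the sketch also fits and scores on the same data, which a real evaluation would probably avoid.

```python
import numpy as np


def canonical_correlations(X, Y, reg=1e-3):
    """Canonical correlations between row-aligned feature matrices.

    X is (n_docs, p) for language 1 and Y is (n_docs, q) for language 2;
    row i of X and row i of Y represent a document and its translation.
    `reg` is a small ridge term (an assumption) that keeps the covariance
    matrices invertible.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)
    # Whiten both views with Cholesky factors: T = Lx^{-1} Cxy Ly^{-T}.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    T = np.linalg.solve(Lx, Cxy)
    T = np.linalg.solve(Ly, T.T).T
    # The singular values of T are the canonical correlations.
    rho = np.linalg.svd(T, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)


def evaluation_score(X, Y, reg=1e-3):
    """Illustrative summary score: the mean canonical correlation."""
    return float(canonical_correlations(X, Y, reg).mean())


# Synthetic stand-in for a parallel corpus: a shared latent "semantic
# content" Z generates the features observed in both languages.
rng = np.random.default_rng(0)
n_docs, n_topics, n_feats = 500, 3, 5
Z = rng.normal(size=(n_docs, n_topics))
noise = lambda: 0.1 * rng.normal(size=(n_docs, n_feats))

Y = Z @ rng.normal(size=(n_topics, n_feats)) + noise()       # language 2
X_good = Z @ rng.normal(size=(n_topics, n_feats)) + noise()  # keeps content
X_bad = rng.normal(size=(n_docs, n_feats))                   # ignores content

score_good = evaluation_score(X_good, Y)
score_bad = evaluation_score(X_bad, Y)
print(f"content-preserving features: {score_good:.3f}")
print(f"content-free features:       {score_bad:.3f}")
```

Under these assumptions, the representation that preserves the shared content receives the higher score, which is the behaviour the proposed evaluation relies on.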
(Received October 11 2010)
(Revised July 14 2011)
(Accepted July 31 2011)
(Online publication September 20 2011)
Footnotes
* We are grateful to the anonymous reviewers for their detailed and insightful comments on this paper. We also thank our colleagues Marcus Dobrinkat, Timo Honkela, Arto Klami, Oskar Kohonen, and Jaakko Väyrynen for their feedback and advice. SV, MP, TL, and KL belong to the Adaptive Informatics Research Centre, an Academy of Finland Centre of Excellence. AT was at the Helsinki Institute for Information Technology HIIT and the Department of Computer Science, University of Helsinki, when this work was done. SV was supported by the Graduate School of Language Technology in Finland, MP by the Finnish Graduate School in Language Studies, and KL by the Academy of Finland (decision number 218214).