Evaluating vector space models with canonical correlation analysis*

SAMI VIRPIOJA; MARI-SANNA PAUKKERI; ABHISHEK TRIPATHI; TIINA LINDH-KNUUTILA; KRISTA LAGUS

doi:10.1017/S1351324911000271

Published online by Cambridge University Press: 20 September 2011

SAMI VIRPIOJA, MARI-SANNA PAUKKERI, TIINA LINDH-KNUUTILA and KRISTA LAGUS: Department of Information and Computer Science, Aalto University School of Science, P.O. Box 15400, FI-00076 Aalto, Finland. e-mails: sami.virpioja@tkk.fi, mari-sanna.paukkeri@tkk.fi, tiina.lindh-knuutila@tkk.fi, krista.lagus@tkk.fi
ABHISHEK TRIPATHI: Department of Computer Science, University of Helsinki, Finland and Xerox Research Centre Europe (XRCE), 6, Chemin de Maupertuis, 38240 Meylan, France. e-mail: abhishektripathi.at@gmail.com

Abstract

Vector space models are used in language processing applications for calculating semantic similarities of words or documents. The vector spaces are generated with feature extraction methods for text data. However, evaluating the feature extraction methods can be difficult. Indirect evaluation in an application is often time-consuming and the results may not generalize to other applications, whereas direct evaluations that measure the amount of captured semantic information usually require human evaluators or annotated data sets. We propose a novel direct evaluation method based on canonical correlation analysis (CCA), the classical method for finding linear relationships between two data sets. In our setting, the two sets are parallel text documents in two languages. A good feature extraction method should provide representations that reflect the semantic content of the documents. Assuming that the underlying semantic content is independent of the language, we can identify the feature extraction methods that capture the content best by measuring dependence between the representations of a document and its translation. In the case of CCA, the applied measure of dependence is correlation. The evaluation method is based on unsupervised learning, is language- and domain-independent, and requires no resources besides a parallel corpus. In this paper, we demonstrate the evaluation method on a sentence-aligned parallel corpus. The method is validated by showing that the results obtained with bag-of-words representations are intuitive and agree well with previous findings. Moreover, we compare the proposed evaluation method with indirect evaluation methods in simple sentence matching tasks and with a quantitative manual evaluation of word translations. The results of the proposed method correlate well with the results of the indirect and manual evaluations.
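
The core idea, scoring a feature extraction method by how strongly the representations of a document and its translation correlate under CCA, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the ridge term `reg`, the use of the sum of canonical correlations as a score, and the synthetic data standing in for a parallel corpus are all assumptions made for illustration. A real evaluation would presumably also measure the correlations on held-out data rather than on the data used to fit CCA.

```python
import numpy as np


def canonical_correlations(X, Y, reg=1e-3):
    """Return the canonical correlations between two row-aligned views.

    X is (n, p) and Y is (n, q); row i of X and row i of Y describe the
    same document (for example, a sentence and its translation).
    `reg` is a small ridge term, an assumption added here for stability.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    cxy = X.T @ Y / (n - 1)
    # Whiten both views with their Cholesky factors; the singular values
    # of the whitened cross-covariance are the canonical correlations.
    lx = np.linalg.cholesky(cxx)
    ly = np.linalg.cholesky(cyy)
    t = np.linalg.solve(lx, cxy)    # lx^{-1} cxy
    t = np.linalg.solve(ly, t.T).T  # lx^{-1} cxy ly^{-T}
    return np.clip(np.linalg.svd(t, compute_uv=False), 0.0, 1.0)


def demo_scores(seed=0, n=500):
    """Score two hypothetical feature extractors on synthetic 'parallel' data."""
    rng = np.random.default_rng(seed)
    # Shared, language-independent content of each document (synthetic).
    content = rng.normal(size=(n, 3))
    # Features of the translations: a noisy linear image of the content.
    other_lang = content @ rng.normal(size=(3, 5)) + 0.1 * rng.normal(size=(n, 5))
    # A "good" extractor keeps the content; a "bad" one returns pure noise.
    good = content @ rng.normal(size=(3, 5)) + 0.1 * rng.normal(size=(n, 5))
    bad = rng.normal(size=(n, 5))
    score_good = canonical_correlations(good, other_lang).sum()
    score_bad = canonical_correlations(bad, other_lang).sum()
    return score_good, score_bad


if __name__ == "__main__":
    good_score, bad_score = demo_scores()
    print(f"good extractor: {good_score:.2f}, bad extractor: {bad_score:.2f}")
```

In this toy setting the extractor that preserves the shared content should obtain the higher score, which mirrors the reasoning in the abstract: representations that capture language-independent content are more strongly correlated with the representations of the translations.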

Type: Articles
Information: Natural Language Engineering, Volume 18, Issue 3, July 2012, pp. 399–436

DOI: https://doi.org/10.1017/S1351324911000271
Copyright: © Cambridge University Press 2011
