Adapting SVM for data sparseness and imbalance: a case study in information extraction

Published online by Cambridge University Press: 01 April 2009

YAOYONG LI, KALINA BONTCHEVA and HAMISH CUNNINGHAM
Affiliation:
Department of Computer Science, The University of Sheffield, Regent Court, 211 Portobello, Sheffield S1 4DP, UK
e-mail: yaoyong@dcs.shef.ac.uk, kalina@dcs.shef.ac.uk, hamish@dcs.shef.ac.uk

Abstract

Support Vector Machines (SVM) have been used successfully in many Natural Language Processing (NLP) tasks. The novel contribution of this paper is in investigating two techniques for making SVM more suitable for language learning tasks. Firstly, we propose an SVM with uneven margins (SVMUM) model to deal with the problem of imbalanced training data. Secondly, SVM active learning is employed to alleviate the difficulty of obtaining labelled training data. The algorithms are presented and evaluated on several Information Extraction (IE) tasks, where they achieved better performance than the standard SVM and the SVM with passive learning, respectively. Moreover, by combining SVMUM with the active learning algorithm, we achieve the best reported results on the seminars and jobs corpora, which are benchmark data sets used for evaluation and comparison of machine learning algorithms for IE. We also evaluate the token-based classification framework for IE with three different entity tagging schemes. In comparison to previous methods dealing with the same problems, our methods are both effective and efficient, which are valuable features for real-world applications. Due to the similarity in the formulation of the learning problem for IE and for other NLP tasks, the two techniques are likely to be beneficial in a wide range of applications.

Type
Papers
Copyright
Copyright © Cambridge University Press 2008
