Coping with highly imbalanced datasets: A case study with definition extraction in a multilingual setting

ROSA DEL GAUDIO; GUSTAVO BATISTA; ANTÓNIO BRANCO

doi:10.1017/S1351324912000381

Published online by Cambridge University Press: 11 February 2013

ROSA DEL GAUDIO: Affiliation:
Faculdade de Ciências, Departamento de Informática, University of Lisbon, Campo Grande, 1749-016 Lisboa, Portugal e-mails: rosa@di.fc.ul.pt, antonio.branco@di.fc.ul.pt
GUSTAVO BATISTA: Affiliation:
Department of Computer Science, University of São Paulo, PO Box 668, 13560-970 São Carlos, SP, Brazil e-mail: gbatista@icmc.usp.br
ANTÓNIO BRANCO: Affiliation:
Faculdade de Ciências, Departamento de Informática, University of Lisbon, Campo Grande, 1749-016 Lisboa, Portugal e-mails: rosa@di.fc.ul.pt, antonio.branco@di.fc.ul.pt

Abstract

This paper addresses the task of automatic definition extraction by thoroughly exploring an approach that relies solely on machine learning techniques, and by focusing on the imbalance of the relevant datasets. We obtained a breakthrough in the automatic extraction of definitions by extensively and systematically experimenting with different sampling techniques and their combinations, as well as with a range of different types of classifiers. Performance consistently fell in the range 0.95–0.99 in terms of area under the receiver operating characteristic (ROC) curve, a notable improvement of between 17 and 22 percentage points over the baseline of 0.73–0.77, for datasets with different rates of imbalance. The present paper thus also contributes to the seminal work in natural language processing that points toward the importance of applying sampling techniques to mitigate the bias induced by highly imbalanced datasets, thereby greatly improving the performance of the large range of tools that rely on them.
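
As a rough illustration of the two ideas the abstract combines, rebalancing a skewed training set by sampling and scoring a classifier by area under the ROC curve, the following is a minimal sketch in plain Python. It is not the authors' implementation: random over-sampling is only one of the simplest sampling techniques, and the function names random_oversample and roc_auc, the toy data, and the use of label 1 for the minority (definition) class are all assumptions made for this example.

```python
import random


def random_oversample(rows, labels, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal size.

    Hypothetical helper for illustration; returns new lists and leaves inputs untouched.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    # Nothing to do if the minority class is absent or already at least as large.
    if not minority_idx or len(minority_idx) >= len(majority_idx):
        return list(rows), list(labels)
    shortfall = len(majority_idx) - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(shortfall)]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def roc_auc(scores, labels, positive=1):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive example scores higher than a randomly chosen negative one
    (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data: nine non-definitions (label 0) and one definition (label 1).
toy_rows = [[float(i)] for i in range(10)]
toy_labels = [0] * 9 + [1]
balanced_rows, balanced_labels = random_oversample(toy_rows, toy_labels)

# Toy classifier scores for four sentences, two of each class.
toy_auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

Resampling of this kind would be applied to the training split only, so that the evaluation set keeps its original class distribution; AUC is a natural companion measure because it does not depend on a particular decision threshold.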

Type: Articles
Information: Natural Language Engineering, Volume 20, Issue 3, July 2014, pp. 327–359

DOI: https://doi.org/10.1017/S1351324912000381
Copyright: Copyright © Cambridge University Press 2013
