TwitterNEED: A hybrid approach for named entity extraction and disambiguation for tweet*

MENA B. HABIB; MAURICE VAN KEULEN

doi:10.1017/S1351324915000194

Published online by Cambridge University Press: 10 July 2015

MENA B. HABIB: Database Chair, University of Twente, Enschede, the Netherlands. e-mail: m.b.habib@ewi.utwente.nl
MAURICE VAN KEULEN: Database Chair, University of Twente, Enschede, the Netherlands. e-mail: m.vankeulen@ewi.utwente.nl

Abstract

Twitter is a rich source of continuously and instantly updated information. Shortness and informality of tweets are challenges for Natural Language Processing tasks. In this paper, we present TwitterNEED, a hybrid approach for Named Entity Extraction and Named Entity Disambiguation for tweets. We believe that disambiguation can help to improve the extraction process. This mimics the way humans understand language and reduces error propagation in the whole system. Our extraction approach aims for high extraction recall first, after which a Support Vector Machine attempts to filter out false positives among the extracted candidates using features derived from the disambiguation phase in addition to other word shape and Knowledge Base features. For Named Entity Disambiguation, we obtain a list of entity candidates from the YAGO Knowledge Base in addition to top-ranked pages from the Google search engine for each extracted mention. We use a Support Vector Machine to rank the candidate pages according to a set of URL and context similarity features. For evaluation, five data sets are used to evaluate the extraction approach, and three of them to evaluate both the disambiguation approach and the combined extraction and disambiguation approach. Experiments show better results compared to our competitors DBpedia Spotlight, Stanford Named Entity Recognition, and the AIDA disambiguation system.
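The feedback loop described in the abstract (extract generously, rank candidate pages for each mention, then use disambiguation evidence to discard weak mentions) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the function names, the two features, the use of the Dice coefficient as the similarity measure, and the hand-set weights and threshold that stand in for the trained Support Vector Machines. The candidate pages are hard-coded rather than fetched from YAGO or a search engine.

```python
import re

def candidate_mentions(tweet, max_len=3):
    """Recall-first extraction: propose every n-gram of up to max_len tokens."""
    tokens = re.findall(r"[#@]?\w+", tweet)
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(tokens) - n + 1)
    ]

def dice(a, b):
    """Dice coefficient between two token collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def features(mention, tweet, page):
    """Two illustrative features: URL similarity and context similarity."""
    m = mention.lower().split()
    url_tokens = re.findall(r"[a-z0-9]+", page["url"].lower())
    # Context = tweet tokens other than the mention itself.
    context = [t for t in re.findall(r"\w+", tweet.lower()) if t not in m]
    page_tokens = re.findall(r"\w+", page["text"].lower())
    return (dice(m, url_tokens), dice(context, page_tokens))

# Hypothetical hand-set weights; a trained ranker would learn these instead.
WEIGHTS = (0.6, 0.4)

def rank_pages(mention, tweet, pages, weights=WEIGHTS):
    """Return (score, url) pairs, best candidate first."""
    scored = [
        (sum(w * f for w, f in zip(weights, features(mention, tweet, p))), p["url"])
        for p in pages
    ]
    return sorted(scored, reverse=True)

def keep_mention(mention, tweet, pages, threshold=0.2):
    """Filter step: keep a mention only if its best page scores well enough."""
    ranked = rank_pages(mention, tweet, pages)
    return bool(ranked) and ranked[0][0] >= threshold

# Toy data: two hard-coded candidate pages for the mention "Paris".
tweet = "Obama lands in Paris, capital of France"
pages = [
    {"url": "https://en.wikipedia.org/wiki/Paris",
     "text": "Paris is the capital of France"},
    {"url": "https://en.wikipedia.org/wiki/Paris_Hilton",
     "text": "Paris Hilton is an American media personality"},
]
print(rank_pages("Paris", tweet, pages))
print(keep_mention("Paris", tweet, pages))
```

In this toy setup the same scores serve both purposes: they order the candidate pages and they act as evidence about whether the mention is a real entity, which is the sense in which disambiguation can help extraction.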

Type: Articles
Information: Natural Language Engineering, Volume 22, Issue 3, May 2016, pp. 423–456

DOI: https://doi.org/10.1017/S1351324915000194
Copyright: Copyright © Cambridge University Press 2015

Footnotes

The authors would like to thank Zhemin Zhu for sharing his CRF model (Zhu et al. 2013) and assisting us in applying it. This work is supported by the Dutch national research program COMMIT.

References

Abeel, T., Van de Peer, Y., and Saeys, Y. 2009. Java-ml: a machine learning library. Journal of Machine Learning Research 10: 931–4.

Basave, A. E. C., Varga, A., Rowe, M., Stankovic, M., and Dadzie, A.-S. 2013. Making sense of microposts (#msm2013) concept extraction challenge. In Making Sense of Microposts (#MSM2013) Concept Extraction Challenge, Rio de Janeiro, Brazil, pp. 1–15.

Bontcheva, K., Derczynski, L., Funk, A., Greenwood, M., Maynard, D., and Aswani, N. 2013. Twitie: An open-source information extraction pipeline for microblog text. In Proceedings of the International Conference on Recent Advances in Natural Language Processing. Association for Computational Linguistics, Hissar, Bulgaria, pp. 83–90.

Bunescu, R. C. and Pasca, M. 2006. Using encyclopedic knowledge for named entity disambiguation. In EACL, Trento, Italy, pp. 9–16.

Cano Basave, A. E., Rizzo, G., Varga, A., Rowe, M., Stankovic, M., and Dadzie, A.-S. 2014. Making sense of microposts (#microposts2014) named entity extraction & linking challenge. In Proceedings of the 4th Workshop on Making Sense of Microposts (#Microposts2014), Seoul, South Korea, pp. 54–60.

Castillo, C., Mendoza, M., and Poblete, B. 2011. Information credibility on twitter. In Proceedings of the 20th International Conference on World Wide Web, Hyderabad, India. ACM, pp. 675–84.

Chang, C.-C. and Lin, C.-J. 2011. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2 (3–27): 1–27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Christoforaki, M., Erunse, I., and Yu, C. 2011. Searching social updates for topic-centric entities. In Proceedings of the 1st International Workshop on Searching and Integrating New Web Data Sources – Very Large Data Search (VLDS), Seattle, WA, USA, pp. 34–9.

Cucerzan, S. 2007. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, pp. 708–16.

Cunningham, H., Maynard, D., Bontcheva, K., and Tablan, V. 2002. GATE: a framework and graphical development environment for robust NLP tools and applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02), Philadelphia, Pennsylvania, USA, pp. 168–75.

Dann, S. 2010. Twitter content classification. First Monday 15 (12), http://firstmonday.org/ojs/index.php/fm/article/viewArticle/2745/2681.

Davis, A., Veloso, A., da Silva, A. S., Meira, W. Jr, and Laender, A. H. F. 2012. Named entity disambiguation in streaming data. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers – Volume 1, ACL ’12, Jeju Island, Korea, pp. 815–24.

Delgado, A. D., Martínez, R., Pérez García-Plaza, A., and Fresno, V. 2012. Unsupervised Real-Time company name disambiguation in twitter. In Workshop on Real-Time Analysis and Mining of Social Streams (RAMSS), Palo Alto, California, USA, pp. 25–8.

Derczynski, L. and Bontcheva, K. 2013. Mining social media with linked open data, entity recognition, and event extraction. In Proceedings of the 3rd Workshop on Data Extraction and Object Search (DEOS 2013), Oxford, UK.

Dice, L. R. 1945. Measures of the amount of ecologic association between species. Ecology 26 (3): 297–302.

Finkel, J. R., Grenager, T., and Manning, C. 2005. Incorporating non-local information into information extraction systems by gibbs sampling. In ACL, University of Michigan, USA, pp. 363–70.

Gimpel, K., Schneider, N., O’Connor, B., Das, D., Mills, D., Eisenstein, J., Heilman, M., Yogatama, D., Flanigan, J., and Smith, N. A. 2011. Part-of-speech tagging for twitter: annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short papers – Volume 2, HLT ’11, Portland, Oregon, USA, pp. 42–7.

Gupta, P., Goel, A., Lin, J., Sharma, A., Wang, D., and Zadeh, R. 2013. Wtf: the who to follow service at twitter. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13, Rio de Janeiro, Brazil, pp. 505–14.

Habib, M. B. and van Keulen, M. 2012a. Improving toponym disambiguation by iteratively enhancing certainty of extraction. In Proceedings of the 4th International Conference on Knowledge Discovery and Information Retrieval, KDIR 2012, Barcelona, Spain. SciTePress, pp. 399–410.

Habib, M. B. and van Keulen, M. 2012b. Unsupervised improvement of named entity extraction in short informal context using disambiguation clues. In Proc. of the Workshop on Semantic Web and Information Extraction (SWAIE 2012), Galway, Ireland, pp. 1–10.

Habib, M. B. and van Keulen, M. 2013. A hybrid approach for robust multilingual toponym extraction and disambiguation. In IIS, Warsaw, Poland, pp. 1–15.

Hoffart, J., Yosef, M. A., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., Taneva, B., Thater, S., and Weikum, G. 2011. Robust disambiguation of named entities in text. In Proceedings of EMNLP 2011, Edinburgh, Scotland, UK, pp. 782–92.

Howard, P. and Hussain, M. 2013. Democracy’s Fourth Wave?: Digital Media and the Arab Spring, Oxford Studies in Digital Politics. USA: OUP.

Jung, J. J. 2012. Online named entity recognition method for microtexts in social networking services: a case study of twitter. Expert Systems with Applications 39 (9): 8066–70.

Kulkarni, S., Singh, A., Ramakrishnan, G., and Chakrabarti, S. 2009. Collective annotation of wikipedia entities in web text. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, Paris, France, pp. 457–66.

Li, C., Weng, J., He, Q., Yao, Y., Datta, A., Sun, A., and Lee, B.-S. 2012. Twiner: named entity recognition in targeted twitter stream. In SIGIR, Portland, Oregon, USA, pp. 721–30.

Li, L., Yu, Z., Zou, J., Su, L., Xian, Y., and Mao, C. 2009. Research on the method of entity homepage recognition. Journal of Computational Information Systems (JCIS) 5 (4): 1617–24.

Lin, T., Mausam, and Etzioni, O. 2012. Entity linking at web scale. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX), Montreal, Canada, pp. 84–8.

Locke, B. and Martin, J. 2009. Named entity recognition: adapting to microblogging. Senior Thesis, University of Colorado.

MacKay, D. J. and Peto, L. C. B. 1994. A hierarchical dirichlet language model. Natural Language Engineering 1: 1–19.

Marsh, E. and Perzanowski, D. 1998. Muc-7 evaluation of ie technology: overview of results. In Proceedings of the 7th Message Understanding Conference (MUC-7).

McCallum, A. and Li, W. 2003. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of CoNLL 2003, Edmonton, Canada, pp. 188–91.

Mendes, P. N., Jakob, M., García-Silva, A., and Bizer, C. 2011. Dbpedia spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems, I-Semantics ’11, New York, NY, USA. ACM, pp. 1–8.

Ritter, A., Clark, S., Mausam, and Etzioni, O. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of EMNLP 2011, Edinburgh, Scotland, UK, pp. 1524–34.

Rizzo, G. and Troncy, R. 2011. Nerd: Evaluating named entity recognition tools in the web of data. In ISWC’11, Workshop on Web Scale Knowledge Extraction (WEKEX’11), Bonn, Germany.

Spina, D., Amigó, E., and Gonzalo, J. 2011. Filter keywords and majority class strategies for company name disambiguation in twitter. In Proceedings of the 2nd International Conference on Multilingual and Multimodal Information Access Evaluation, CLEF’11, Amsterdam, The Netherlands, pp. 50–61.

Srinivasan, H., Chen, J., and Srihari, R. 2009. Cross document person name disambiguation using entity profiles. In Proceedings of the Text Analysis Conference (TAC) Workshop, Gaithersburg, Maryland, USA.

Steiner, T., Verborgh, R., Gabarró Vallés, J., and Van de Walle, R. 2013. Adding meaning to social network microposts via multiple named entity disambiguation apis and tracking their data provenance. International Journal of Computer Information Systems and Industrial Management 5: 69–78.

Suchanek, F. M., Kasneci, G., and Weikum, G. 2007. Yago: a core of semantic knowledge. In Proc. of the 16th International Conference on World Wide Web, WWW ’07, Banff, Alberta, Canada, pp. 697–706.

Sullivan, S. J., Schneiders, A. G., Cheang, C.-W., Kitto, E., Lee, H., Redhead, J., Ward, S., Ahmed, O. H., and McCrory, P. R. 2012. What’s happening? A content analysis of concussion-related traffic on twitter. British Journal of Sports Medicine 46 (4): 258–63.

Sutton, C. and McCallum, A. 2005. Piecewise training of undirected models. In Proceedings of UAI, Edinburgh, Scotland, UK, pp. 568–75.

Verma, M., Divya, and Sofat, S. 2014. Techniques to detect spammers in twitter: a survey. International Journal of Computer Applications 85 (10): 27–32.

Wang, C., Chakrabarti, K., Cheng, T., and Chaudhuri, S. 2012. Targeted disambiguation of ad-hoc, homogeneous sets of named entities. In Proceedings of the 21st International Conference on World Wide Web, WWW ’12, Lyon, France, pp. 719–28.

Wang, K., Thrasher, C., Viegas, E., Li, X., and Hsu, B.-J. P. 2010. An overview of microsoft web n-gram corpus and applications. In Proceedings of the NAACL HLT 2010, Los Angeles, California, USA, pp. 45–8.

Westerveld, T., Kraaij, W., and Hiemstra, D. 2002. Retrieving web pages using content, links, urls and anchors. In Proceedings of the 10th Text REtrieval Conference, TREC 2001, vol. SP 500, Gaithersburg, Maryland, USA, pp. 663–72.

Winkels, M. 2013. The global social network landscape: a country-by-country guide to social network usage. http://www.optimediaintelligence.es/noticias_archivos/719_20130715123913.pdf.

Wu, T.-F., Lin, C.-J., and Weng, R. C. 2004. Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research 5: 975–1005.

Yerva, S. R., Miklós, Z., and Aberer, K. 2012. Entity-based classification of twitter messages. IJCSA 9 (1): 88–115.

Yosef, M., Hoffart, J., Bordino, I., Spaniol, M., and Weikum, G. 2011. Aida: An online tool for accurate disambiguation of named entities in text and tables. Proc. of the VLDB Endowment 4 (12): 1450–53.

Zhai, C. and Lafferty, J. 2001. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’01, New Orleans, Louisiana, USA, pp. 334–42.

Zhu, Z., Hiemstra, D., Apers, P. M. G., and Wombacher, A. 2012. Separate training for conditional random fields using co-occurrence rate factorization. Technical Report TR-CTIT-12-29, Centre for Telematics and Information Technology, University of Twente, Enschede.

Zhu, Z., Hiemstra, D., Apers, P. M. G., and Wombacher, A. 2013. Closed form maximum likelihood estimator of conditional random fields. Technical Report TR-CTIT-13-03, Centre for Telematics and Information Technology, University of Twente, Enschede.
