NERA 2.0: Improving coverage and performance of rule-based named entity recognition for Arabic*

MAI OUDAH; KHALED SHAALAN

doi:10.1017/S1351324916000097

Published online by Cambridge University Press: 06 May 2016

MAI OUDAH: Masdar Institute of Science and Technology, Abu Dhabi, UAE. E-mail: moudah@masdar.ac.ae
KHALED SHAALAN: The British University in Dubai, Dubai International Academic City, Dubai, UAE. E-mail: khaled.shaalan@buid.ac.ae

Abstract

Named Entity Recognition (NER) is an essential task for many natural language processing systems, and it makes use of various linguistic resources. NER becomes more complicated when the language in use is morphologically rich and structurally complex, such as Arabic. This language has a set of characteristics that makes it particularly challenging to handle. In previous work, we proposed an Arabic NER system that follows the hybrid approach, i.e. integrates both rule-based and machine learning-based NER approaches. Our hybrid NER system is the state of the art in Arabic NER according to its performance on standard evaluation datasets. In this article, we discuss a novel methodology for overcoming the coverage drawback of rule-based NER systems in order to improve their performance and allow for automated rule update. The presented mechanism utilizes the recognition decisions made by the hybrid NER system to identify the weaknesses of the rule-based component and to derive new linguistic rules that enhance the rule base, which helps in achieving more reliable and accurate results. We used the ACE 2004 Newswire (NW) standard dataset as a resource for extracting and analyzing new linguistic rules for recognizing person, location and organization names. We formulate each new rule based on two distinctive feature groups, i.e. gazetteers for each named entity type and part-of-speech (POS) tags, in particular noun and proper noun. Fourteen new patterns are derived, formulated as grammar rules, and evaluated in terms of coverage. The conducted experiments exploit a POS-tagged version of the ACE 2004 NW dataset. The empirical results show that the enhanced rule-based system, i.e. NERA 2.0, improves the coverage of the previously misclassified person, location and organization named entity types by 69.93 per cent, 57.09 per cent and 54.28 per cent, respectively.
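
The abstract does not list the fourteen patterns, so the following is only a minimal sketch of the general idea of a rule that combines a gazetteer lookup with POS tags. The gazetteer entries, the `NNP` tag name, the function name `match_person` and the English example tokens are all hypothetical and are not taken from NERA 2.0.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, POS tag)

# Hypothetical gazetteer of person-name trigger words (illustrative only).
PERSON_TRIGGERS = {"president", "minister", "dr"}


def match_person(tokens: List[Token]) -> List[Tuple[int, int]]:
    """Return (start, end) token spans for the assumed pattern:
    a trigger word from the gazetteer followed by one or more proper nouns.
    The span covers only the proper nouns; `end` is exclusive."""
    spans = []
    i = 0
    while i < len(tokens):
        word, _ = tokens[i]
        if word.lower() in PERSON_TRIGGERS:
            j = i + 1
            # Extend the candidate name over consecutive proper nouns.
            while j < len(tokens) and tokens[j][1] == "NNP":
                j += 1
            if j > i + 1:
                spans.append((i + 1, j))
                i = j
                continue
        i += 1
    return spans


# Toy POS-tagged sentence (assumed tags, English tokens for readability).
sentence = [
    ("President", "NN"),
    ("Barack", "NNP"),
    ("Obama", "NNP"),
    ("visited", "VBD"),
    ("Cairo", "NNP"),
]
print(match_person(sentence))
```

In this sketch the proper noun "Cairo" is not matched because no trigger word precedes it; a separate location rule with its own gazetteer would be needed, which mirrors the per-type feature groups described above.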

Type: Articles
Information: Natural Language Engineering, Volume 23, Issue 3, May 2017, pp. 441–472

DOI: https://doi.org/10.1017/S1351324916000097
Copyright: Copyright © Cambridge University Press 2016

Footnotes

This research was partially funded by the British University in Dubai (Grant No. INF004-Using machine learning to improve Arabic named entity recognition).
