NERA 2.0: Improving coverage and performance of rule-based named entity recognition for Arabic*

Published online by Cambridge University Press: 06 May 2016

MAI OUDAH
Affiliation:
Masdar Institute of Science and Technology, Abu Dhabi, UAE e-mail: moudah@masdar.ac.ae
KHALED SHAALAN
Affiliation:
The British University in Dubai, Dubai International Academic City, Dubai, UAE e-mail: khaled.shaalan@buid.ac.ae

Abstract

Named Entity Recognition (NER) is an essential task for many natural language processing systems and draws on a variety of linguistic resources. NER becomes more complicated when the language in use is morphologically rich and structurally complex, as Arabic is; the language has a set of characteristics that make it particularly challenging to handle. In previous work, we proposed an Arabic NER system that follows the hybrid approach, i.e. it integrates rule-based and machine learning-based NER approaches. Our hybrid NER system is the state of the art in Arabic NER according to its performance on standard evaluation datasets. In this article, we discuss a novel methodology for overcoming the coverage drawback of rule-based NER systems in order to improve their performance and allow for automated rule updates. The presented mechanism uses the recognition decisions made by the hybrid NER system to identify the weaknesses of the rule-based component and to derive new linguistic rules that enhance the rule base, which helps in achieving more reliable and accurate results. We used the ACE 2004 Newswire (NW) standard dataset as a resource for extracting and analyzing new linguistic rules for recognizing person, location and organization names. We formulate each new rule based on two distinctive feature groups: gazetteers for each type of named entity and part-of-speech (POS) tags, in particular noun and proper noun. Fourteen new patterns are derived, formulated as grammar rules, and evaluated in terms of coverage. The experiments use a POS-tagged version of the ACE 2004 NW dataset. The empirical results show that the enhanced rule-based system, NERA 2.0, improves the coverage of previously misclassified person, location and organization named entities by 69.93 per cent, 57.09 per cent and 54.28 per cent, respectively.
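
To make the idea of a rule built from a gazetteer feature and a POS-tag feature concrete, the following is a minimal Python sketch. It is not one of the fourteen patterns derived in the article, and it does not reproduce the system's actual grammar formalism. The trigger-word gazetteer, the `NNP` tag name, the function name `match_person_rule` and the English placeholder tokens are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, POS tag)

# Hypothetical gazetteer of words that often introduce a person name.
# Illustrative only; not taken from the article's gazetteers.
PERSON_TRIGGERS = {"president", "dr", "sheikh"}


def match_person_rule(tokens: List[Token]) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, for each run of proper
    nouns (tag 'NNP') that directly follows a gazetteer trigger word."""
    spans = []
    i = 0
    while i < len(tokens):
        word, _ = tokens[i]
        if word.lower() in PERSON_TRIGGERS:
            j = i + 1
            # Extend the candidate name while tokens are proper nouns.
            while j < len(tokens) and tokens[j][1] == "NNP":
                j += 1
            if j > i + 1:  # at least one proper noun followed the trigger
                spans.append((i + 1, j))
                i = j
                continue
        i += 1
    return spans


# Placeholder sentence with assumed POS tags.
sentence = [("President", "NN"), ("Ahmed", "NNP"), ("Ali", "NNP"),
            ("visited", "VBD"), ("Cairo", "NNP")]
person_spans = match_person_rule(sentence)
```

In this sketch the proper noun "Cairo" is not matched because no trigger precedes it. A rule with good coverage therefore needs both feature groups to fire together, which is the kind of gap the described methodology aims to find and close.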

Type
Articles
Copyright
Copyright © Cambridge University Press 2016 

Footnotes

* This research was partially funded by the British University in Dubai (Grant No. INF004-Using machine learning to improve Arabic named entity recognition).
