Arabic spelling error detection and correction†

MOHAMMED ATTIA; PAVEL PECINA; YOUNES SAMIH; KHALED SHAALAN; JOSEF VAN GENABITH

doi:10.1017/S1351324915000030


Published online by Cambridge University Press: 18 March 2015


MOHAMMED ATTIA: School of Computing, Dublin City University, Ireland, e-mail: mattia@computing.dcu.ie, josef@computing.dcu.ie; Faculty of Engineering and IT, The British University in Dubai, UAE, e-mail: khaled.shaalan@buid.ac.ae
PAVEL PECINA: Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic, e-mail: pecina@ufal.mff.cuni.cz
YOUNES SAMIH: Department of Linguistics and Information Science, Heinrich-Heine-Universität Düsseldorf, Germany, e-mail: samih@phil.uni-duesseldorf.de
KHALED SHAALAN: Faculty of Engineering and IT, The British University in Dubai, UAE, e-mail: khaled.shaalan@buid.ac.ae
JOSEF VAN GENABITH: School of Computing, Dublin City University, Ireland, e-mail: mattia@computing.dcu.ie, josef@computing.dcu.ie


Abstract

A spelling error detection and correction application is typically based on three main components: a dictionary (or reference word list), an error model and a language model. While most of the attention in the literature has been directed to the language model, we show how improvements in any of the three components can lead to significant cumulative improvements in the overall performance of the system. We develop our dictionary of 9.2 million fully-inflected Arabic words (types) from a morphological transducer and a large corpus, validated and manually revised. We improve the error model by analyzing error types and creating an edit distance re-ranker. We also improve the language model by analyzing the level of noise in different data sources and selecting an optimal subset to train the system on. Testing and evaluation experiments show that our system significantly outperforms Microsoft Word 2013, OpenOffice Ayaspell 3.4 and Google Docs.
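To make the three-component design concrete, here is a minimal sketch of how a dictionary, an error model and a language model can fit together. It is not the system described in this article: the word list, the counts, the Latin alphabet and the function names (`edits1`, `is_error`, `suggest`) are all hypothetical placeholders, the error model is plain single-edit candidate generation, and the language model is reduced to unigram counts.

```python
from collections import Counter

# Toy stand-ins (assumptions): a real system would use a large word list
# and n-gram statistics estimated from a cleaned corpus.
DICTIONARY = {"spell", "spells", "shell", "smell", "sell", "checker"}
UNIGRAM_COUNTS = Counter(
    {"spell": 50, "sell": 30, "shell": 20, "smell": 10, "checker": 8, "spells": 5}
)
ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # replace with the target script's letters


def edits1(word, alphabet=ALPHABET):
    """Strings reachable from `word` with at most one deletion, transposition,
    substitution or insertion (a simple stand-in for an error model)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    substitutes = [
        left + ch + right[1:] for left, right in splits if right for ch in alphabet
    ]
    inserts = [left + ch + right for left, right in splits for ch in alphabet]
    return set(deletes + transposes + substitutes + inserts)


def is_error(word):
    """Detection step: a word is flagged if it is not in the dictionary."""
    return word not in DICTIONARY


def suggest(word, max_suggestions=3):
    """Correction step: keep candidates that are dictionary words, then rank
    them by unigram count (ties broken alphabetically)."""
    if not is_error(word):
        return [word]
    candidates = edits1(word) & DICTIONARY
    ranked = sorted(candidates, key=lambda w: (-UNIGRAM_COUNTS[w], w))
    return ranked[:max_suggestions]


if __name__ == "__main__":
    for token in ["spel", "snell", "shell"]:
        print(token, "->", suggest(token))
```

In this toy setup each component can be swapped independently: a larger word list changes what is detected, a richer candidate generator or re-ranker changes which corrections are proposed, and a better-trained language model changes their order.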

Type: Articles
Information: Natural Language Engineering, Volume 22, Issue 5, September 2016, pp. 751–773

DOI: https://doi.org/10.1017/S1351324915000030
Copyright: © Cambridge University Press 2015


Footnotes

†

We are grateful to our anonymous reviewers, whose comments and suggestions have helped us to improve the paper considerably. This research is funded by the Irish Research Council for Science, Engineering and Technology (IRCSET), the UAE National Research Foundation (NRF) (Grant No. 0514/2011), the Czech Science Foundation (Grant No. P103/12/G084), DFG Collaborative Research Centre 991: The Structure of Representations in Language, Cognition, and Science (http://www.sfb991.uni-duesseldorf.de/sfb991), and Science Foundation Ireland (Grant No. 07/CE/I1142) as part of the Centre for Next Generation Localisation (www.cngl.ie) at Dublin City University.

