Selection of correction candidates for the normalization of Spanish user-generated content

M. MELERO; M.R. COSTA-JUSSÀ; P. LAMBERT; M. QUIXAL

doi:10.1017/S1351324914000011

Published online by Cambridge University Press: 24 February 2014

M. MELERO: Affiliation:
Grup de Lingüística Computacional, Universitat Pompeu Fabra, Roc Boronat, 138, 08018 Barcelona, Catalunya, Spain. e-mail: maite.melero@upf.edu, patrik.lambert@upf.edu
M.R. COSTA-JUSSÀ: Affiliation:
Institute for Infocomm Research, Human Language Technology Group, 1 Fusionopolis Way, 21-01 Connexis (South Tower), Singapore 138632. e-mail: vismrc@i2r.a-star.edu.sg
P. LAMBERT: Affiliation:
Grup de Lingüística Computacional, Universitat Pompeu Fabra, Roc Boronat, 138, 08018 Barcelona, Catalunya, Spain. e-mail: maite.melero@upf.edu, patrik.lambert@upf.edu
M. QUIXAL: Affiliation:
Department of Spanish and Portuguese, The University of Texas at Austin, 150 W 21st Street, Austin, TX 78712, USA. e-mail: marti.quixal@utexas.edu

Abstract

We present research aiming to build tools for the normalization of User-Generated Content (UGC). We argue that processing this type of text requires the revisiting of the initial steps of Natural Language Processing, since UGC (micro-blog, blog, and, generally, Web 2.0 user-generated texts) presents a number of nonstandard communicative and linguistic characteristics – often closer to oral and colloquial language than to edited text. We present a corpus of UGC text in Spanish from three different sources: Twitter, consumer reviews, and blogs, and describe its main characteristics. We motivate the need for UGC text normalization by analyzing the problems found when processing this type of text through a conventional language processing pipeline, particularly in the tasks of lemmatization and morphosyntactic tagging. Our aim with this paper is to seize the power of already existing spell and grammar correction engines and endow them with automatic normalization capabilities in order to pave the way for the application of standard Natural Language Processing tools to typical UGC text. Particularly, we propose a strategy for automatically normalizing UGC by adding a module on top of a pre-existing spell-checker that selects the most plausible correction from an unranked list of candidates provided by the spell-checker. To build this selector module we train four language models, each one containing a different type of linguistic information in a trade-off with its generalization capabilities. Our experiments show that the models trained on truecase and lowercase word forms are more discriminative than the others at selecting the best candidate. We have also experimented with a parametrized combination of the models by both optimizing directly on the selection task and doing a linear interpolation of the models. The resulting parametrized combinations obtain results close to the best performing model but do not improve on those results, as measured on the test set. 
On the test corpora, the precision of the selector module in ranking the expected correction first reaches 82.5% for Twitter text (baseline 57%) and 88% for non-Twitter text (baseline 64%).
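The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the toolkit, n-gram order, smoothing, or interpolation weights, so everything below is an assumption made for illustration: a toy add-one smoothed bigram model, a two-model linear interpolation over truecase and lowercase word forms, and an invented four-sentence training corpus. The names `BigramLM`, `interpolated_logprob`, and `select_candidate` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


class BigramLM:
    """Toy add-one smoothed bigram model (stand-in for a real LM toolkit)."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for tokens in sentences:
            padded = ["<s>"] + tokens + ["</s>"]
            self.unigrams.update(padded)
            self.bigrams.update(zip(padded, padded[1:]))
        self.vocab_size = len(self.unigrams)

    def prob(self, prev, word):
        # Add-one smoothing keeps unseen bigrams from getting zero probability.
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + self.vocab_size)


def interpolated_logprob(models, weights, views):
    """Sentence log-probability under a linear interpolation of the models.

    views[i] is the sentence as seen by models[i] (here: truecase, lowercase).
    """
    padded = [["<s>"] + view + ["</s>"] for view in views]
    total = 0.0
    for pos in range(1, len(padded[0])):
        mixed = sum(w * m.prob(v[pos - 1], v[pos])
                    for m, w, v in zip(models, weights, padded))
        total += math.log(mixed)
    return total


def select_candidate(tokens, index, candidates, models, weights):
    """Pick the candidate that gives the best-scoring sentence at position `index`."""
    def score(candidate):
        sentence = tokens[:index] + [candidate] + tokens[index + 1:]
        views = [sentence, [t.lower() for t in sentence]]
        return interpolated_logprob(models, weights, views)
    return max(candidates, key=score)


# Invented toy corpus; a real system would train on a large normalized corpus.
corpus = [
    ["me", "gusta", "que", "vengas"],
    ["creo", "que", "no"],
    ["el", "queso", "es", "bueno"],
    ["dice", "que", "si"],
]
truecase_lm = BigramLM(corpus)
lowercase_lm = BigramLM([[t.lower() for t in s] for s in corpus])

# Unranked candidate list, as a spell-checker might return for the token "q".
best = select_candidate(
    tokens=["creo", "q", "no"],
    index=1,
    candidates=["que", "queso", "quo"],
    models=[truecase_lm, lowercase_lm],
    weights=[0.5, 0.5],  # assumed equal weights; the paper tunes its combination
)
print(best)
```

In this sketch the candidate list is scored purely by sentence-level language model probability, so the spell-checker only needs to supply candidates, not a ranking. The equal weights are a placeholder; the abstract reports that the tuned combinations came close to, but did not beat, the best single model.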

Type: Articles
Information: Natural Language Engineering, Volume 22, Issue 1, January 2016, pp. 135–161

DOI: https://doi.org/10.1017/S1351324914000011
Copyright: Copyright © Cambridge University Press 2014
