
Selection of correction candidates for the normalization of Spanish user-generated content

Published online by Cambridge University Press: 24 February 2014

M. MELERO
Affiliation:
Grup de Lingüística Computacional, Universitat Pompeu Fabra, Roc Boronat, 138, 08018 Barcelona, Catalunya, Spain e-mail: maite.melero@upf.edu, patrik.lambert@upf.edu
M.R. COSTA-JUSSÀ
Affiliation:
Institute for Infocomm Research, Human Language Technology Group, 1 Fusionopolis Way, 21-01 Connexis (South Tower), Singapore 138632 e-mail: vismrc@i2r.a-star.edu.sg
P. LAMBERT
Affiliation:
Grup de Lingüística Computacional, Universitat Pompeu Fabra, Roc Boronat, 138, 08018 Barcelona, Catalunya, Spain e-mail: maite.melero@upf.edu, patrik.lambert@upf.edu
M. QUIXAL
Affiliation:
Department of Spanish and Portuguese, The University of Texas at Austin, 150 W 21st Street, Austin, TX 78712, USA e-mail: marti.quixal@utexas.edu

Abstract

We present research aimed at building tools for the normalization of User-Generated Content (UGC). We argue that processing this type of text requires revisiting the initial steps of Natural Language Processing, since UGC (micro-blogs, blogs, and Web 2.0 user-generated text in general) presents a number of nonstandard communicative and linguistic characteristics and is often closer to oral and colloquial language than to edited text. We present a corpus of Spanish UGC drawn from three sources (Twitter, consumer reviews, and blogs) and describe its main characteristics. We motivate the need for UGC normalization by analyzing the problems that arise when this type of text is processed through a conventional language processing pipeline, particularly in lemmatization and morphosyntactic tagging. Our aim is to harness existing spell and grammar correction engines and endow them with automatic normalization capabilities, paving the way for the application of standard Natural Language Processing tools to typical UGC text. Specifically, we propose a strategy for automatically normalizing UGC by adding a module on top of a pre-existing spell-checker; the module selects the most plausible correction from the unranked list of candidates that the spell-checker provides. To build this selector module we train four language models, each containing a different type of linguistic information, in a trade-off with its generalization capabilities. Our experiments show that the models trained on truecase and lowercase word forms are more discriminative than the others at selecting the best candidate. We have also experimented with parametrized combinations of the models, obtained both by optimizing directly on the selection task and by linear interpolation of the models. The resulting combinations obtain results close to those of the best-performing model but do not improve on them, as measured on the test set. The precision of the selector module in ranking the expected correction first on the test corpora reaches 82.5% for Twitter text (baseline 57%) and 88% for non-Twitter text (baseline 64%).
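
To make the idea of the selector module concrete, the following is a minimal sketch, not the authors' implementation. It assumes toy add-one-smoothed bigram models trained on a four-sentence corpus invented for illustration, one on truecase and one on lowercase word forms, and combines them by linear interpolation with hand-set weights. The names `BigramLM`, `score`, and `select` are hypothetical, and the candidate list stands in for the unranked output of a spell-checker.

```python
import math
from collections import Counter


class BigramLM:
    """Toy add-one-smoothed bigram model (illustrative assumption only)."""

    def __init__(self, sentences, lowercase=False):
        self.lowercase = lowercase
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            tokens = ["<s>"] + self._norm(sentence.split()) + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        # One extra vocabulary slot reserves probability mass for unseen words.
        self.vocab_size = len(self.unigrams) + 1

    def _norm(self, tokens):
        return [t.lower() for t in tokens] if self.lowercase else list(tokens)

    def prob(self, prev, word):
        # Add-one smoothed estimate of P(word | prev).
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + self.vocab_size)

    def bigram_probs(self, tokens):
        tokens = ["<s>"] + self._norm(tokens) + ["</s>"]
        return [self.prob(p, w) for p, w in zip(tokens, tokens[1:])]


def score(tokens, models, weights):
    """Log-probability of a sentence under a linear interpolation of the models."""
    per_model = [m.bigram_probs(tokens) for m in models]
    return sum(
        math.log(sum(w * p for w, p in zip(weights, probs)))
        for probs in zip(*per_model)
    )


def select(left, candidates, right, models, weights):
    """Return the candidate that makes the surrounding sentence most probable."""
    return max(candidates, key=lambda c: score(left + [c] + right, models, weights))


# Invented toy corpus; a real system would train on large standard-language text.
corpus = [
    "creo que no viene hoy",
    "dice que no puede",
    "el queso no me gusta",
    "Creo que tiene razón",
]
models = [BigramLM(corpus), BigramLM(corpus, lowercase=True)]

# Noisy input "creo q no viene": pick a correction for "q" from an unranked list.
best = select(["creo"], ["queso", "que", "qué"], ["no", "viene"], models, [0.5, 0.5])
print(best)
```

In this sketch the interpolation weights are fixed by hand; the abstract describes setting the combination parameters either by optimizing directly on the selection task or by linear interpolation, neither of which is reproduced here.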

Type: Articles
Copyright © Cambridge University Press 2014

