(Un/Semi-)supervised SMS text message SPAM detection

Published online by Cambridge University Press:  15 October 2014

CHRIS R. GIANNELLA
Affiliation:
The MITRE Corporation, 7515 Colshire Drive, McLean, VA 22102, USA email: cgiannella@mitre.org, rwinder@mitre.org
RANSOM WINDER
Affiliation:
The MITRE Corporation, 7515 Colshire Drive, McLean, VA 22102, USA email: cgiannella@mitre.org, rwinder@mitre.org
BRANDON WILSON
Affiliation:
Department of Computer Science, University of Maryland, College Park, MD 20742, USA email: bswilson@cs.umd.edu

Abstract

We address the problem of unsupervised and semi-supervised SMS (Short Message Service) text message SPAM detection. We develop a content-based Bayesian classification approach which is a modest extension of the technique discussed by Resnik and Hardisty in 2010. The approach assumes that the bodies of the SMS messages arise from a probabilistic generative model and estimates the model parameters by Gibbs sampling using an unlabeled, or partially labeled, SMS training message corpus. The approach classifies new SMS messages as SPAM or HAM (non-SPAM) by zero-thresholding their logit estimates. We tested the approach on a publicly available SMS corpus collected from the UK. Used in semi-supervised fashion, the approach clearly outperformed a competing algorithm, SemiBoost. Used in unsupervised fashion, the approach outperformed a fully supervised classifier, an SVM (Support Vector Machine), when the number of training messages used by the SVM was small and performed comparably otherwise. We believe the approach works well and is a useful tool for SMS SPAM detection.
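
To make the zero-thresholding step concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simple bag-of-words model in which each class has its own word distribution. The names `PRIOR_SPAM`, `WORD_PROBS`, `logit_spam`, and `classify`, and all numeric values, are hypothetical; in the approach described above, the corresponding parameters would be estimated by Gibbs sampling rather than fixed by hand.

```python
import math

# Hypothetical toy parameters (assumptions for illustration only).
# In the described approach these would come from Gibbs sampling
# over an unlabeled or partially labeled training corpus.
PRIOR_SPAM = 0.2
WORD_PROBS = {
    "spam": {"free": 0.5, "win": 0.3, "lunch": 0.1, "today": 0.1},
    "ham": {"free": 0.1, "win": 0.1, "lunch": 0.4, "today": 0.4},
}


def logit_spam(tokens):
    """Return the log-odds of SPAM versus HAM for a tokenized message.

    Words missing from the toy vocabulary are skipped.
    """
    score = math.log(PRIOR_SPAM) - math.log(1.0 - PRIOR_SPAM)
    for tok in tokens:
        if tok in WORD_PROBS["spam"]:
            score += math.log(WORD_PROBS["spam"][tok])
            score -= math.log(WORD_PROBS["ham"][tok])
    return score


def classify(tokens):
    """Zero-threshold the logit: positive means SPAM, otherwise HAM."""
    return "SPAM" if logit_spam(tokens) > 0.0 else "HAM"


if __name__ == "__main__":
    print(classify(["free", "win"]))     # SPAM
    print(classify(["lunch", "today"]))  # HAM
```

With these toy values, an empty message is labeled HAM because the prior log-odds are negative; only sufficiently SPAM-like words push the logit above zero.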

Type
Articles
Copyright
Copyright © Cambridge University Press 2014 

References

Almeida, T., Gomez Hidalgo, J. M., and Silva, T. P. 2013. Towards SMS spam filtering: results under a new dataset. International Journal of Information Security Science 2 (1): 1–18.
Almeida, T., Gomez Hidalgo, J. M., and Yamakami, A. 2011. Contribution to the study of SMS SPAM filtering: new collection and results. In Proceedings of the ACM Symposium on Document Engineering. Mountain View, CA USA: Association for Computing Machinery.
Balaguer, E., and Rosso, P. 2011. Detection of near-duplicate user generated contents: the SMS spam collection. In Proceedings of the International Workshop on Search and Mining User-Generated Contents. Glasgow, UK: Association for Computing Machinery.
Blanzieri, B., and Bryl, A. 2008. A survey of learning-based techniques of email SPAM filtering. Technical Report DIT-06-056, Information Engineering and Computer Science Department, University of Trento.
Cloudmark (Online; accessed 1-April-2013) Mobile Messaging Security Solutions. www.cloudmark.com/en/industries/mobile/solutions
Cormack, G., Gomez Hidalgo, J. M., and Puertas Sanz, E. 2007. Feature engineering for mobile (SMS) SPAM filtering. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval. Amsterdam, Netherlands: Association for Computing Machinery.
Coskun, B., and Giura, P. 2012. Mitigating SMS spam by online detection of repetitive near-duplicate messages. In Proceedings of the IEEE Communication and Information Systems Security Symposium. Ottawa, Ontario, Canada: Institute for Electrical and Electronic Engineers.
Delany, S., Buckley, M., and Greene, D. 2012. SMS SPAM filtering: methods and data. Expert Systems with Applications 39 (10): 9899–9908.
Gomez Hidalgo, J. M., Cajigas Bringas, G., Puertas Sanz, E., and Carrero Garcia, F. 2006. Content based SMS SPAM filtering. In Proceedings of the ACM Symposium on Document Engineering. Amsterdam, Netherlands: Association for Computing Machinery.
Goodman, J., Cormack, G., and Heckerman, D. 2007. SPAM and the ongoing battle for the inbox. Communications of the ACM 50 (2): 24–33.
Gunal, S., Ergin, S., and Gunal, E. 2012. A novel framework for SMS SPAM filtering. In Proceedings of the International Symposium on Innovations in Intelligent Systems and Applications. Trabzon, Turkey: Institute for Electrical and Electronic Engineers.
Huffington Post (Online; accessed 1-April-2013) SMS Fraud: 95m Spam Text Messages Sent Per Day, Up 300% In 12 Months. www.huffingtonpost.co.uk/2012/05/28/sms-fraud-95m-spam-text-m_n_1550193.html
Langford, J. 2006. A tutorial on practical prediction theory for classification. Journal of Machine Learning Research 6: 273–306.
Mackay, D., and Peto, L. 1995. A hierarchical dirichlet language model. Natural Language Engineering 1 (3): 1–19.
Mallapragada, P., Jin, R., Jain, A., and Liu, Y. 2009. SemiBoost: boosting for semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (11): 2000–2014.
Nuruzzaman, M., Lee, C., Abdullah, F., and Choi, D. 2012. Simple SMS SPAM filtering on independent mobile phone. Security and Communication Networks 5 (10): 1209–1220.
Pan, S. J., and Yang, Q. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10): 1345–1359.
Qian, F., Pathak, A., Hu, Y., Mao, Z., and Xie, Y. 2010. A case for unsupervised-learning-based SPAM filtering. In Proceedings of ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). New York, New York USA: Association for Computing Machinery.
Resnik, P., and Hardisty, E. 2010. Gibbs sampling for the uninitiated. Technical Report LAMP-TR-153. Language and Media Processing Laboratory, University of Maryland College Park. College Park, Maryland USA.
Settles, B. 2009. Active learning literature survey. Technical Report 1648. Computer Sciences Department, University of Wisconsin. Madison, Wisconsin USA.
Sohn, D., Lee, J., and Rim, H. 2009. The contribution of stylistic information to content-based mobile SPAM filtering. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. Suntec, Singapore: Association for Computational Linguistics.
Tagg, C. 2009. A corpus linguistics study of SMS text messaging. PhD thesis, Department of English, University of Birmingham, UK.
The International Telecommunication Union (Online; accessed 1-April-2013) The World in 2010: ICT Facts and Figures. www.itu.int/ITUD/ict/facts/2011/material/ICTFactsFigures2010.pdf
Uemura, T., Ikeda, D., Kida, T., and Arimura, H. 2011. Unsupervised SPAM detection by document probability estimation with maximal overlap method. Information and Media Technologies 6 (1): 231–240.
Wang, C., Zhang, Y., Chen, X., Liu, Z., Shi, L., Chen, G., Qiu, F., Ying, C., and Lu, W. 2010. A behavior-based SMS antispam system. IBM Journal of Research & Development 54 (6): 3:1–3:16.
Xu, Q., Xiang, E., Qiang, Y., Du, J., and Zhong, J. 2012. SMS SPAM detection using non-content features. IEEE Intelligent Systems Magazine 27 (6): 44–51.
Yadav, K., Kumaraguru, P., Goyal, A., Gupta, A., and Naik, V. 2011. SMSAssassin: crowdsourcing driven mobile-based system for SMS SPAM filtering. In Proceedings of the International Workshop on Mobile Computing Systems and Applications. Phoenix, Arizona USA: Association for Computing Machinery.