(Un/Semi-)supervised SMS text message SPAM detection

CHRIS R. GIANNELLA; RANSOM WINDER; BRANDON WILSON

doi:10.1017/S1351324914000102

Published online by Cambridge University Press: 15 October 2014

CHRIS R. GIANNELLA and RANSOM WINDER: Affiliation:
The MITRE Corporation, 7515 Colshire Drive, McLean, VA 22102, USA email: cgiannella@mitre.org, rwinder@mitre.org
BRANDON WILSON: Affiliation:
Department of Computer Science, University of Maryland, College Park, MD 20742, USA email: bswilson@cs.umd.edu

Abstract

We address the problem of unsupervised and semi-supervised SMS (Short Message Service) text message SPAM detection. We develop a content-based Bayesian classification approach that is a modest extension of the technique discussed by Resnik and Hardisty (2010). The approach assumes that the bodies of the SMS messages arise from a probabilistic generative model and estimates the model parameters by Gibbs sampling using an unlabeled, or partially labeled, SMS training message corpus. The approach classifies new SMS messages as SPAM or HAM (non-SPAM) by zero-thresholding their logit estimates. We tested the approach on a publicly available SMS corpus collected from the UK. Used in semi-supervised fashion, the approach clearly outperformed a competing algorithm, SemiBoost. Used in unsupervised fashion, the approach outperformed a fully supervised classifier, an SVM (Support Vector Machine), when the number of training messages used by the SVM was small, and performed comparably otherwise. We believe the approach works well and is a useful tool for SMS SPAM detection.
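
The zero-thresholding step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a multinomial naive Bayes model with fixed point estimates of the parameters, which may differ from the authors' actual model and from how they combine Gibbs samples. The function names, the toy vocabulary, and all probability values below are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def logit_spam(tokens, prior_spam, theta_spam, theta_ham):
    """Log-odds that a message is SPAM under an assumed multinomial
    naive Bayes model with known (point-estimate) parameters."""
    # Prior log-odds of SPAM versus HAM.
    score = math.log(prior_spam) - math.log(1.0 - prior_spam)
    for word, count in Counter(tokens).items():
        # Words outside the (hypothetical) vocabulary are ignored.
        if word in theta_spam and word in theta_ham:
            score += count * (math.log(theta_spam[word]) - math.log(theta_ham[word]))
    return score


def classify(tokens, prior_spam, theta_spam, theta_ham):
    """Zero-threshold the logit: positive means SPAM, otherwise HAM."""
    if logit_spam(tokens, prior_spam, theta_spam, theta_ham) > 0.0:
        return "SPAM"
    return "HAM"


# Hypothetical per-class word probabilities (each sums to 1 over the toy vocabulary).
THETA_SPAM = {"free": 0.5, "prize": 0.3, "lunch": 0.2}
THETA_HAM = {"free": 0.2, "prize": 0.1, "lunch": 0.7}

if __name__ == "__main__":
    print(classify(["free", "prize"], 0.5, THETA_SPAM, THETA_HAM))
    print(classify(["lunch"], 0.5, THETA_SPAM, THETA_HAM))
```

In the setting the abstract describes, the parameters would be estimated by Gibbs sampling from unlabeled or partially labeled messages rather than fixed by hand as they are here.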

Type: Articles
Information: Natural Language Engineering, Volume 21, Issue 4, August 2015, pp. 553–567

DOI: https://doi.org/10.1017/S1351324914000102
Copyright: Copyright © Cambridge University Press 2014

References

Almeida, T., Gomez Hidalgo, J. M., and Silva, T. P. 2013. Towards SMS spam filtering: results under a new dataset. International Journal of Information Security Science 2 (1): 1–18.

Almeida, T., Gomez Hidalgo, J. M., and Yamakami, A. 2011. Contribution to the study of SMS SPAM filtering: new collection and results. In Proceedings of the ACM Symposium on Document Engineering. Mountain View, CA USA: Association for Computing Machinery.

Balaguer, E., and Rosso, P. 2011. Detection of near-duplicate user generated contents: the SMS spam collection. In Proceedings of the International Workshop on Search and Mining User-Generated Contents. Glasgow, UK: Association for Computing Machinery.

Blanzieri, B., and Bryl, A. 2008. A survey of learning-based techniques of email SPAM filtering. Technical Report DIT-06-056, Information Engineering and Computer Science Department, University of Trento.

Cloudmark (Online; accessed 1-April-2013) Mobile Messaging Security Solutions. www.cloudmark.com/en/industries/mobile/solutions

Cormack, G., Gomez Hidalgo, J. M., and Puertas Sanz, E. 2007. Feature engineering for mobile (SMS) SPAM filtering. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval. Amsterdam, Netherlands: Association for Computing Machinery.

Coskun, B., and Giura, P. 2012. Mitigating SMS spam by online detection of repetitive near-duplicate messages. In Proceedings of the IEEE Communication and Information Systems Security Symposium. Ottawa, Ontario, Canada: Institute for Electrical and Electronic Engineers.

Delany, S., Buckley, M., and Greene, D. 2012. SMS SPAM filtering: methods and data. Expert Systems with Applications 39 (10): 9899–9908.

Gomez Hidalgo, J. M., Cajigas Bringas, G., Puertas Sanz, E., and Carrero Garcia, F. 2006. Content based SMS SPAM filtering. In Proceedings of the ACM Symposium on Document Engineering. Amsterdam, Netherlands: Association for Computing Machinery.

Goodman, J., Cormack, G., and Heckerman, D. 2007. SPAM and the ongoing battle for the inbox. Communications of the ACM 50 (2): 24–33.

Gunal, S., Ergin, S., and Gunal, E. 2012. A novel framework for SMS SPAM filtering. In Proceedings of the International Symposium on Innovations in Intelligent Systems and Applications. Trabzon, Turkey: Institute for Electrical and Electronic Engineers.

Huffington Post (Online; accessed 1-April-2013) SMS Fraud: 95m Spam Text Messages Sent Per Day, Up 300% In 12 Months. www.huffingtonpost.co.uk/2012/05/28/sms-fraud-95m-spam-text-m_n_1550193.html

Langford, J. 2006. A tutorial on practical prediction theory for classification. Journal of Machine Learning Research 6: 273–306.

Mackay, D., and Peto, L. 1995. A hierarchical Dirichlet language model. Natural Language Engineering 1 (3): 1–19.

Mallapragada, P., Jin, R., Jain, A., and Liu, Y. 2009. SemiBoost: boosting for semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (11): 2000–2014.

Nuruzzaman, M., Lee, C., Abdullah, F., and Choi, D. 2012. Simple SMS SPAM filtering on independent mobile phone. Security and Communication Networks 5 (10): 1209–1220.

Pan, S. J., and Yang, Q. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10): 1345–1359.

Qian, F., Pathak, A., Hu, Y., Mao, Z., and Xie, Y. 2010. A case for unsupervised-learning-based SPAM filtering. In Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). New York, New York USA: Association for Computing Machinery.

Resnik, P., and Hardisty, E. 2010. Gibbs sampling for the uninitiated. Technical Report LAMP-TR-153. Language and Media Processing Laboratory, University of Maryland College Park. College Park, Maryland USA.

Settles, B. 2009. Active learning literature survey. Technical Report 1648. Computer Sciences Department, University of Wisconsin. Madison, Wisconsin USA.

Sohn, D., Lee, J., and Rim, H. 2009. The contribution of stylistic information to content-based mobile SPAM filtering. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. Suntec, Singapore: Association for Computational Linguistics.

Tagg, C. 2009. A corpus linguistics study of SMS text messaging. PhD thesis, Department of English, University of Birmingham, UK.

The International Telecommunication Union (Online; accessed 1-April-2013) The World in 2010: ICT Facts and Figures. www.itu.int/ITUD/ict/facts/2011/material/ICTFactsFigures2010.pdf

Uemura, T., Ikeda, D., Kida, T., and Arimura, H. 2011. Unsupervised SPAM detection by document probability estimation with maximal overlap method. Information and Media Technologies 6 (1): 231–240.

Wang, C., Zhang, Y., Chen, X., Liu, Z., Shi, L., Chen, G., Qiu, F., Ying, C., and Lu, W. 2010. A behavior-based SMS antispam system. IBM Journal of Research & Development 54 (6): 3:1–3:16.

Xu, Q., Xiang, E., Qiang, Y., Du, J., and Zhong, J. 2012. SMS SPAM detection using non-content features. IEEE Intelligent Systems Magazine 27 (6): 44–51.

Yadav, K., Kumaraguru, P., Goyal, A., Gupta, A., and Naik, V. 2011. SMSAssassin: crowdsourcing driven mobile-based system for SMS SPAM filtering. In Proceedings of the International Workshop on Mobile Computing Systems and Applications. Phoenix, Arizona USA: Association for Computing Machinery.
