Authorship analysis of aliases: Does topic influence accuracy?

ROBERT LAYTON; PAUL A. WATTERS; RICHARD DAZELEY

doi:10.1017/S1351324913000272


Published online by Cambridge University Press: 08 October 2013


ROBERT LAYTON and PAUL A. WATTERS: Internet Commerce Security Laboratory, University of Ballarat, Australia. E-mails: r.layton@icsl.com.au, p.watters@ballarat.edu.au
RICHARD DAZELEY: Data Mining and Informatics Research Group, University of Ballarat, Australia. E-mail: r.dazeley@ballarat.edu.au


Abstract

Aliases play an important role in online environments by facilitating anonymity, but can also be used to hide the identity of cybercriminals. Previous studies have investigated this alias matching problem in an attempt to identify whether two aliases are shared by an author, which can assist with identifying users. Those studies create their training data by randomly splitting the documents associated with an alias into two sub-aliases. Models have been built that can regularly achieve over 90% accuracy for recovering the linkage between these ‘random sub-aliases’. In this paper, random sub-alias generation is shown to enable these high accuracies, and thus does not adequately model the real-world problem. In contrast, creating sub-aliases using topic-based splitting drastically reduces the accuracy of all authorship methods tested. We then present a methodology that can be performed on non-topic-controlled datasets to produce topic-based sub-aliases that are more difficult to match. Finally, we present an experimental comparison between many authorship methods to see which methods better match aliases under these conditions, finding that local n-gram methods perform better than others.
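The difference between the two ways of building sub-aliases can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how topics are obtained on non-topic-controlled data, so the sketch simply assumes each document already carries a topic label. The function names `random_split` and `topic_split`, the `topic_of` callback, and the sample `posts` are all hypothetical.

```python
import random
from collections import defaultdict


def random_split(docs, seed=0):
    """Shuffle one alias's documents and cut them into two sub-aliases."""
    rng = random.Random(seed)
    shuffled = list(docs)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def topic_split(docs, topic_of):
    """Assign whole topics to one sub-alias or the other.

    Assumption: topic_of(doc) returns a sortable topic label (e.g. a string).
    The two sub-aliases returned never share a topic.
    """
    by_topic = defaultdict(list)
    for doc in docs:
        by_topic[topic_of(doc)].append(doc)

    sub_a, sub_b = [], []
    # Largest topics first, each going to whichever sub-alias is smaller,
    # which keeps the two sides roughly balanced in size.
    for topic in sorted(by_topic, key=lambda t: (-len(by_topic[t]), t)):
        target = sub_a if len(sub_a) <= len(sub_b) else sub_b
        target.extend(by_topic[topic])
    return sub_a, sub_b


# Hypothetical posts by a single alias, stored as (topic, text) pairs.
posts = [
    ("sport", "great match last night"),
    ("sport", "who is playing on friday"),
    ("music", "that new album is superb"),
    ("music", "any gigs coming up soon"),
    ("film", "the sequel was a letdown"),
]

rand_a, rand_b = random_split(posts)
topic_a, topic_b = topic_split(posts, topic_of=lambda post: post[0])
```

With a random split, both sub-aliases usually contain posts on the same topics, so a matcher can rely on shared vocabulary as well as writing style. With the topic-based split the two sides have no topic in common, which is the harder condition the abstract describes.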

Type: Articles
Information: Natural Language Engineering, Volume 21, Issue 4, August 2015, pp. 497–518

DOI: https://doi.org/10.1017/S1351324913000272
Copyright: Copyright © Cambridge University Press 2013


