Evaluating authorship distance methods using the positive Silhouette coefficient

ROBERT LAYTON; PAUL WATTERS; RICHARD DAZELEY

doi:10.1017/S1351324912000241

Published online by Cambridge University Press: 28 September 2012

ROBERT LAYTON: Affiliation:
Internet Commerce Security Laboratory, University of Ballarat, Ballarat VIC, Australia e-mail: r.layton@icsl.com.au
PAUL WATTERS: Affiliation:
Internet Commerce Security Laboratory, University of Ballarat, Ballarat VIC, Australia e-mail: r.layton@icsl.com.au
RICHARD DAZELEY: Affiliation:
Data Mining and Informatics Research Group, University of Ballarat, Ballarat VIC, Australia e-mails: p.watters@ballarat.edu.au, r.dazeley@ballarat.edu.au

Abstract

Unsupervised Authorship Analysis (UAA) aims to cluster documents by authorship without knowing the authorship of any of the documents. An important factor in UAA is the method for calculating the distance between documents. The choice of authorship distance method is considered more critical to the end result than the choice of cluster analysis algorithm. One method for measuring the correlation between a distance metric and a labelling (such as class values or clusters) is the Silhouette Coefficient (SC). The SC can be used to measure the correlation between the authorship distance method and the true authorship, and so to evaluate the quality of the distance method. However, we show that the SC can be severely affected by outliers. To address this issue, we introduce the Positive Silhouette Coefficient (PSC), given as the proportion of instances with a positive SC value. This metric is not easily altered by outliers and is therefore more robust. A large number of authorship distance methods are then compared using the PSC, and the findings are presented. This research provides an insight into the efficacy of methods for UAA and presents a framework for testing authorship distance methods.
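
The two measures described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard per-instance silhouette of Rousseeuw (1987), s(i) = (b - a) / max(a, b), computed from a precomputed distance matrix, takes the SC as the mean of these values, and takes the Positive Silhouette Coefficient as the fraction of instances with s(i) > 0. The function names and the toy one-dimensional "documents" are hypothetical; in practice the matrix would come from an authorship distance method and the labels from the true authors.

```python
from statistics import mean


def silhouette_values(dist, labels):
    """Per-instance silhouette s(i) = (b - a) / max(a, b).

    dist   -- square matrix (list of lists) of pairwise distances
    labels -- one label per instance, e.g. the true author
    """
    n = len(labels)
    values = []
    for i in range(n):
        # Group the distances from instance i to all others by label.
        by_label = {}
        for j in range(n):
            if j != i:
                by_label.setdefault(labels[j], []).append(dist[i][j])
        own = by_label.pop(labels[i], [])
        if not own or not by_label:
            # Singleton group, or only one label present: s(i) = 0
            # by the usual convention.
            values.append(0.0)
            continue
        a = mean(own)                                # mean distance within own group
        b = min(mean(d) for d in by_label.values())  # nearest other group
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def silhouette_coefficient(dist, labels):
    """SC: mean of the per-instance silhouette values."""
    return mean(silhouette_values(dist, labels))


def positive_silhouette_coefficient(dist, labels):
    """Assumed reading of the abstract: fraction of instances with s(i) > 0."""
    values = silhouette_values(dist, labels)
    return sum(1 for s in values if s > 0) / len(values)


# Toy data: six "documents" on a line, with absolute difference standing in
# for an authorship distance method (an assumption for illustration only).
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = [[abs(p - q) for q in points] for p in points]
authors = ["A", "A", "A", "B", "B", "B"]

print(silhouette_coefficient(dist, authors))
print(positive_silhouette_coefficient(dist, authors))
```

Because the PSC only counts the sign of each s(i), a single instance with a very negative silhouette lowers it by at most 1/n, whereas the same instance can shift the mean-based SC by a much larger amount.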

Type: Articles
Information: Natural Language Engineering, Volume 19, Issue 4, October 2013, pp. 517–535

DOI: https://doi.org/10.1017/S1351324912000241
Copyright: © Cambridge University Press 2012
