Silhouette + attraction: A simple and effective method for text clustering†

MARCELO L. ERRECALDE; LETICIA C. CAGNINA; PAOLO ROSSO

doi:10.1017/S1351324915000273

Silhouette + attraction: A simple and effective method for text clustering†

Published online by Cambridge University Press: 14 August 2015

MARCELO L. ERRECALDE ,

LETICIA C. CAGNINA and

PAOLO ROSSO

Show author details

MARCELO L. ERRECALDE: Affiliation:
LIDIC, Universidad Nacional de San Luis, San Luis, Argentina e-mail: merreca@unsl.edu.ar, lcagnina@gmail.com
LETICIA C. CAGNINA: Affiliation:
LIDIC, Universidad Nacional de San Luis, San Luis, Argentina e-mail: merreca@unsl.edu.ar, lcagnina@gmail.com
PAOLO ROSSO: Affiliation:
NLE Lab, PRHLT Research Center, Universitat Politècnica de València, Valencia, España e-mail: prosso@dsic.upv.es

Article contents

Abstract
Footnotes
References

Get access

Rights & Permissions

Abstract

This article presents silhouette–attraction (Sil–Att), a simple and effective method for text clustering, which is based on two main concepts: the silhouette coefficient and the idea of attraction. The combination of both principles allows us to obtain a general technique that can be used either as a boosting method, which improves results of other clustering algorithms, or as an independent clustering algorithm. The experimental work shows that Sil–Att is able to obtain high-quality results on text corpora with very different characteristics. Furthermore, its stable performance on all the considered corpora is indicative that it is a very robust method. This is a very interesting positive aspect of Sil–Att with respect to the other algorithms used in the experiments, whose performances heavily depend on specific characteristics of the corpora being considered.

Type: Articles
Information: Natural Language Engineering , Volume 22 , Issue 5 , September 2016 , pp. 687 - 726

DOI: https://doi.org/10.1017/S1351324915000273 [Opens in a new window]
Copyright: Copyright © Cambridge University Press 2015

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

Footnotes

†

This research work has been partially funded by UNSL, CONICET (Argentina), DIANA-APPLICATIONS-Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) research project, and the WIQ-EI IRSES project (grant no. 269180) within the FP 7 Marie Curie People Framework on Web Information Quality Evaluation Initiative. The work of the third author was done also in the framework of the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.

References

Alexandrov, M., Gelbukh, A., and Rosso, P. 2005. An approach to clustering abstracts. In Montoyo, A., Muñoz, R., and Métais, E. (eds.), Natural Language Processing and Information Systems, vol. 3513, pp. 275–85. Lecture Notes in Computer Science. Germany: Springer-Verlag Berlin Heidelberg,.Google Scholar

Aranganayagi, S., and Thangavel, K., 2007. Clustering categorical data using silhouette coefficient as a relocating measure. In Proceedings of the International Conference on Computational Intelligence and Multimedia Applications (ICCIMA-2007), vol. 02, California, USA: IEEE Computer Society, pp. 13–7.Google Scholar

Azzag, H., Monmarche, N., Slimane, M., Venturini, G., and Guinot, C., 2003. AntTree: a new model for clustering with artificial ants. In Proceedings of the IEEE Conference on Evolutionary Computation (CEC-2003), vol. 04, New Jersey, USA: IEEE Press, pp. 2642–7.Google Scholar

Banerjee, S., Ramanathan, K., and Gupta, A., 2007. Clustering short texts using Wikipedia. In Proceedings of the 30th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR-2007), New York, USA: ACM, pp. 787–8.Google Scholar

Barnette, J. J., and McLean, J. E., 1998. The Tukey honestly significant difference procedure and its control of the type I error-rate. In Proceedings of 27 Annual Meeting of the Mid-South Educational Research Association, John Petry. Louisiana, USA, pp. 2–15.Google Scholar

Berry, M. W. (ed.), 2003. Survey of Text Mining: Clustering, Classification, and Retrieval. New York, USA: Springer Science+Business Media, Inc.Google Scholar

Bezdek, J. C., and Pal, N. R., 1995. Cluster validation with generalized Dunn’s indices. In Proceedings of the Second New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems, California, USA: IEEE Computer Society Press, pp. 190–3.Google Scholar

Bonato dos Santos, J., Heuser, C., Pereira Moreira, V., and Krug Wives, L., 2011. Automatic threshold estimation for data matching applications. Information Sciences 181 (13): 2685–99.Google Scholar

Brun, M., Sima, C., Hua, J., Lowey, J., Carroll, B., Suh, E., and Dougherty, E. R., 2007. Model-based evaluation of clustering validation measures. Pattern Recognition 40 (3): 807–24.CrossRef Google Scholar

Cagnina, L., Errecalde, M., Ingaramo, D., and Rosso, P., 2008. A discrete particle swarm optimizer for clustering short-text corpora. In Proceedings of the International Conference on Bioinspired Optimization Methods and their Applications (BIOMA08), Ljubljana, Slovenia: Jozef Stefan Institute, pp. 93–103.Google Scholar

Cagnina, L., Errecalde, M., Ingaramo, D., and Rosso, P., 2014. An efficient particle swarm optimization approach to cluster short texts. Information Sciences 265 : 36–49.CrossRef Google Scholar

Carullo, M., Binaghi, E., and Gallo, I., 2009. An online document clustering technique for short web contents. Pattern Recognition Letters 30 (10): 870–6.Google Scholar

Charikar, M., Chekuri, C., Feder, T., and Motwani, R., 2004. Incremental clustering and dynamic information retrieval. SIAM Journal on Computing 33 (6): 1417–40.Google Scholar

Choi, M. J., Tan, V. Y. F., Anandkumar, A., and Willsky, A. S., 2011. Learning latent tree graphical models. Journal of Machine Learning Research 12 : 1771–812.Google Scholar

Cutting, D. R., Karger, D. R., Pedersen, J. O., and Tukey, J. W., 1992. Scatter/gather: a cluster-based approach to browsing large document collections. In Proceedings of the 15th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR 2002), Interface Design and Display, New York, USA: ACM, pp. 318–29.Google Scholar

Davies, D. L., and Bouldin, D. W., 1979. A cluster separation measure. IEEE Transactions on Analysis and Machine Intelligence 1 (2): 224–7.Google Scholar

Dunn, J. C., 1974. Well separated clusters and optimal fuzzy-partitions. Journal of Cybernetics 4 (1): 95–104.Google Scholar

Errecalde, M., Ingaramo, D., and Rosso, P., 2008. Proximity estimation and hardness of short-text corpora. In 19th International Workshop on Database and Expert Systems Application, California, USA: IEEE Computer Society, pp. 15–9.Google Scholar

Errecalde, M., Ingaramo, D., and Rosso, P., 2010. ITSA*: an effective iterative method for short-text clustering tasks. In Proceedings of the 23rd International Conference on Industrial Engineering and other Applications of Applied Intelligent Systems, IEA/AIE 2010, vol. 6096, Lecture Notes in Computer Science, Germany: Springer-Verlag Berlin, Heidelberg., pp. 550–9.Google Scholar

Fisher, R., 1925. Statistical Methods for Research Workers. Edinburgh, Scotland: Oliver & Boyd.Google Scholar

Hasan, M. A., Chaoji, V., Salem, S., and Zaki, M. J., 2009. Robust partitional clustering by outlier and density insensitive seeding. Pattern Recognition Letters 30 (11): 994–1002.Google Scholar

He, H., Chen, B., Xu, W., and Guo, J., 2007. Short text feature extraction and clustering for web topic mining. In Proceedings of the 3rd International Conference on Semantics, Knowledge and Grid, California, USA: IEEE Computer Society, pp. 382–5.Google Scholar

Hearst, M. A., 2006. Clustering versus faceted categories for information exploration. Communications of the ACM 49 (4): 59–61.Google Scholar

Hu, X., Sun, N., Zhang, C., and Chua, T., 2009. Exploiting internal and external semantics for the clustering of short texts using world knowledge. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, New York, USA: ACM, pp. 919–28.CrossRef Google Scholar

Ingaramo, D., Errecalde, M., Cagnina, L., and Rosso, P. 2009. Particle Swarm Optimization for clustering short-text corpora. In Computational Intelligence and Bioengineering, pp. 3–19. Amsterdam, The Netherlands: IOS Press.Google Scholar

Ingaramo, D., Errecalde, M., and Rosso, P. 2010a. A general bio-inspired method to improve the short-text clustering task. In Proceedings of 11th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2010, vol. 6008, pp. 661–72. Lecture Notes in Computer Science. Germany: Springer-Verlag Berlin Heidelberg.Google Scholar

Ingaramo, D., Errecalde, M., and Rosso, P., 2010b. A new AntTree-based algorithm for clustering short-text corpora. Journal of Computer Science & Technology 10 (1): 1–7.Google Scholar

Ingaramo, D., Leguizamón, G., and Errecalde, M., 2005. Adaptive clustering with artificial ants. Journal of Computer Science & Technology 5 (4): 264–71.Google Scholar

Ingaramo, D., Pinto, D., Rosso, P., and Errecalde, M. 2008. Evaluation of internal validity measures in short-text corpora. In Proceedings of the International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2008, vol. 4919, pp. 555–67. Lecture Notes in Computer Science. Germany: Springer-Verlag Berlin Heidelberg.Google Scholar

Ingaramo, D., Rosas, M. V., Errecalde, M., and Rosso, P., 2011. Clustering iterativo de textos cortos con representaciones basadas en concepto. Procesamiento del Lenguaje Natural 46 : 19–26.Google Scholar

Jiang, D., Pei, J., and Zhang, A., 2003. DHC: a density-based hierarchical clustering method for time series gene expression data. In Proceedings 3rd IEEE Symposium on BioInformatics and BioEngineering (BIBE 2003), California, USA: IEEE Computer Society, pp. 393–400.Google Scholar

Jing, L. 2005. Survey of text clustering. Technical report. Department of Mathematics. The University of Hong Kong, Hong Kong, China.Google Scholar

Karthikeyan, T., Peter, S. J., and Chidambaranathan, S., 2011. Hybrid algorithm for noise-free high density clusters with self-detection of best number of clusters. International Journal of Hybrid Information Technology 4 (2): 39–54.Google Scholar

Karypis, G., Han, E., and Kumar, V., 1999. Chameleon: hierarchical clustering using dynamic modeling. IEEE Computer 32 (8): 68–75.Google Scholar

Kaufman, L., and Rousseeuw, P. J. 1990. Finding Groups in Data: An Introduction to Cluster Analysis, New York, USA: John Wiley and Sons.Google Scholar

Kruskal, W. H., and Wallis, W. A., 1952. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association 47 (260): 583–621.Google Scholar

Kyriakopoulou, A. 2008. Text classification aided by clustering: a literature review. In Tools in Artificial Intelligence, pp. 233–52. Rijeka, Croatia: INTECH.Google Scholar

Larocca Neto, J., Santos, A. D., Kaestner, C. A. A., and Freitas, A. A. 2000. Document clustering and text summarization. In Mackin, N. (ed.), Proceedings of the 4th International Conference Practical Applications of Knowledge Discovery and Data Mining (PADD-2000), London, England: The Practical Application Company, pp. 41–55.Google Scholar

Levene, H. 1960. Robust tests for equality of variances. In Contributions to Probability and Statistics, pp. 278–92. California, USA: Stanford University Press.Google Scholar

Liu, X., and Croft, W. B. 2004. Cluster-based retrieval using language models. In Sanderson, M., Jrvelin, K., Allan, J., and Bruza, P. (eds.), Proceedings of the 27th annual International ACM Conference on Research and Development in Information Retrieval (SIGIR 2004), New York, USA: ACM, pp. 186–93.Google Scholar

MacQueen, J. B., 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, California, USA: University of California Press, pp. 281–97.Google Scholar

Makagonov, P., Alexandrov, M., and Gelbukh, A. 2004. Clustering abstracts instead of full texts. In Text, Speech and Dialogue, 10th International Conference (TSD 2004), vol. 3206, pp. 129–35. Lecture Notes in Artificial Intelligence, Germany: Springer-Verlag Berlin Heidelberg.Google Scholar

Manning, C. D., Raghavan, P., and Schutze, H., 2008. Introduction to Information Retrieval. Cambridge, England: Cambridge University Press.Google Scholar

Ng, A. Y., Jordan, M. I., and Weiss, Y. 2001. On spectral clustering: analysis and an algorithm. In The Proceedings of the 2001 Neural Information Processing Systems (NIPS) Conference. vol. 14, pp. 849–56. Advances in Neural Information Processing Systems. Massachusetts, USA: MIT Press.Google Scholar

Ng, R. T., and Han, J., 1994. Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB ’94), California, USA: Morgan Kaufmann Publishers Inc., pp. 144–55.Google Scholar

Paterlini, A. A., Nascimento, M. A., and Traina, C. Jr, 2011. Using pivots to speed-up k-medoids clustering. Journal of Information and Data Management 2 (2): 221–36.Google Scholar

Pinto, D., Benedí, J. M., and Rosso, P. 2007. Clustering narrow-domain short texts by using the Kullback-Leibler distance. In Computational Linguistics and Intelligent Text Processing, 8th International Conference (CICLing 2007), vol. 4394, pp. 611–22. Lecture Notes in Computer Science, Germany: Springer-Verlag Berlin Heidelberg.Google Scholar

Pinto, D., and Rosso, P. 2007. On the relative hardness of clustering corpora. In Text, Speech and Dialogue, 10th International Conference (TSD 2007), vol. 4629, pp. 155–61. Lecture Notes in Computer Science, Germany: Springer-Verlag Berlin Heidelberg.CrossRef Google Scholar

Pons-Porrata, A., Berlanga-Llavori, R., and Ruiz-Shulcloper, J., 2007. Topic discovery based on text mining techniques. Information Processing and Management 43 (3): 752–68.Google Scholar

Popova, S., and Khodyrev, I. 2011. Local theme detection and annotation with keywords for narrow and wide domain short text collections. In Proceedings of the 5th International Conference on Advances in Semantic Processing (SEMAPRO 2011). Delaware, USA: International Academy, Reseach, and Industry Association (IARIA), pp. 49–55.Google Scholar

Qi, G.-J., Aggarwal, C., and Huang, T., 2012. Community detection with edge content in social media networks. In IEEE 28th International Conference on Data Engineering (ICDE 2012), California, USA: IEEE Computer Society - Conference Publishing Services (CPS), pp. 534–45.Google Scholar

Rousseeuw, P., 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20 (1): 53–65.Google Scholar

Selim, S. Z., and Alsultan, K., 1991. A simulated annealing algorithm for the clustering problem. Pattern Recognition 24 (10): 1003–8.Google Scholar

Shannon, C. E., 1948. A mathematical theory of communication. The Bell system technical journal 27 : 379–423.CrossRef Google Scholar

Slonim, N., Friedman, N., and Tishby, N., 2002. Unsupervised document classification using sequential information maximization. In Proceedings of the 25th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR 2002), New York, USA: ACM, pp. 129–36.Google Scholar

Spearman, C., 1904. The proof and measurement of association between two things. The American Journal of Psychology 15 (1): 72–101.Google Scholar

Stein, B., and Busch, M., 2005. Density-based cluster algorithms in low-dimensional and high-dimensional applications. In 2nd International Workshop on Text-Based Information Retrieval (TIR 2005), Fachberichte Informatik, Mainz, Germany: University of Koblenz-Landau, pp. 45–56.Google Scholar

Stein, B., and Meyer zu Eissen, S. 2002. Document categorization with MajorClust. In Proceedings Workshop on Issues in the Theory of Security (WITS 2002), Barcelona, Spain: Technical University of Barcelona, pp. 91–6.Google Scholar

Stein, B., and Meyer zu Eissen, S., 2004. Topic identification: framework and application. In Proceedings of the International Conference on Knowledge Management (I-KNOW 2004), Graz, Austria: Know-Center, pp. 353–60.Google Scholar

Stein, B., Meyer zu Eissen, S., and Wißbrock, F., 2003. On cluster validity and the information need of users. In 3rd International Conference on Artificial Intelligence and Applications (AIA 2003), Benalmádena, Spain: ACTA Press, pp. 216–21.Google Scholar

Steinbach, M., Karypis, G., and Kumar, V., 2000. A comparison of document clustering techniques. In Working Notes - Workshop on Text Mining. The 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Massachusetts, USA, pp. 109–11.Google Scholar

Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., and Varga, D., 2006. The JRC-Acquis: a multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), Paris, France: European Language Resources Association (ELRA), pp. 2142–7.Google Scholar

Takeda, T., and Takasu, A., 2007. Updatenews: a news clustering and summarization system using efficient text processing. In Proceedings of the 7th ACM/IEEE CS Joint Conference on Digital Libraries, New York, USA: ACM, pp. 438–9.Google Scholar

Tan, P. N., Steinbach, M., and Kumar, V., 2005. Introduction to Data Mining. Massachusetts, USA: Addison-Wesley Longman Publishing Co..Google Scholar

Tu, L., and Chen, Y., 2009. Stream data clustering based on grid density and attraction. ACM Transactions on Knowledge Discovery from Data 3 : 1–27.Google Scholar

Tukey, J. W. 1953. The problem of multiple comparisons. In The Collected Works of John W. Tukey VIII. Multiple Comparisons: 1948–1983, pp. 1–300. New York, USA: Chapman and Hall.Google Scholar

Tukey, J. W., 1977. Exploratory Data Analysis. Massachusetts, USA: Pearson Education, Inc.Google Scholar

van Rijsbergen, C. J. 1979. Information Retrieval, 2 edn. London, England: Butterworths.Google Scholar

Xu, W., Liu, X., and Gong, Y. 2003. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR 2003). New York, USA: ACM, pp. 267–73.Google Scholar

Yang, T., Jin, R., Chi, Y., and Zhu, S., 2009. Combining link and content for community detection: a discriminative approach. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09. New York, USA: ACM, pp. 927–36.Google Scholar

Zha, H., He, X., Ding, C., Simon, H., and Gu, M. 2001. Spectral relaxation for k-means clustering. In The Proceedings of the 2001 Neural Information Processing Systems (NIPS) Conference, vol. 14, pp. 1057–64. Advances in Neural Information Processing Systems. Massachusetts, USA: MIT Press.Google Scholar

Zhang, D., Wang, J., and Si, L., 2011. Document clustering with universum. In Proceedings of the 34th International ACM Conference on Research and Development in Information Retrieval (SIGIR 2011), New York, USA: ACM, pp. 873–82.Google Scholar

Zhou, Y., Cheng, H., and Yu, J. X., 2009. Graph clustering based on structural/attribute similarities. In Proceedings of the VLDB Endowment, vol. 2. Very Large Data Base Endowment Inc (VLDB Endowment), Lyon, France, pp. 718–29.Google Scholar

Zhao, Y., and Karypis, G., 2004. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning 55 : 311–31.Google Scholar

Zhao, Y., Karypis, G., and Fayyad, U., 2005. Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery 10 : 141–68.Google Scholar

Article contents

Silhouette + attraction: A simple and effective method for text clustering†

Abstract

Access options

Footnotes

References

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests