TY - JOUR
T1 - How many clusters exist? Answer via maximum clustering similarity implemented in R
AU - Albatineh, Ahmed N.
AU - Wilcox, Meredith L.
AU - Zogheib, Bashar
AU - Niewiadomska-Bugaj, Magdalena
N1 - Publisher Copyright: © 2019 International Biometric Society–Chinese Region.
PY - 2019
Y1 - 2019
N2 - Finding the number of clusters in a data set is considered one of the fundamental problems in cluster analysis. This paper integrates maximum clustering similarity (MCS), a method for finding the optimal number of clusters, into the R© statistical software through the package MCSim. The similarity between two clustering methods is calculated at the same number of clusters, using the Rand [Objective criteria for the evaluation of clustering methods. J Am Stat Assoc. 1971;66:846–850.] and Jaccard [The distribution of the flora of the alpine zone. New Phytologist. 1912;11:37–50.] indices, corrected for chance agreement. The number of clusters at which the index most frequently attains its maximum is a candidate for the optimal number of clusters. Unlike other criteria, MCS can be used with circular data. Seven clustering algorithms available in R© are implemented in MCSim. A graph of the number of clusters vs. cluster similarity, based on the corrected similarity indices, is produced, along with the values of the similarity indices and a clustering tree (dendrogram). Several examples, including simulated, real, and circular data sets, are presented to show how MCSim works in practice.
AB - Finding the number of clusters in a data set is considered one of the fundamental problems in cluster analysis. This paper integrates maximum clustering similarity (MCS), a method for finding the optimal number of clusters, into the R© statistical software through the package MCSim. The similarity between two clustering methods is calculated at the same number of clusters, using the Rand [Objective criteria for the evaluation of clustering methods. J Am Stat Assoc. 1971;66:846–850.] and Jaccard [The distribution of the flora of the alpine zone. New Phytologist. 1912;11:37–50.] indices, corrected for chance agreement. The number of clusters at which the index most frequently attains its maximum is a candidate for the optimal number of clusters. Unlike other criteria, MCS can be used with circular data. Seven clustering algorithms available in R© are implemented in MCSim. A graph of the number of clusters vs. cluster similarity, based on the corrected similarity indices, is produced, along with the values of the similarity indices and a clustering tree (dendrogram). Several examples, including simulated, real, and circular data sets, are presented to show how MCSim works in practice.
KW - Circular data
KW - Clustering algorithm
KW - Correction for chance agreement
KW - Number of clusters
KW - Similarity index
UR - http://www.scopus.com/inward/record.url?scp=85076350984&partnerID=8YFLogxK
U2 - 10.1080/24709360.2019.1615770
DO - 10.1080/24709360.2019.1615770
M3 - Article
SN - 2470-9360
VL - 3
SP - 62
EP - 79
JO - Biostatistics and Epidemiology
JF - Biostatistics and Epidemiology
IS - 1
ER -