TY - JOUR
T1 - Effects of some design factors on the distribution of similarity indices in cluster analysis
AU - Albatineh, Ahmed N.
AU - Khan, Hafiz M.R.
AU - Zogheib, Bashar
AU - Kibria, Golam B.M.
N1 - Publisher Copyright: © 2017 Taylor & Francis Group, LLC.
PY - 2017/5/28
Y1 - 2017/5/28
AB - This article investigates the effects of the number of clusters, cluster size, and correction for chance agreement on the distribution of two similarity indices, namely the Jaccard and Rand indices. Skewness and kurtosis are calculated for the two indices and their corrected forms and then compared with those of the normal distribution. Three clustering algorithms are implemented: complete linkage, Ward, and K-means. Data were randomly generated from bivariate normal distributions with specified means and variance-covariance matrices. Three-way ANOVA is performed to assess the significance of the design factors using skewness and kurtosis of the indices as responses. Test statistics for testing skewness and kurtosis and observed power are calculated. Simulation results showed that, independent of the clustering algorithm or similarity index used, the cluster size × number of clusters interaction and the main effects of cluster size and number of clusters were always significant for skewness and kurtosis. The three-way interaction of cluster size × correction × number of clusters was significant for skewness of Rand and Jaccard indices using all clustering algorithms, but was not significant using Ward's method for both Rand and Jaccard indices, while significant for Jaccard only using complete linkage and K-means algorithms. The correction for chance agreement was significant for skewness and kurtosis using Rand and Jaccard indices when the complete linkage method is used. Hence, such design factors must be taken into consideration when studying the distribution of such indices.
KW - Correction for Chance Agreement
KW - Distribution
KW - Jaccard
KW - Rand
KW - Similarity indices
KW - Simulations
UR - http://www.scopus.com/inward/record.url?scp=85013040928&partnerID=8YFLogxK
U2 - 10.1080/03610918.2015.1082586
DO - 10.1080/03610918.2015.1082586
M3 - Article
SN - 0361-0918
VL - 46
SP - 4018
EP - 4034
JO - Communications in Statistics Part B: Simulation and Computation
JF - Communications in Statistics Part B: Simulation and Computation
IS - 5
ER -