Perform k-groups clustering by energy distance.

kgroups(x, k, iter.max = 10, nstart = 1, cluster = NULL)

Arguments

x

Data frame or data matrix or distance object

k

number of clusters

iter.max

maximum number of iterations

nstart

number of restarts

cluster

initial clustering vector

Details

K-groups is based on the multisample energy distance for comparing distributions. Based on the disco decomposition of total dispersion (a Gini-type mean distance), the objective function either maximizes the total between-cluster energy distance or, equivalently, minimizes the total within-cluster energy distance. It is more computationally efficient to minimize the within-cluster distances, which makes it possible to implement K-groups clustering with a modified version of the Hartigan-Wong (1979) algorithm.

The within-cluster Gini mean distance is $$G(C_j) = \frac{1}{n_j^2} \sum_{i,m=1}^{n_j} |x_{i,j} - x_{m,j}|$$ and the K-groups within-cluster distance is $$W_j = \frac{n_j}{2}G(C_j) = \frac{1}{2 n_j} \sum_{i,m=1}^{n_j} |x_{i,j} - x_{m,j}|.$$ If z is the data matrix for cluster \(C_j\), then \(W_j\) could be computed as sum(dist(z)) / nrow(z).
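
As a check of this decomposition, the within component returned by kgroups can be recomputed directly from the cluster assignments. The sketch below uses the iris data from the Examples section; the object names (kg, W_check) are illustrative only.

  # recompute each within-cluster distance W_j from the cluster assignments
  x <- as.matrix(iris[, 1:4])
  kg <- kgroups(x, k = 3)
  W_check <- sapply(split(as.data.frame(x), kg$cluster), function(z) {
    sum(dist(z)) / nrow(z)   # W_j = (sum of pairwise distances in cluster j) / n_j
  })
  W_check                    # should agree with kg$within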

If cluster is not NULL, the clusters are initialized by this vector (which can be a factor or an integer vector). Otherwise the clusters are initialized with random labels in k clusters of approximately equal size.
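
As an illustration of the cluster argument (a sketch only; the hierarchical-clustering start below is not part of the package examples), an initial labeling produced by another method can be supplied:

  # seed kgroups with an initial labeling, here taken from hierarchical clustering
  x <- as.matrix(iris[, 1:4])
  init <- cutree(hclust(dist(x)), k = 3)    # integer vector of initial labels
  kg0 <- kgroups(x, k = 3, cluster = init)  # bypasses the random initialization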

If x is not a distance object (class(x) == "dist") then x is converted to a data matrix for analysis.

The algorithm runs up to iter.max complete passes through the data set, stopping early once a local minimum is reached. If nstart > 1, the clusters are initialized at random on the second and later starts, and the best result is returned.
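
For instance (an illustrative sketch; the seed and object names are arbitrary), several random starts can be requested, and the start with the smallest total within-cluster distance W is the one returned:

  # with nstart > 1, the best (smallest W) of the random starts is returned
  x <- as.matrix(iris[, 1:4])
  set.seed(123)
  kg5 <- kgroups(x, k = 3, iter.max = 10, nstart = 5)
  kg5$W   # total within-cluster distance of the best start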

Value

An object of class kgroups containing the components

call

the function call

cluster

vector of cluster indices

sizes

cluster sizes

within

vector of Gini within cluster distances

W

sum of within cluster distances

count

number of moves

iterations

number of iterations

k

number of clusters

cluster is a vector containing the group labels, 1 to k. print.kgroups prints some of the components of the kgroups object.

Expect count to be 0 if the algorithm converged to a local minimum (that is, no moves were made on the last pass). If iterations equals iter.max and count is positive, then the algorithm did not converge to a local minimum.
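
For example (a sketch assuming a fitted kgroups object kg, as in the Examples below), convergence and the decomposition of W can be checked directly from these components:

  # inspect the returned components of a fitted kgroups object
  kg$count == 0                    # TRUE: no moves on the last pass, a local minimum
  kg$iterations                    # compare with iter.max to see whether the limit was hit
  all.equal(sum(kg$within), kg$W)  # W is the sum of the within-cluster distances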

Author

Maria Rizzo and Songzi Li

References

Li, Songzi (2015). "K-groups: A Generalization of K-means by Energy Distance." Ph.D. thesis, Bowling Green State University.

Li, S. and Rizzo, M. L. (2017). "K-groups: A Generalization of K-means Clustering". ArXiv e-print 1711.04359. https://arxiv.org/abs/1711.04359

Szekely, G. J. and Rizzo, M. L. (2004). "Testing for equal distributions in high dimension." InterStat 5, no. 16.10.

Rizzo, M. L. and Szekely, G. J. (2010). "DISCO analysis: A nonparametric extension of analysis of variance." The Annals of Applied Statistics, 1034-1055.

Hartigan, J. A. and Wong, M. A. (1979). "Algorithm AS 136: A K-means clustering algorithm." Applied Statistics, 28, 100-108. doi: 10.2307/2346830.

Examples

  x <- as.matrix(iris[ ,1:4])
  set.seed(123)
  kg <- kgroups(x, k = 3, iter.max = 5, nstart = 2)
  kg
#> 
#> kgroups(x = x, k = 3, iter.max = 5, nstart = 2)
#> 
#> K-groups cluster analysis
#> 3  groups of size  50 38 62 
#> Within cluster distances:
#>  17.07201 18.92376 31.53301
#> Iterations:  3   Count:  0 
  fitted(kg)
#>   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#>  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
#>  [75] 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2
#> [112] 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 2 2 2 3 2
#> [149] 2 3
  
  # \donttest{
    d <- dist(x)
    set.seed(123)
    kg <- kgroups(d, k = 3, iter.max = 5, nstart = 2)
    kg
#> 
#> kgroups(x = d, k = 3, iter.max = 5, nstart = 2)
#> 
#> K-groups cluster analysis
#> 3  groups of size  50 38 62 
#> Within cluster distances:
#>  17.07201 18.92376 31.53301
#> Iterations:  3   Count:  0 
    
    kg$cluster
#>   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#>  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
#>  [75] 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2
#> [112] 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 2 2 2 3 2
#> [149] 2 3
  
    fitted(kg)
#>   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#>  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
#>  [75] 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2
#> [112] 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 2 2 2 3 2
#> [149] 2 3
    fitted(kg, method = "groups")
#> [[1]]
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
#> 
#> [[2]]
#>  [1]  53  78 101 103 104 105 106 108 109 110 111 112 113 116 117 118 119 121 123
#> [20] 125 126 129 130 131 132 133 135 136 137 138 140 141 142 144 145 146 148 149
#> 
#> [[3]]
#>  [1]  51  52  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70
#> [20]  71  72  73  74  75  76  77  79  80  81  82  83  84  85  86  87  88  89  90
#> [39]  91  92  93  94  95  96  97  98  99 100 102 107 114 115 120 122 124 127 128
#> [58] 134 139 143 147 150
#> 
    # }