E-distance

Returns the E-distances (energy statistics) between clusters.

edist(x, sizes, distance = FALSE, ix = 1:sum(sizes), alpha = 1,
        method = c("cluster","discoB"))

Arguments

x: data matrix of pooled sample or Euclidean distances
sizes: vector of sample sizes
distance: logical: if TRUE, x is a distance matrix
ix: a permutation of the row indices of x
alpha: distance exponent in (0,2]
method: how to weight the statistics

Details

The function returns the pairwise two-sample multivariate $\mathcal{E}$-statistics for comparing clusters or samples. The e-distances between clusters are computed from the original pooled data, stacked in the matrix x with one multivariate observation per row, or from the distance matrix x of the original data, or from a distance object returned by dist. The first sizes[1] rows of the original data matrix are the first sample, the next sizes[2] rows are the second sample, and so on. The permutation vector ix may be used to obtain e-distances corresponding to a clustering solution at a given level in the hierarchy.
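The row layout described above can be sketched in base R. This is an illustration only: the toy sizes, the variable names, and the reading of ix as consecutive blocks of row indices (one block per cluster) are assumptions made for the example, not taken from the package source.

```r
# Toy layout: 5 pooled observations, two samples of sizes 2 and 3.
sizes <- c(2, 3)

# Sample label for each position, in order: 1 1 2 2 2.
labels <- rep(seq_along(sizes), times = sizes)

# Default ix = 1:sum(sizes): consecutive row blocks form the samples.
ix <- seq_len(sum(sizes))
blocks_default <- split(ix, labels)   # sample 1: rows 1-2, sample 2: rows 3-5

# Assumed reading of a non-default ix: its consecutive blocks list the
# rows of x that belong to each cluster.
ix_perm <- c(4, 1, 5, 2, 3)
blocks_perm <- split(ix_perm, labels) # cluster 1: rows 4, 1; cluster 2: rows 5, 2, 3
```

Under that reading, a clustering solution is passed by ordering the row indices cluster by cluster in ix and giving the cluster sizes in sizes.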

The default method cluster summarizes the e-distances between clusters in a table. The e-distance $e(C_i,C_j)$ between two clusters $C_i, C_j$ of sizes $n_i, n_j$, proposed by Szekely and Rizzo (2005), is defined by $$e(C_i,C_j)=\frac{n_i n_j}{n_i+n_j}[2M_{ij}-M_{ii}-M_{jj}], $$ where $$M_{ij}=\frac{1}{n_i n_j}\sum_{p=1}^{n_i} \sum_{q=1}^{n_j} \|X_{ip}-X_{jq}\|^\alpha,$$ $\|\cdot\|$ denotes the Euclidean norm, $\alpha=$ alpha, and $X_{ip}$ denotes the p-th observation in the i-th cluster. The exponent alpha should be in the interval (0,2].
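To make the formula concrete, here is a minimal base-R sketch that evaluates $e(C_i,C_j)$ directly from the definition. The helper name e_dist_pair is hypothetical and is not part of the package; it is meant only as a reading aid and makes no claim about how edist is implemented internally.

```r
# Hypothetical helper (not part of any package): e-distance between two
# samples, computed term by term from the definition above.
e_dist_pair <- function(xi, xj, alpha = 1) {
  xi <- as.matrix(xi)
  xj <- as.matrix(xj)
  ni <- nrow(xi)
  nj <- nrow(xj)
  # Pairwise Euclidean distances of the pooled sample, raised to alpha.
  d <- as.matrix(dist(rbind(xi, xj)))^alpha
  i_rows <- seq_len(ni)
  j_rows <- ni + seq_len(nj)
  m_ij <- mean(d[i_rows, j_rows])  # between-sample mean, M_ij
  m_ii <- mean(d[i_rows, i_rows])  # within-sample mean, M_ii (zero diagonal included)
  m_jj <- mean(d[j_rows, j_rows])  # within-sample mean, M_jj
  (ni * nj / (ni + nj)) * (2 * m_ij - m_ii - m_jj)
}

# One-dimensional toy samples {0, 1} and {2, 3}:
# M_ij = 2, M_ii = M_jj = 0.5, coefficient = 1, so e = 2*2 - 0.5 - 0.5 = 3.
e_dist_pair(c(0, 1), c(2, 3))
```

The within-sample means include the zero diagonal because the double sum in $M_{ii}$ runs over all $n_i^2$ pairs.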

The coefficient $\frac{n_i n_j}{n_i+n_j}$ is one-half of the harmonic mean of the sample sizes. The discoB method is related, but it summarizes the pairwise differences between samples with a different weight: the disco methods apply the coefficient $\frac{n_i n_j}{2N}$, where N is the total number of observations. This weights each (i,j) statistic by sample size relative to N. See the disco topic for more details.
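The two weightings can be compared with a short calculation. The sketch below assumes that both methods weight the same bracketed term $2M_{ij}-M_{ii}-M_{jj}$, so that the statistics differ only by the factor $(n_i+n_j)/(2N)$; the variable names are illustrative. With three samples of size 50, that factor is 1/3 for every pair.

```r
sizes <- c(50, 50, 50)
N  <- sum(sizes)
ni <- sizes[1]
nj <- sizes[2]

w_cluster <- ni * nj / (ni + nj)        # 25
w_disco   <- ni * nj / (2 * N)          # 25/3
harmonic_mean <- 2 / (1 / ni + 1 / nj)  # 50, so w_cluster is half of it

# Factor that converts a "cluster" statistic into a "discoB" statistic
# under the assumption stated above: (ni + nj) / (2 * N) = 1/3.
ratio <- w_disco / w_cluster

# "cluster" values for the iris data, copied from the Examples section.
cluster_vals <- c(123.55381, 195.30396, 38.85415)
round(cluster_vals * ratio, 5)
# approximately 41.18460 65.10132 12.95138, the discoB values in the Examples
```

Because all three sample sizes are equal here, a single factor applies to every pair; with unequal sizes the factor would have to be computed pair by pair.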

Value

An object of class dist is returned, containing the lower triangle of the e-distance matrix of cluster distances corresponding to the permutation of indices ix. The method attribute of the distance object is assigned a value of type, index.

References

Szekely, G. J. and Rizzo, M. L. (2005) Hierarchical Clustering via Joint Between-Within Distances: Extending Ward's Minimum Variance Method, Journal of Classification 22(2) 151-183.
doi:10.1007/s00357-005-0012-9

Rizzo, M. L. and Szekely, G. J. (2010) DISCO Analysis: A Nonparametric Extension of Analysis of Variance, Annals of Applied Statistics 4(2) 1034-1055.
doi:10.1214/09-AOAS245

Szekely, G. J. and Rizzo, M. L. (2004) Testing for Equal Distributions in High Dimension, InterStat, November (5).

Szekely, G. J. (2000) Technical Report 03-05, $\mathcal{E}$-statistics: Energy of Statistical Samples, Department of Mathematics and Statistics, Bowling Green State University.

Author

Maria L. Rizzo mrizzo@bgsu.edu and Gabor J. Szekely

Examples

     ## compute cluster e-distances for 3 samples of iris data
     data(iris)
     edist(iris[,1:4], c(50,50,50))
#>           1         2
#> 2 123.55381          
#> 3 195.30396  38.85415
    
     ## pairwise disco statistics
     edist(iris[,1:4], c(50,50,50), method="discoB")  
#>          1        2
#> 2 41.18460         
#> 3 65.10132 12.95138

     ## compute e-distances from a distance object
     data(iris)
     edist(dist(iris[,1:4]), c(50, 50, 50), distance=TRUE, alpha = 1)
#>           1         2
#> 2 123.55381          
#> 3 195.30396  38.85415

     ## compute e-distances from a distance matrix
     data(iris)
     d <- as.matrix(dist(iris[,1:4]))
     edist(d, c(50, 50, 50), distance=TRUE, alpha = 1)
#>           1         2
#> 2 123.55381          
#> 3 195.30396  38.85415