Multisample E-statistic (Energy) Test of Equal Distributions

Performs the nonparametric multisample E-statistic (energy) test for equality of multivariate distributions.

eqdist.etest(x, sizes, distance = FALSE,
    method=c("original","discoB","discoF"), R)
eqdist.e(x, sizes, distance = FALSE,
    method=c("original","discoB","discoF"))
ksample.e(x, sizes, distance = FALSE,
    method=c("original","discoB","discoF"), ix = 1:sum(sizes))

Arguments

x: data matrix of pooled sample
sizes: vector of sample sizes
distance: logical: if TRUE, first argument is a distance matrix
method: use original (default) or distance components (discoB, discoF)
R: number of bootstrap replicates
ix: a permutation of the row indices of x

Details

The k-sample multivariate $\mathcal{E}$-test of equal distributions is performed. The statistic is computed from the original pooled samples, stacked in matrix x where each row is a multivariate observation, or the corresponding distance matrix. The first sizes[1] rows of x are the first sample, the next sizes[2] rows of x are the second sample, etc.
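The row layout described above can be sketched in base R. The data here are simulated for illustration only, and the object names (s1, s2, s3, second) are hypothetical.

```r
# Three samples with the same number of columns (simulated data for illustration).
set.seed(1)
s1 <- matrix(rnorm(30), nrow = 10)            # 10 observations, d = 3
s2 <- matrix(rnorm(45), nrow = 15)            # 15 observations
s3 <- matrix(rnorm(60, mean = 1), nrow = 20)  # 20 observations

x <- rbind(s1, s2, s3)                        # pooled sample, one observation per row
sizes <- c(nrow(s1), nrow(s2), nrow(s3))      # sample sizes in the same order as the rows

# Rows belonging to the second sample: the sizes[2] rows after the first sizes[1].
second <- x[sizes[1] + seq_len(sizes[2]), , drop = FALSE]
```

The objects x and sizes can then be passed as the first two arguments, as in the Examples section.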

The test is implemented by nonparametric bootstrap, an approximate permutation test with R replicates.
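The idea of an approximate permutation test can be sketched in base R as follows. This is a minimal illustration and not the package's implementation: the helper names perm_test and mean_gap are hypothetical, the toy statistic stands in for the E-statistic to keep the example short, and the p-value convention (counting the observed statistic among the replicates) is an assumption.

```r
# Approximate permutation test for a statistic of a pooled sample (illustrative sketch).
# stat: function(x, sizes) returning a number; larger values are more extreme.
perm_test <- function(x, sizes, stat, R = 199) {
  x <- as.matrix(x)
  observed <- stat(x, sizes)
  # Permuting the rows reassigns observations to samples at random.
  reps <- replicate(R, stat(x[sample(nrow(x)), , drop = FALSE], sizes))
  # Assumed convention: count the observed statistic among the replicates.
  p <- (1 + sum(reps >= observed)) / (R + 1)
  list(statistic = observed, p.value = p, replicates = reps)
}

# Toy statistic for the demo: distance between the two sample mean vectors.
mean_gap <- function(x, sizes) {
  g <- rep(seq_along(sizes), times = sizes)
  m1 <- colMeans(x[g == 1, , drop = FALSE])
  m2 <- colMeans(x[g == 2, , drop = FALSE])
  sqrt(sum((m1 - m2)^2))
}

set.seed(1)
x <- rbind(matrix(rnorm(40), nrow = 20), matrix(rnorm(40, mean = 2), nrow = 20))
res <- perm_test(x, c(20, 20), mean_gap, R = 199)
res$p.value
```

Under this convention the smallest attainable p-value with R = 199 is 1/200 = 0.005.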

The function eqdist.e returns the test statistic only; it simply passes the arguments through to eqdist.etest with R = 0.

The value returned is the k-sample multivariate $\mathcal{E}$-statistic for testing equal distributions, computed from the pooled data matrix x or from the distance matrix x of the original data, with rows ordered by sample as described above.

The two-sample $\mathcal{E}$-statistic proposed by Szekely and Rizzo (2004) is the e-distance $e(S_i,S_j)$, defined for two samples $S_i, S_j$ of sizes $n_i$ and $n_j$ by $$e(S_i,S_j)=\frac{n_i n_j}{n_i+n_j}[2M_{ij}-M_{ii}-M_{jj}], $$ where $$M_{ij}=\frac{1}{n_i n_j}\sum_{p=1}^{n_i} \sum_{q=1}^{n_j} \|X_{ip}-X_{jq}\|,$$ $\|\cdot\|$ denotes the Euclidean norm, and $X_{ip}$ denotes the p-th observation in the i-th sample.
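The formula above can be written out directly in base R. The function name edist2 is a hypothetical helper for illustration, not a function of the package, and the sketch favors readability over speed.

```r
# Two-sample e-distance in base R (illustrative helper, not part of the package).
# x, y: numeric matrices with one observation per row and the same number of columns.
edist2 <- function(x, y) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  n1 <- nrow(x)
  n2 <- nrow(y)
  d <- as.matrix(dist(rbind(x, y)))   # Euclidean distances within the pooled sample
  i1 <- seq_len(n1)
  i2 <- n1 + seq_len(n2)
  m12 <- mean(d[i1, i2])              # M_ij: mean between-sample distance
  m11 <- mean(d[i1, i1])              # M_ii: includes the zero diagonal, as in the formula
  m22 <- mean(d[i2, i2])              # M_jj
  n1 * n2 / (n1 + n2) * (2 * m12 - m11 - m22)
}

set.seed(1)
x <- matrix(rnorm(40), nrow = 20)
y <- matrix(rnorm(40, mean = 1), nrow = 20)
edist2(x, y)
```

For two single observations at 0 and 1, the formula gives (1/2)(2 · 1 − 0 − 0) = 1, which is a quick sanity check on the weights.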

The original (default method) k-sample $\mathcal{E}$-statistic is defined by summing the pairwise e-distances over all $k(k-1)/2$ pairs of samples: $$\mathcal{E}=\sum_{1 \leq i < j \leq k} e(S_i,S_j). $$ Large values of $\mathcal{E}$ are significant.
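The sum over pairs can be sketched in base R as below. The function name estat is hypothetical and the code is an illustration of the definition rather than the package's implementation; its result on the iris data can be compared with the E-statistic printed in the Examples section.

```r
# k-sample E-statistic from a pooled sample (illustrative base R sketch).
# x: pooled data matrix, one observation per row; sizes: sample sizes in row order.
estat <- function(x, sizes) {
  d <- as.matrix(dist(as.matrix(x)))
  stopifnot(sum(sizes) == nrow(d))
  g <- rep(seq_along(sizes), times = sizes)   # sample label for each row
  k <- length(sizes)
  total <- 0
  for (i in seq_len(k - 1)) {
    for (j in (i + 1):k) {
      a <- which(g == i)
      b <- which(g == j)
      w <- sizes[i] * sizes[j] / (sizes[i] + sizes[j])
      # Weighted e-distance between samples i and j.
      total <- total + w * (2 * mean(d[a, b]) - mean(d[a, a]) - mean(d[b, b]))
    }
  }
  total
}

e_iris <- estat(iris[, 1:4], c(50, 50, 50))
e_iris
```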

The discoB method computes the between-sample disco statistic. For a one-way analysis, it is related to the original statistic as follows. In the above equation, the weights $\frac{n_i n_j}{n_i+n_j}$ are replaced with $$\frac{n_i + n_j}{2N}\frac{n_i n_j}{n_i+n_j} = \frac{n_i n_j}{2N}$$ where N is the total number of observations: $N=n_1+\cdots+n_k$.
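
As a worked check of this relation (a sketch derived from the weights given here, not taken from the references), consider the two-sample case $k=2$, where $N=n_1+n_2$. The discoB weight is then $$\frac{n_1 n_2}{2N}=\frac{1}{2}\cdot\frac{n_1 n_2}{n_1+n_2},$$ so the between-sample statistic equals $\mathcal{E}/2$. In the Examples section, the two-sample E-statistic 1.9307 and the DISCO between statistic 0.96533 agree with this relation up to rounding.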

The discoF method is based on the disco F ratio, while the discoB method is based on the between sample component.

See also the disco and disco.between functions.

Value

A list with class htest containing

method: description of test
statistic: observed value of the test statistic
p.value: approximate p-value of the test
data.name: description of data

eqdist.e returns the test statistic only.

Note

The pairwise e-distances between samples can be conveniently computed by the edist function, which returns a dist object.

References

Szekely, G. J. and Rizzo, M. L. (2004) Testing for Equal Distributions in High Dimension, InterStat, November (5).

Rizzo, M. L. and Szekely, G. J. (2010) DISCO Analysis: A Nonparametric Extension of Analysis of Variance, Annals of Applied Statistics, Vol. 4, No. 2, 1034-1055. doi:10.1214/09-AOAS245

Szekely, G. J. (2000) Technical Report 03-05: $\mathcal{E}$-statistics: Energy of Statistical Samples, Department of Mathematics and Statistics, Bowling Green State University.

Author

Maria L. Rizzo mrizzo@bgsu.edu and Gabor J. Szekely

Examples

 data(iris)

 ## test if the 3 varieties of iris data (d=4) have equal distributions
 eqdist.etest(iris[,1:4], c(50,50,50), R = 199)
#> 
#> 	Multivariate 3-sample E-test of equal distributions
#> 
#> data:  sample sizes 50 50 50, replicates 199
#> E-statistic = 357.71, p-value = 0.005
#> 

 ## example comparing eqdist.etest with edist, disco, and disco.between
  x <- matrix(rnorm(100), nrow=20)
  y <- matrix(rnorm(100), nrow=20)
  X <- rbind(x, y)
  d <- dist(X)

  # should match edist default statistic
  set.seed(1234)
  eqdist.etest(d, sizes=c(20, 20), distance=TRUE, R = 199)
#> 
#> 	2-sample E-test of equal distributions
#> 
#> data:  sample sizes 20 20, replicates 199
#> E-statistic = 1.9307, p-value = 0.93
#> 

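  # comparison with edist
  # note: sizes = c(20, 10) sums to 30, not 40, so the value below differs
  # from the E-statistic above; sizes = c(20, 20) describes the full sample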
  edist(d, sizes=c(20, 10), distance=TRUE)
#>          1
#> 2 1.954117

  # for comparison
  g <- as.factor(rep(1:2, c(20, 20)))
  set.seed(1234)
  disco(d, factors=g, distance=TRUE, R=199)
#> disco(x = d, factors = g, distance = TRUE, R = 199)
#> 
#> Distance Components: index  1.00
#> Source            Df   Sum Dist  Mean Dist   F-ratio   p-value
#> factors            1    0.96533    0.96533     0.625      0.93
#> Within            38   58.67770    1.54415
#> Total             39   59.64303

  # should match the "factors" Sum Dist reported by disco, above
  # (the between-sample statistic that method="discoB" is based on)
  set.seed(1234)
  disco.between(d, factors=g, distance=TRUE, R=199)
#> 
#> 	DISCO (Between-sample)
#> 
#> data:  d
#> DISCO between statistic = 0.96533, p-value = 0.93
#>