eqdist.etest.Rd
Performs the nonparametric multisample E-statistic (energy) test for equality of multivariate distributions.
The k-sample multivariate \(\mathcal{E}\)-test of equal distributions is performed. The statistic is computed from the original pooled samples, stacked in matrix x where each row is a multivariate observation, or the corresponding distance matrix. The first sizes[1] rows of x are the first sample, the next sizes[2] rows of x are the second sample, etc.
The test is implemented by nonparametric bootstrap, an approximate permutation test with R replicates.
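The resampling step can be illustrated with a minimal base R sketch. It is not the package's implementation: perm_test, stat_fn, and mean_gap are hypothetical names introduced here, and the toy statistic only stands in for the E-statistic. The sketch counts the observed arrangement among the R + 1 arrangements, so its smallest attainable p-value is 1/(R + 1), for example 0.005 when R = 199.

```r
# Sketch of an approximate permutation test (not the energy package's code).
# `stat_fn(x, sizes)` is a hypothetical callback that returns a scalar
# statistic for the pooled sample matrix `x`, split according to `sizes`.
perm_test <- function(x, sizes, stat_fn, R = 199) {
  x <- as.matrix(x)
  observed <- stat_fn(x, sizes)
  # Recompute the statistic on R random relabelings of the pooled rows.
  reps <- replicate(R, stat_fn(x[sample(nrow(x)), , drop = FALSE], sizes))
  # Large values are significant; the observed sample counts as one arrangement.
  p_value <- (1 + sum(reps >= observed)) / (R + 1)
  list(statistic = observed, p.value = p_value)
}

# Toy two-sample statistic, for illustration only:
# squared difference between the first-column means.
mean_gap <- function(x, sizes) {
  first <- seq_len(sizes[1])
  (mean(x[first, 1]) - mean(x[-first, 1]))^2
}

set.seed(1)
x <- rbind(matrix(rnorm(40), ncol = 2),
           matrix(rnorm(40, mean = 3), ncol = 2))
res <- perm_test(x, sizes = c(20, 20), stat_fn = mean_gap, R = 199)
res$p.value
```

Passing a different stat_fn changes the statistic without touching the resampling logic.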
The function eqdist.e returns the test statistic only; it simply passes the arguments through to eqdist.etest with R = 0.
The k-sample multivariate \(\mathcal{E}\)-statistic for testing equal distributions is returned. The statistic is computed from the original pooled samples, stacked in matrix x where each row is a multivariate observation, or from the distance matrix x of the original data. The first sizes[1] rows of x are the first sample, the next sizes[2] rows of x are the second sample, etc.
The two-sample \(\mathcal{E}\)-statistic proposed by Szekely and Rizzo (2004) is the e-distance \(e(S_i,S_j)\), defined for two samples \(S_i, S_j\) of size \(n_i, n_j\) by $$e(S_i,S_j)=\frac{n_i n_j}{n_i+n_j}[2M_{ij}-M_{ii}-M_{jj}], $$ where $$M_{ij}=\frac{1}{n_i n_j}\sum_{p=1}^{n_i} \sum_{q=1}^{n_j} \|X_{ip}-X_{jq}\|,$$ \(\|\cdot\|\) denotes Euclidean norm, and \(X_{ip}\) denotes the p-th observation in the i-th sample.
The original (default method) k-sample \(\mathcal{E}\)-statistic is defined by summing the pairwise e-distances over all \(k(k-1)/2\) pairs of samples: $$\mathcal{E}=\sum_{1 \leq i < j \leq k} e(S_i,S_j). $$ Large values of \(\mathcal{E}\) are significant.
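The two definitions above translate almost directly into base R. The following is a sketch under stated assumptions, not the package's code: mean_dist, e_dist, and e_stat are hypothetical helper names, and sum(sizes) is assumed to equal the number of rows of x.

```r
# M_ij: mean Euclidean distance between the rows of a and the rows of b.
mean_dist <- function(a, b) {
  d <- as.matrix(dist(rbind(a, b)))
  na <- nrow(a)
  mean(d[seq_len(na), na + seq_len(nrow(b)), drop = FALSE])
}

# Two-sample e-distance e(S_i, S_j) = n_i n_j / (n_i + n_j) * [2 M_ij - M_ii - M_jj].
e_dist <- function(a, b) {
  na <- nrow(a)
  nb <- nrow(b)
  na * nb / (na + nb) *
    (2 * mean_dist(a, b) - mean_dist(a, a) - mean_dist(b, b))
}

# k-sample statistic: sum of e-distances over all k(k-1)/2 pairs of samples.
# Assumes sum(sizes) == nrow(x) and that samples are stacked in order.
e_stat <- function(x, sizes) {
  x <- as.matrix(x)
  labels <- rep(seq_along(sizes), sizes)
  groups <- lapply(seq_along(sizes),
                   function(i) x[labels == i, , drop = FALSE])
  pairs <- combn(length(sizes), 2)
  sum(apply(pairs, 2,
            function(ij) e_dist(groups[[ij[1]]], groups[[ij[2]]])))
}

# Tiny one-dimensional check: two points at 0 and two points at 1.
a <- matrix(c(0, 0), ncol = 1)
b <- matrix(c(1, 1), ncol = 1)
e_dist(a, b)                    # M_ab = 1, M_aa = M_bb = 0, so e = 1 * 2 = 2
e_stat(rbind(a, b), c(2, 2))    # one pair only, so the same value
```

The same helper can be applied to the iris data, e.g. e_stat(iris[, 1:4], c(50, 50, 50)), for comparison with the statistic reported by eqdist.etest.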
The discoB method computes the between-sample disco statistic. For a one-way analysis, it is related to the original statistic as follows. In the e-distance \(e(S_i,S_j)\) defined above, the weights \(\frac{n_i n_j}{n_i+n_j}\) are replaced with
$$\frac{n_i + n_j}{2N}\frac{n_i n_j}{n_i+n_j} = \frac{n_i n_j}{2N},$$
where N is the total number of observations: \(N=n_1+\cdots+n_k\).
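The reweighting amounts to multiplying each pairwise e-distance by \((n_i+n_j)/(2N)\) before summing. A small base R sketch of that arithmetic follows; between_from_edist is a hypothetical helper, not a function of the package, and the input value 1.9307 is the rounded E-statistic printed in the two-sample example of this page.

```r
# pair_e: k x k symmetric matrix of pairwise e-distances e(S_i, S_j)
# sizes:  the k sample sizes n_1, ..., n_k
between_from_edist <- function(pair_e, sizes) {
  N <- sum(sizes)
  k <- length(sizes)
  total <- 0
  for (i in seq_len(k - 1)) {
    for (j in (i + 1):k) {
      # Factor (n_i + n_j) / (2N) turns the weight n_i n_j / (n_i + n_j)
      # into n_i n_j / (2N).
      total <- total + (sizes[i] + sizes[j]) / (2 * N) * pair_e[i, j]
    }
  }
  total
}

# Two samples of size 20 each: the factor is 40 / 80 = 1/2.
pair_e <- matrix(c(0, 1.9307, 1.9307, 0), nrow = 2)
between_from_edist(pair_e, c(20, 20))
```

With two equal-sized samples the between-sample statistic is half the default statistic, which agrees with the value 0.96533 printed by disco.between in the examples, up to rounding of the printed E-statistic.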
The discoF method is based on the disco F ratio, while the discoB method is based on the between-sample component. Also see the disco and disco.between functions.
A list with class htest containing
method: description of test
statistic: observed value of the test statistic
p.value: approximate p-value of the test
data.name: description of data
eqdist.e returns the test statistic only.
The pairwise e-distances between samples can be conveniently computed by the edist function, which returns a dist object.
Szekely, G. J. and Rizzo, M. L. (2004) Testing for Equal Distributions in High Dimension, InterStat, November (5).
Rizzo, M. L. and Szekely, G. J. (2010) DISCO Analysis: A Nonparametric Extension of Analysis of Variance, Annals of Applied Statistics, Vol. 4, No. 2, 1034-1055. doi:10.1214/09-AOAS245
Szekely, G. J. (2000) Technical Report 03-05: \(\mathcal{E}\)-statistics: Energy of Statistical Samples, Department of Mathematics and Statistics, Bowling Green State University.
ksample.e, edist, disco, disco.between, energy.hclust.
data(iris)
## test if the 3 varieties of iris data (d=4) have equal distributions
eqdist.etest(iris[,1:4], c(50,50,50), R = 199)
#>
#> Multivariate 3-sample E-test of equal distributions
#>
#> data: sample sizes 50 50 50, replicates 199
#> E-statistic = 357.71, p-value = 0.005
#>
## example using a distance matrix, with edist and disco for comparison
x <- matrix(rnorm(100), nrow=20)
y <- matrix(rnorm(100), nrow=20)
X <- rbind(x, y)
d <- dist(X)
# should match edist default statistic
set.seed(1234)
eqdist.etest(d, sizes=c(20, 20), distance=TRUE, R = 199)
#>
#> 2-sample E-test of equal distributions
#>
#> data: sample sizes 20 20, replicates 199
#> E-statistic = 1.9307, p-value = 0.93
#>
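# comparison with edist; note that sizes=c(20, 10) differs from the
# c(20, 20) used above, so this value differs from the E-statistic above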
edist(d, sizes=c(20, 10), distance=TRUE)
#> 1
#> 2 1.954117
# for comparison
g <- as.factor(rep(1:2, c(20, 20)))
set.seed(1234)
disco(d, factors=g, distance=TRUE, R=199)
#> disco(x = d, factors = g, distance = TRUE, R = 199)
#>
#> Distance Components: index 1.00
#> Source Df Sum Dist Mean Dist F-ratio p-value
#> factors 1 0.96533 0.96533 0.625 0.93
#> Within 38 58.67770 1.54415
#> Total 39 59.64303
# should match the "factors" Sum Dist reported by disco above
# (the between-sample statistic used by method="discoB")
set.seed(1234)
disco.between(d, factors=g, distance=TRUE, R=199)
#>
#> DISCO (Between-sample)
#>
#> data: d
#> DISCO between statistic = 0.96533, p-value = 0.93
#>