Wednesday, January 22, 2020

[egsdgobq] Multivariate Z statistic

To calculate the p value of an observation from a univariate (one-dimensional) normal distribution, calculate the Z statistic z = (x - mean) / (standard deviation), then integrate the tails of the standard normal distribution beyond |z|.

If you have a multivariate (many-dimensional) normal (Gaussian) distribution with a known mean and covariance matrix, compute the Mahalanobis distance of the observation from the mean and square it.  This statistic has a chi squared distribution with parameter k (degrees of freedom) equal to the number of dimensions, whose upper tail can be integrated to compute a p value.  (Alternatively, don't square the Mahalanobis distance; that statistic then has a chi distribution.)  Pretty simple.

https://stats.stackexchange.com/questions/331283/how-to-calculate-the-probability-of-a-data-point-belonging-to-a-multivariate-nor
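
A minimal sketch of the recipe in Python, using only the standard library.  The helper names (solve, mahalanobis_squared, chi2_sf) and the example mean, covariance, and observation are made up for illustration.  The chi squared tail is computed with the recurrence Q(a+1, z) = Q(a, z) + z^a exp(-z) / Gamma(a+1) for the regularized upper incomplete gamma function, which is one convenient choice for integer k, not the only way to do it.

```python
import math

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def mahalanobis_squared(x, mean, cov):
    """Squared Mahalanobis distance (x - mean)^T cov^-1 (x - mean)."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    return sum(di * yi for di, yi in zip(d, solve(cov, d)))

def chi2_sf(x, k):
    """Upper tail P(X > x) of the chi squared distribution, integer k >= 1.

    Starts from Q(1/2, z) = erfc(sqrt(z)) for odd k or Q(1, z) = exp(-z)
    for even k, then steps a up to k/2.  Intended for moderate k only:
    math.gamma overflows for large arguments.
    """
    z = x / 2.0
    if k % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    for _ in range((k - 1) // 2):
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

# Hypothetical 2D example: mean, covariance, and observation are invented.
mean = [0.0, 0.0]
cov = [[2.0, 0.5], [0.5, 1.0]]
x = [2.0, -1.0]
d2 = mahalanobis_squared(x, mean, cov)
p = chi2_sf(d2, len(x))
print(d2, p)
```

For k=2 the tail has the closed form exp(-d2/2), which makes a handy sanity check on the output.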

One can verify that the methods are equivalent for k=1.
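
A quick numerical check of that claim, as a sketch: the two-tailed p value from the Z statistic is erfc(|z|/sqrt(2)), and it should match the upper tail of the chi squared distribution with k=1 evaluated at z^2.  Here the chi squared tail is integrated directly from its PDF with Simpson's rule so that the two computations are independent; the cutoff of 80 and the step count are arbitrary choices, and the function names are made up.

```python
import math

def two_sided_z_p(z):
    """Two-tailed p value of a Z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def chi2_pdf(x, k):
    """Chi squared probability density with k degrees of freedom, x > 0."""
    return x ** (k / 2.0 - 1.0) * math.exp(-x / 2.0) / (2.0 ** (k / 2.0) * math.gamma(k / 2.0))

def simpson(f, a, b, n):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

z = 1.96
p_from_z = two_sided_z_p(z)
# Truncate the infinite tail at z^2 + 80; the remainder is negligible.
p_from_chi2 = simpson(lambda x: chi2_pdf(x, 1), z * z, z * z + 80.0, 20000)
print(p_from_z, p_from_chi2)
```

The agreement is exact in principle: for k=1 the statistic is just z^2, and z^2 exceeds a threshold exactly when |z| exceeds its square root.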

The probability density function (PDF) of the chi squared distribution evaluated at zero is infinite for 1D, 0.5 for 2D, and 0 for 3 or more dimensions.  How is this reflected in 1D, 2D, and 3D Gaussian distributions?  One would think there should be significant qualitative differences between infinite, finite but nonzero, and zero.
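
One way to look at it, offered as a sketch and not as a full answer: the chi squared variable is the squared radius r^2.  The Gaussian density at the origin is finite in every dimension, but the volume of the ball r^2 < x grows like x^(k/2), so the density of r^2 near zero behaves like x^(k/2-1).  That exponent is negative for k=1, zero for k=2, and positive for k>=3.  The snippet below just evaluates the PDF at a few small x to show the three behaviors; the sample points are arbitrary.

```python
import math

def chi2_pdf(x, k):
    """Chi squared probability density with k degrees of freedom, x > 0."""
    return x ** (k / 2.0 - 1.0) * math.exp(-x / 2.0) / (2.0 ** (k / 2.0) * math.gamma(k / 2.0))

xs = (1e-2, 1e-4, 1e-6)  # arbitrary points approaching zero
near_zero = {k: [chi2_pdf(x, k) for x in xs] for k in (1, 2, 3)}
for k, values in near_zero.items():
    # k=1 grows without bound, k=2 approaches 0.5, k=3 shrinks toward 0
    print(k, values)
```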

The peak (mode) of the chi squared PDF is at k-2 (for k >= 2) and the mean is at k.  The standard deviation is sqrt(2k).  Thus, for a k-dimensional spherically symmetric Gaussian distribution with unit covariance (identity matrix, so Mahalanobis distance = Euclidean distance), a large amount of probability mass is in a shell at radius sqrt(k-2) or sqrt(k) (let's use the latter) from the origin.  Taking the squared radius to lie within one standard deviation of its mean, the shell has width (thickness) sqrt(k+sqrt(2k))-sqrt(k-sqrt(2k)), which simplifies to sqrt(2) + O(1/k) for large k.  Mathematica:

Series[Sqrt[k+Sqrt[2k]]-Sqrt[k-Sqrt[2k]],{k,Infinity,1}]
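
The same expression can be evaluated numerically as a cross-check.  This is a small Python sketch; shell_width is a made-up name and the list of dimensions is arbitrary.  It prints the width and the scaled residual (width - sqrt(2)) * k, which should settle near 1/(2 sqrt(2)), about 0.354, if the leading correction is of order 1/k.

```python
import math

def shell_width(k):
    """sqrt(k + sqrt(2k)) - sqrt(k - sqrt(2k)); needs k >= 2 so the second root is real."""
    s = math.sqrt(2.0 * k)
    return math.sqrt(k + s) - math.sqrt(k - s)

widths = {k: shell_width(k) for k in (3, 10, 100, 1000, 10000)}
for k, w in widths.items():
    # width, and the residual from sqrt(2) scaled by k
    print(k, w, (w - math.sqrt(2.0)) * k)
```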

Surprisingly, the width approaches a constant, sqrt(2), independent of dimension.  Similar results:

https://www.johndcook.com/blog/2011/09/01/multivariate-normal-shell/

https://www.cs.cmu.edu/~venkatg/teaching/CStheory-infoage/chap1-high-dim-space.pdf
