Hopkins statistic

The Hopkins statistic (introduced by Brian Hopkins and John Gordon Skellam) is a measure of the cluster tendency of a data set.[1] It belongs to the family of sparse sampling tests. It acts as a statistical hypothesis test in which the null hypothesis is that the data are generated by a Poisson point process and are thus uniformly randomly distributed.[2] A value close to 1 suggests that the data are highly clustered, randomly distributed data tend to give values around 0.5, and uniformly distributed (regularly spaced) data tend to give values close to 0.[3]

Preliminaries

A typical formulation of the Hopkins statistic follows.[2]

Let $X$ be the set of $n$ data points.

Consider a random sample (without replacement) of $m\ll n$ data points with members $x_{i}$.

Generate a set $Y$ of $m$ uniformly randomly distributed data points.

Define two distance measures: $u_{i}$, the distance of $y_{i}\in Y$ from its nearest neighbour in $X$, and $w_{i}$, the distance of each of the $m$ sampled points $x_{i}\in X$ from its nearest neighbour in $X$ (other than $x_{i}$ itself).
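The sampling steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names `nearest_distance` and `sample_distances` are hypothetical, the points of $Y$ are drawn from the axis-aligned bounding box of $X$ (an assumption, since the text does not fix the sampling window), and distances are Euclidean.

```python
import math
import random


def nearest_distance(p, points, skip_index=None):
    """Distance from p to the closest point in `points`, optionally skipping one index."""
    best = math.inf
    for j, q in enumerate(points):
        if j == skip_index:
            continue
        best = min(best, math.dist(p, q))
    return best


def sample_distances(X, m, rng):
    """Return the lists (u, w) for a data set X given as a list of equal-length tuples."""
    d = len(X[0])
    # Assumed sampling window for Y: the bounding box of the data.
    lows = [min(x[k] for x in X) for k in range(d)]
    highs = [max(x[k] for x in X) for k in range(d)]

    # Y: m points drawn uniformly at random inside the bounding box.
    Y = [tuple(rng.uniform(lows[k], highs[k]) for k in range(d)) for _ in range(m)]
    # Indices of the m data points sampled without replacement.
    sampled = rng.sample(range(len(X)), m)

    # u_i: distance from each y_i to its nearest neighbour in X.
    u = [nearest_distance(y, X) for y in Y]
    # w_i: distance from each sampled x_i to its nearest neighbour in X,
    # skipping the point itself (otherwise every w_i would be zero).
    w = [nearest_distance(X[i], X, skip_index=i) for i in sampled]
    return u, w


rng = random.Random(0)
X = [(rng.random(), rng.random()) for _ in range(200)]
u, w = sample_distances(X, m=10, rng=rng)
```

The brute-force neighbour search costs on the order of $m \cdot n$ distance evaluations, which is acceptable for a sketch; a spatial index would be the usual choice for large $n$.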

Definition

With the above notation, if the data are $d$-dimensional, the Hopkins statistic is defined as:

$H={\frac {\sum _{i=1}^{m}{u_{i}^{d}}}{\sum _{i=1}^{m}{u_{i}^{d}}+\sum _{i=1}^{m}{w_{i}^{d}}}}$
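As a small worked example of the formula, the following sketch evaluates $H$ from distances that have already been computed. The function name `hopkins_from_distances` and the numeric distances are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def hopkins_from_distances(u, w, d):
    """Hopkins statistic from the distances u_i and w_i for d-dimensional data."""
    if len(u) != len(w):
        raise ValueError("u and w must both have m entries")
    sum_u = sum(ui ** d for ui in u)  # sum of u_i^d
    sum_w = sum(wi ** d for wi in w)  # sum of w_i^d
    return sum_u / (sum_u + sum_w)


# Hypothetical distances for m = 2 sampled points in d = 2 dimensions.
# sum u_i^2 = 0.25 + 0.16 = 0.41 and sum w_i^2 = 0.01 + 0.04 = 0.05,
# so H = 0.41 / 0.46.
H = hopkins_from_distances(u=[0.5, 0.4], w=[0.1, 0.2], d=2)
print(round(H, 3))  # prints 0.891
```

Here the uniformly generated points lie much farther from the data than the data points lie from each other, so $H$ is close to 1, which is the clustered case described in the introduction. When the two sums are equal, the statistic is exactly 0.5.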

Notes and references

[1] Hopkins, Brian; Skellam, John Gordon (1954). "A new method for determining the type of distribution of plant individuals". Annals of Botany. Annals Botany Co. 18 (2): 213–227.
[2] Banerjee, A. (2004). "Validating clusters using the Hopkins statistic". IEEE International Conference on Fuzzy Systems: 149–153. doi:10.1109/FUZZY.2004.1375706.
[3] Aggarwal, Charu C. (2015). Data Mining. Cham: Springer International Publishing. p. 158. doi:10.1007/978-3-319-14142-8. ISBN 978-3-319-14141-1.

External links

http://www.sthda.com/english/wiki/assessing-clustering-tendency-a-vital-issue-unsupervised-machine-learning

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.
