K-medians clustering

In statistics, k-medians clustering[1][2] is a cluster analysis algorithm. It is a generalization of the geometric median or 1-median algorithm, defined for a single cluster.

k-medians is a variation of k-means clustering where instead of calculating the mean for each cluster to determine its centroid, one instead calculates the median. This has the effect of minimizing error over all clusters with respect to the 2-norm distance metric, as opposed to the squared 2-norm distance metric (which k-means does).

This relates directly to the k-median problem which is the problem of finding k centers such that the clusters formed by them are the most compact with respect to the 2-norm. Formally, given a set of data points x, the k centers ci are to be chosen so as to minimize the sum of the distances from each x to the nearest ci.

The criterion function formulated in this way is sometimes a better criterion than that used in the k-means clustering algorithm, in which the sum of the squared distances is used. The sum of distances is widely used in applications such as the facility location problem.

The proposed algorithm uses Lloyd-style iteration which alternates between an expectation (E) and maximization (M) step, making this an expectation–maximization algorithm. In the E step, all objects are assigned to their nearest median. In the M step, the medians are recomputed by using the median in each single dimension.

  1. ^ A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Prentice-Hall, 1988.
  2. ^ P. S. Bradley, O. L. Mangasarian, and W. N. Street, "Clustering via Concave Minimization," in Advances in Neural Information Processing Systems, vol. 9, M. C. Mozer, M. I. Jordan, and T. Petsche, Eds. Cambridge, Massachusetts: MIT Press, 1997, pp. 368–374.