K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. The basic step of k-means clustering is simple. In the beginning we determine number of cluster K and we assume the centroid or center of these clusters. We can take any random objects as the initial centroids or
the first K objects in sequence can also serve as the initial centroids. Then the K means algorithm will do the three steps given below until convergence iterate until stable (= no object move group)
- Determine the centroid coordinate
- Determine the distance of each object to the centroids
- Group the object based on minimum distance
- With a large number of variables, K-Means may be computationally faster than hierarchical clustering (if K is small).
- K-Means may produce tighter clusters than hierarchical clustering, especially if the clusters are globular.
The K-means method as described has the following drawbacks:
- It does not do well with overlapping clusters.
- The clusters are easily pulled off-center by outliers.
- Each record is either inside or outside of a given cluster.
The basic K-means algorithm has many variations. Many commercial software tools that include automatic cluster detection incorporate some of these variations. There are several different approaches to clustering,
including agglomerative clustering, divisive clustering, and self-organizing maps.
- answered 7 years ago
- G John