Understanding k-Means Clustering

Cluster is a group of objects which have similar properties and belong to the same class.

What is Clustering?

Clustering is an unsupervised learning technique which is used to make clusters of objects i.e. it is a technique to group objects of similar kind in a group. In clustering, we first partition the set of data into groups based on the similarity and then assign the labels to those groups. Also, it helps us to find out various useful features that can help in distinguishing between different groups.

Types of Clustering

Most common categories of clustering are:-

  • Hierarchical Method
  • Density-based Method
  • Grid-based Method
  • Model-based Method

Partitioning Method

Partitioning method classifies the group of n objects into groups based on the features and similarity of data.

Hierarchical Method

The hierarchical method performs a hierarchical decomposition of the given set of data objects. It starts by considering every data point as a separate cluster and then iteratively identifies two clusters which can be closest together and then merge these two clusters into one. We continue this until all the clusters are merged together into a single big cluster. A diagram called Dendrogram is used to represent this hierarchy.

  • Divisive Approach

Density-based Method

Density-based method is based on the idea that continue growing the cluster as long as the density in the neighbourhood exceeds some threshold, i.e., for each data point within a given cluster, the radius of a given cluster has to contain at least a minimum number of points.

Model-based methods

In the model based method, a model is hypothesized for each cluster to find the best fit of data for a given model. This method also helps in determining the number of clusters based on standard statistics (taking noise into consideration).

Writing K-means clustering code in Python from scratch

The basic idea behind the k-means clustering is to form the cluster based on the similarities between the attributes. First, we randomly choose two centroids for two clusters. Then we do iteration in which we calculate distance of an object point from chosen the centroids and assign that point to the cluster from which it has the least distance. Then we calculate the new centroid by considering the new added data point.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({    
'x': [1,2,4,5],
'y': [1,1,3,4]
})
#select the number of clusters
k=2
#initialize the centroids
centroids = [[df.x[0],df.y[0]],[df.x[1],df.y[1]]]
def assignment(df, centroids):    
df['distance_from_1'] = (np.sqrt((df['x'] - centroids[0][0]) ** 2 + (df['y'] - centroids[0][1]) ** 2))
df['distance_from_2'] = (np.sqrt((df['x'] - centroids[1][0]) ** 2 + (df['y'] - centroids[1][1]) ** 2))
centroid_distance_cols = ['distance_from_{}'.format(i) for i in (1,2)]
df['closest'] = df.loc[:, centroid_distance_cols].idxmin(axis=1)
df['closest'] = df['closest'].map(lambda x: int(x.lstrip('distance_from_')))
df['color'] = df['closest'].map(lambda x: colmap[x])
return df
#assign different color to the cluster points
colmap = {1: 'r', 2: 'g'}
#assign the cluster to data points based on the
#distance from centroids of both clusters
df = assignment(df, centroids)
print(df.head())
def update(centroids):    
for i in (0,k-1):
centroids[i][0] = np.mean(df[df['closest'] == (i+1)]['x'])
centroids[i][1] = np.mean(df[df['closest'] == (i+1)]['y'])
return centroids
def deepcopy(x):
w, h = 2, 2;
new_centroids = [[0 for a in range(w)] for y in range(h)]
for i in (0,1):
new_centroids.append(x[i])
return x
def equals(x,y):
for i in (0,k-1):
if(x[i]!=y[i]):
return 0
return 1
#copy the centroids
old_centroids = deepcopy(centroids)
#find the new centroids of the clusters
centroids = update(centroids)
df = assignment(df, centroids)#keep on updating the centroid unless we get constant result
while True:
closest_centroids = deepcopy(df['closest'])
centroids = update(centroids)
df = assignment(df, centroids)
if equals(df['closest'],closest_centroids)==1:
break
fig = plt.figure(figsize=(5, 5))
plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
for i in (0,k-1):
plt.scatter(*centroids[i], color=colmap[i+1])
print(df)
print(centroids)
plt.xlim(0, 7)
plt.ylim(0, 5)
plt.show()

Wrapping up

K-means clustering is widely used technique for data cluster analysis. However, its performance is usually not as competitive as those of the other clustering techniques because slight variations in the data could lead to high variance.

Machine Learning Enthusiast | Software Developer