Understanding k-Means Clustering
A cluster is a group of objects that share similar properties and belong to the same class.
What is Clustering?
Clustering is an unsupervised learning technique used to group objects of a similar kind together. In clustering, we first partition the data set into groups based on similarity and then assign labels to those groups. Clustering also helps us identify features that distinguish the different groups from one another.
Types of Clustering
The most common categories of clustering are:
- Partitioning Method
- Hierarchical Method
- Density-based Method
- Grid-based Method
- Model-based Method
Partitioning Method
The partitioning method divides a set of n objects into groups based on the features and similarity of the data.
The general problem is as follows: given ‘n’ objects, we need to construct ‘k’ partitions of the data objects, where each partition represents a cluster and contains at least one object. There is an additional condition that each object may belong to exactly one group.
The partitioning method starts by creating an initial random partitioning. It then iterates to improve the partitioning by moving objects from one partition to another.
k-Means clustering follows the partitioning approach to cluster the data.
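Before building it from scratch later in this article, here is a minimal sketch of the partitioning approach using scikit-learn's KMeans; the library and the tiny data set are assumptions for illustration, not part of the original walkthrough.

import numpy as np
from sklearn.cluster import KMeans

# a tiny illustrative data set: four points in two dimensions
X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])

# partition the points into k = 2 clusters
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster index assigned to each point
print(km.cluster_centers_)  # the centroid of each partition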
Hierarchical Method
The hierarchical method performs a hierarchical decomposition of the given set of data objects. In its bottom-up form, it starts by treating every data point as a separate cluster, then iteratively identifies the two clusters that are closest together and merges them into one. This continues until all the clusters are merged into a single big cluster. A diagram called a dendrogram is used to represent this hierarchy.
There are two approaches depending on how we create the hierarchy −
- Agglomerative Approach
- Divisive Approach
Agglomerative Approach
The agglomerative approach is a hierarchical method that uses a bottom-up strategy. We start with each object as a separate cluster and keep merging the clusters that are close to one another. This continues until all of the groups are merged into one or until the termination condition holds.
Divisive Approach
The divisive approach is a hierarchical method that uses a top-down strategy. We start by considering all objects as one cluster. In each iteration, a cluster is split up into smaller clusters. This goes on until each object is in its own cluster or the termination condition holds. Hierarchical methods are rigid, i.e., once a merging or splitting is done, it can never be undone.
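As a rough sketch of the agglomerative approach, the snippet below merges the closest clusters bottom-up and draws the resulting dendrogram. SciPy is assumed to be available, and the data set is made up for illustration.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])

# 'single' linkage merges, at every step, the two clusters whose
# closest members are nearest to each other (bottom-up strategy)
Z = linkage(X, method='single')

# the dendrogram records the order and distance of every merge
dendrogram(Z)
plt.show()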
Density-based Method
The density-based method is based on the idea of growing a cluster as long as the density in its neighbourhood exceeds some threshold, i.e., the neighbourhood of a given radius around each data point within a cluster has to contain at least a minimum number of points.
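A hedged sketch of this idea using scikit-learn's DBSCAN (an assumed dependency, with invented sample points): eps is the neighbourhood radius and min_samples is the minimum number of points that the neighbourhood must contain.

import numpy as np
from sklearn.cluster import DBSCAN

# two dense groups plus one far-away point
X = np.array([[1, 1], [2, 1], [4, 3], [5, 4], [20, 20]])

# grow clusters only where the neighbourhood of radius eps
# contains at least min_samples points
labels = DBSCAN(eps=2.0, min_samples=2).fit_predict(X)
print(labels)  # the isolated point (20, 20) is labelled -1, i.e. noise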
Model-based Method
In the model-based method, a model is hypothesized for each cluster, and we find the best fit of the data to the given model. This method also helps in determining the number of clusters based on standard statistics (taking noise into consideration).
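As an illustrative sketch (scikit-learn and the toy data are assumptions here), a Gaussian mixture can be fitted for several candidate cluster counts and the fits compared with a standard statistic such as the Bayesian Information Criterion.

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])

# fit one Gaussian component per hypothesized cluster and
# compare the models; a lower BIC suggests a better fit
for n in (1, 2, 3):
    gm = GaussianMixture(n_components=n, random_state=0).fit(X)
    print(n, gm.bic(X))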
Writing k-Means Clustering Code in Python from Scratch
The basic idea behind k-means clustering is to form clusters based on similarity between the attributes. First, we randomly choose two centroids for two clusters. Then we iterate: we calculate the distance of each data point from the chosen centroids and assign the point to the cluster whose centroid is nearest. Then we recalculate each centroid as the mean of the points assigned to its cluster.
In this code, I am using a basic data set with 4 data points and 2 features.
Let’s start the code by importing the relevant libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Pandas is a Python library for data manipulation and analysis. It offers operations for manipulating numerical tables and time series. NumPy is another Python library which provides multi-dimensional array and matrix data structures; it can be utilised to perform a number of mathematical operations on arrays, such as trigonometric, statistical, and algebraic routines. matplotlib.pyplot is used to plot different types of visual analyses of the data, such as column graphs, scatter plots, etc.
df = pd.DataFrame({
    'x': [1, 2, 4, 5],
    'y': [1, 1, 3, 4]
})

# select the number of clusters
k = 2

# initialize the centroids
centroids = [[df.x[0], df.y[0]], [df.x[1], df.y[1]]]
In the above code segment, we created a data frame with two features and 4 data points. Then we initialized the number of clusters we need; k must be less than or equal to the number of data points in the data set. Finally, we initialized the centroids. In this case, we have taken (1, 1) and (2, 1) as the centroids of the two clusters.
# assign a different color to the points of each cluster
colmap = {1: 'r', 2: 'g'}

def assignment(df, centroids):
    # distance of every point from each centroid
    df['distance_from_1'] = np.sqrt((df['x'] - centroids[0][0]) ** 2 + (df['y'] - centroids[0][1]) ** 2)
    df['distance_from_2'] = np.sqrt((df['x'] - centroids[1][0]) ** 2 + (df['y'] - centroids[1][1]) ** 2)
    centroid_distance_cols = ['distance_from_{}'.format(i) for i in (1, 2)]
    # pick the cluster whose centroid is nearest
    df['closest'] = df.loc[:, centroid_distance_cols].idxmin(axis=1)
    df['closest'] = df['closest'].map(lambda x: int(x.split('_')[-1]))
    df['color'] = df['closest'].map(lambda x: colmap[x])
    return df

# assign each data point to a cluster based on its
# distance from the centroids of both clusters
df = assignment(df, centroids)
print(df.head())
In the above code snippet, we calculate the distance of each point from each centroid and then assign the data point, along with a color, to the cluster whose centroid it is closer to. Next, we update each centroid to the mean of the points assigned to its cluster.
def update(centroids):
    # move each centroid to the mean of the points assigned to its cluster
    for i in range(k):
        centroids[i][0] = np.mean(df[df['closest'] == (i + 1)]['x'])
        centroids[i][1] = np.mean(df[df['closest'] == (i + 1)]['y'])
    return centroids

# find the new centroids of the clusters and reassign the points
centroids = update(centroids)
df = assignment(df, centroids)

# keep updating the centroids until the assignments no longer change
while True:
    closest_centroids = df['closest'].copy()
    centroids = update(centroids)
    df = assignment(df, centroids)
    if df['closest'].equals(closest_centroids):
        break

print(df)
print(centroids)

# plot the clustered points together with the final centroids
fig = plt.figure(figsize=(5, 5))
plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
for i in range(k):
    plt.scatter(*centroids[i], color=colmap[i + 1])
plt.xlim(0, 7)
plt.ylim(0, 5)
plt.show()
Output: the algorithm converges after a couple of iterations, with (1, 1) and (2, 1) assigned to one cluster and (4, 3) and (5, 4) to the other. The final centroids are (1.5, 1.0) and (4.5, 3.5), which the plot shows alongside the colored data points.
Wrapping up
k-Means clustering is a widely used technique for cluster analysis. However, its performance is often not as competitive as that of other clustering techniques, because slight variations in the data can lead to high variance in the results.
Also, clusters in the k-means method are assumed to be spherical and evenly sized, which may reduce the accuracy of the results.