Understanding k-Means Clustering

What is Clustering?

A cluster is a group of objects that have similar properties and belong to the same class. Clustering is the task of dividing a set of data objects into such groups, so that objects in the same cluster are more similar to one another than to objects in other clusters.

Types of Clustering

  • Partitioning Method
  • Hierarchical Method
  • Density-based Method
  • Grid-based Method
  • Model-based Method

Partitioning Method

In the general problem, we have ‘n’ objects and need to construct ‘k’ partitions of the data, where each partition represents a cluster and contains at least one object. There is an additional condition: each object can belong to only one partition.

The partitioning method starts by creating an initial random partitioning. It then iterates to improve the partitioning by moving objects from one partition to another.

k-Means clustering follows the partitioning approach to cluster the data.
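Before writing it from scratch later in this article, here is a minimal sketch of the partitioning idea using scikit-learn's KMeans on the same four points the from-scratch code will use (scikit-learn is an assumption here; the from-scratch code does not depend on it):

from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])
# partition the 4 objects into k = 2 clusters
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)           # cluster index of each object
print(model.cluster_centers_)  # one centroid per partition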

Hierarchical Method

There are two approaches depending on how we create the hierarchy −

  • Agglomerative Approach
  • Divisive Approach

Agglomerative Approach

The agglomerative approach is a type of hierarchical method that uses a bottom-up strategy. We start with each object as a separate cluster and keep merging the clusters that are close to one another. This continues until all of the groups are merged into one or until the termination condition holds.
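A minimal sketch of the bottom-up strategy, again assuming scikit-learn:

from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])
# start from 4 singleton clusters and merge until 2 remain
agg = AgglomerativeClustering(n_clusters=2, linkage='single').fit(X)
print(agg.labels_)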

Divisive Approach

The divisive approach is a type of hierarchical method that uses a top-down strategy. We start by considering all objects as one cluster. In each successive iteration, a cluster is split into smaller clusters. This continues until each object is in its own cluster or until the termination condition holds. Hierarchical methods are rigid, i.e., once a merging or splitting is done, it can never be undone.
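One concrete divisive algorithm is bisecting k-means, which repeatedly splits a cluster in two. A minimal sketch, assuming a recent scikit-learn version (1.1 or later) that provides BisectingKMeans:

from sklearn.cluster import BisectingKMeans
import numpy as np

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])
# start from one cluster containing all objects and split top-down
bisect = BisectingKMeans(n_clusters=2, random_state=0).fit(X)
print(bisect.labels_)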

Density-based Method

Density-based methods grow a cluster as long as the density (the number of objects) in its neighborhood exceeds a threshold, which lets them discover clusters of arbitrary shape; DBSCAN is a well-known example.
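A minimal sketch, assuming scikit-learn (the eps and min_samples values below are illustrative choices for this tiny data set):

from sklearn.cluster import DBSCAN
import numpy as np

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])
# points with at least min_samples neighbours within eps form dense regions
db = DBSCAN(eps=1.5, min_samples=2).fit(X)
print(db.labels_)  # -1 would mark noise points that belong to no cluster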

Model-based Method

Model-based methods hypothesize a statistical model for each cluster and find the best fit of the data to that model; Gaussian mixture models fitted with the EM algorithm are a common example.
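A minimal sketch, assuming scikit-learn's GaussianMixture:

from sklearn.mixture import GaussianMixture
import numpy as np

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])
# fit one Gaussian component per cluster and assign by highest likelihood
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict(X))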

Writing k-Means clustering code in Python from scratch

In this code, I am choosing a basic data set with 4 data points and 2 features.

Let’s start the code by importing the relevant libraries.

import copy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Pandas is a Python library for data manipulation and analysis; it offers operations for manipulating numerical tables and time series. NumPy is another Python library that provides multi-dimensional array and matrix data structures and can perform a number of mathematical operations on arrays, such as trigonometric, statistical, and algebraic routines. Matplotlib.pyplot is used to plot different kinds of visual analyses of the data, such as column graphs and scatter plots. The standard-library copy module is used later to make independent copies of the centroids.

df = pd.DataFrame({
    'x': [1, 2, 4, 5],
    'y': [1, 1, 3, 4]
})
# select the number of clusters
k = 2
# initialize the centroids with the first two data points
centroids = [[df.x[0], df.y[0]], [df.x[1], df.y[1]]]

In the above code segment we have created a small data frame with two features and 4 data points. Then we set the number of clusters we need, which must be less than or equal to the number of data points in the data set. Finally, we initialize the centroids. In this case, we have taken (1,1) and (2,1), the first two data points, as the centroids of the two clusters.
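Using the first two points keeps the example deterministic. A common alternative, shown here as a sketch that is not part of the code above (it reuses the df, k, and np defined there), is to pick k random data points as starting centroids:

rng = np.random.default_rng(seed=0)
# pick k distinct rows of the data frame as starting centroids
idx = rng.choice(len(df), size=k, replace=False)
centroids = [[df.x[i], df.y[i]] for i in idx]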

# assign a different color to each cluster's points
colmap = {1: 'r', 2: 'g'}

def assignment(df, centroids):
    # Euclidean distance of every point from each centroid
    df['distance_from_1'] = np.sqrt((df['x'] - centroids[0][0]) ** 2 + (df['y'] - centroids[0][1]) ** 2)
    df['distance_from_2'] = np.sqrt((df['x'] - centroids[1][0]) ** 2 + (df['y'] - centroids[1][1]) ** 2)
    centroid_distance_cols = ['distance_from_{}'.format(i) for i in (1, 2)]
    # each point joins the cluster whose centroid is nearest
    df['closest'] = df.loc[:, centroid_distance_cols].idxmin(axis=1)
    df['closest'] = df['closest'].map(lambda x: int(x.split('_')[-1]))
    df['color'] = df['closest'].map(lambda x: colmap[x])
    return df

# assign each data point to a cluster based on its
# distance from the centroids of both clusters
df = assignment(df, centroids)
print(df.head())
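For this data set, the printed frame should look like this (distances rounded to two decimals):

   x  y  distance_from_1  distance_from_2  closest color
0  1  1             0.00             1.00        1     r
1  2  1             1.00             0.00        2     g
2  4  3             3.61             2.83        2     g
3  5  4             5.00             4.24        2     g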

In the above code snippet, we calculate the distance of each point from both centroids and then assign the point (and a color) to the cluster whose centroid it is closer to. Next, we update each centroid to the mean of the points assigned to its cluster.
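For example, the point (4,3) lies at distance √((4-1)² + (3-1)²) = √13 ≈ 3.61 from centroid (1,1) and √((4-2)² + (3-1)²) = √8 ≈ 2.83 from centroid (2,1), so it is assigned to cluster 2.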

def update(centroids):
    # move each centroid to the mean of the points assigned to it
    for i in range(k):
        centroids[i][0] = np.mean(df[df['closest'] == (i + 1)]['x'])
        centroids[i][1] = np.mean(df[df['closest'] == (i + 1)]['y'])
    return centroids

def equals(x, y):
    # compare two assignment sequences element by element
    return int((np.array(x) == np.array(y)).all())

# copy the centroids (copy.deepcopy gives an independent copy,
# so later updates do not change old_centroids)
old_centroids = copy.deepcopy(centroids)
# find the new centroids of the clusters
centroids = update(centroids)
df = assignment(df, centroids)
# keep updating the centroids until the assignments stop changing
while True:
    closest_centroids = df['closest'].copy()
    centroids = update(centroids)
    df = assignment(df, centroids)
    if equals(df['closest'], closest_centroids) == 1:
        break

fig = plt.figure(figsize=(5, 5))
plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
for i in range(k):
    plt.scatter(*centroids[i], color=colmap[i + 1])
print(df)
print(centroids)
plt.xlim(0, 7)
plt.ylim(0, 5)
plt.show()

Output: the script prints the final data frame and centroids and shows a scatter plot of the two colored clusters. For this data set the final clusters are {(1,1), (2,1)} and {(4,3), (5,4)}, with centroids (1.5, 1.0) and (4.5, 3.5).

Wrapping up

k-Means is a simple and fast partitioning method, but it comes with caveats: the result depends on the initial centroids, and clusters are assumed to be spherical and evenly sized, which may reduce the accuracy of the results.
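To see the spherical assumption bite, here is a minimal sketch (assuming scikit-learn) on two crescent-shaped clusters:

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# two interleaving crescents: non-spherical clusters
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# k-means cuts the crescents with a straight boundary, so the
# agreement with the true clusters stays well below 1.0
print(adjusted_rand_score(y, labels))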
