Understanding K-Means Clustering with Real Use-Case in Security Domain

Varun
3 min read · Jul 19, 2021

What is K-Means Clustering?

K-Means clustering is an unsupervised learning algorithm that groups an unlabeled dataset into clusters. It is widely used to solve clustering problems in machine learning and data science.

Here, K is the number of clusters to create, and it must be chosen before the algorithm runs.

For example, if K=2 there will be two clusters, if K=3 there will be three clusters, and so on.

In simple language:

“It is an iterative algorithm that divides the unlabeled dataset into K different clusters in such a way that each data point belongs to only one group, made up of points with similar properties.”

It lets us group the data and discover the categories in an unlabeled dataset on its own, without the need for any labeled training data.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of the algorithm is to minimize the sum of squared distances between the data points and the centroids of their corresponding clusters.
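
To make the idea of a centroid concrete, here is a minimal sketch that computes the centroid of one cluster as the coordinate-wise mean of its points and compares the sum of squared distances measured from the centroid with the same sum measured from another point. The four sample points and the helper names `centroid` and `sum_squared_distances` are illustrative assumptions, not part of any library.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def sum_squared_distances(points, center):
    """Sum of squared distances from every point to `center`."""
    return sum(squared_distance(p, center) for p in points)

# Made-up cluster: the four corners of a unit square.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]

c = centroid(cluster)
cost_at_centroid = sum_squared_distances(cluster, c)
cost_elsewhere = sum_squared_distances(cluster, (1.0, 1.0))

print("centroid:", c)
print("cost measured from the centroid:", cost_at_centroid)
print("cost measured from (1.0, 1.0):", cost_elsewhere)
```

The mean gives a lower cost than the corner point, which is why K-Means moves each centroid to the mean of the points assigned to it.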

How Does This Algorithm Work?

The algorithm takes the unlabeled dataset as input, divides it into K clusters, and repeats the process until it finds the best clusters, that is, until the cluster assignments stop changing. The value of K must be determined in advance.

The k-means clustering algorithm mainly performs two tasks:

  • Determines the best positions for the K center points, or centroids, through an iterative process.
  • Assigns each data point to its closest centroid. The data points nearest to a particular centroid form a cluster.

Hence, each cluster contains data points with some commonalities and is separated from the other clusters.
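
The two tasks above can be sketched in plain Python as a loop that alternates between assigning points and updating centroids. This is a minimal illustration under stated assumptions: the initial centroids are K data points picked at random, distance is Euclidean, and the sample points and the function name `k_means` are made up for the example. It is not a production implementation.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iterations=100, seed=0):
    """Return (centroids, labels) after alternating assignment and update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random data points
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return centroids, labels

# Made-up data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = k_means(points, k=2)
print("centroids:", centroids)
print("labels:", labels)
```

Which group receives label 0 depends on the random initial centroids, so only the grouping itself is meaningful, not the label numbers.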

The diagram below shows how the K-Means clustering algorithm works:

The performance of the K-Means clustering algorithm depends on the quality of the clusters it forms, and choosing the optimal number of clusters is a big task. There are several ways to find the optimal number of clusters; one of the most widely used methods for choosing the value of K is given below:

Elbow Method

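This is one of the most popular ways to find the optimal number of clusters. The method uses the WCSS value. WCSS stands for Within-Cluster Sum of Squares, which measures the total variation within the clusters. WCSS is computed for a range of K values and plotted against K; the point where the curve bends and further increases in K bring only small reductions (the "elbow") is taken as the number of clusters.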

Formula:

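$$
\mathrm{WCSS} = \sum_{P_i \in \text{Cluster}_1} \operatorname{distance}(P_i, C_1)^2 + \sum_{P_i \in \text{Cluster}_2} \operatorname{distance}(P_i, C_2)^2 + \sum_{P_i \in \text{Cluster}_3} \operatorname{distance}(P_i, C_3)^2
$$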
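
As a rough illustration of how WCSS is used, the sketch below runs a compact K-Means helper for K = 1 to 5 on a small made-up dataset and prints the WCSS for each K. The data, the helper names `k_means` and `wcss`, and the random initialization are assumptions made for this example; real libraries usually add several restarts and smarter initialization.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, iterations=50, seed=0):
    """Compact K-Means helper for this example; returns (centroids, labels)."""
    centroids = random.Random(seed).sample(points, k)
    labels = []
    for _ in range(iterations):
        # Assign every point to its closest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Move every centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

def wcss(points, centroids, labels):
    """Sum over all points of the squared distance to the assigned centroid."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))

# Made-up data: three loose groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (1, 9), (2, 8)]

wcss_by_k = {}
for k in range(1, 6):
    centroids, labels = k_means(points, k)
    wcss_by_k[k] = wcss(points, centroids, labels)

for k, value in wcss_by_k.items():
    print(f"K={k}: WCSS={value:.2f}")
```

Plotting these values against K and looking for the bend is the elbow method. Because the initial centroids are random, a single run can land in a poor local optimum, so the curve from one run should be read with some caution.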

Using K-Means Clustering for Cyber Security Analytics

With advances in technology and the growing number of digital sources, the quantity of data increases every day, and with it the quantity of cyber security related data. Traditional security systems such as Intrusion Detection Systems (IDS) are not capable of handling such a growing amount of data in real time. Cyber security analytics is an alternative to such traditional security systems: it uses big data analytics techniques to provide a faster and more scalable framework for handling large amounts of cyber security related data in real time.

K-Means clustering is one of the commonly used clustering algorithms in cyber security analytics. It divides security related data into groups of similar entities, which in turn can help in gaining important insights about known and unknown attack patterns.
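
One way such clusters might be used in practice is sketched below: once centroids have been learned, each new event is matched to its nearest cluster, and events that are far from every centroid are flagged for review as possibly unknown patterns. Everything here is hypothetical and not taken from a real system: the centroid values, the three normalized features (connection duration, bytes transferred, failed logins), the threshold, and the function name `triage`.

```python
from math import dist

# Hypothetical centroids, as if learned earlier by K-Means from normalized
# features (duration, bytes transferred, failed logins). Values are made up.
CENTROIDS = {
    "cluster_0": (0.1, 0.2, 0.0),
    "cluster_1": (0.8, 0.9, 0.1),
}

# Hypothetical cut-off: events farther than this from every centroid are flagged.
DISTANCE_THRESHOLD = 0.5

def triage(record):
    """Return (nearest cluster name, distance to it, flagged-for-review?)."""
    name, center = min(CENTROIDS.items(), key=lambda item: dist(record, item[1]))
    distance = dist(record, center)
    return name, distance, distance > DISTANCE_THRESHOLD

# Made-up events: two resemble existing clusters, one does not.
events = [(0.12, 0.18, 0.0), (0.75, 0.95, 0.1), (0.2, 0.3, 0.95)]
results = [triage(event) for event in events]

for event, (name, distance, flagged) in zip(events, results):
    status = "REVIEW" if flagged else "ok"
    print(f"{event} -> {name} (distance {distance:.2f}) {status}")
```

The threshold is a design choice that has to be tuned on real data; a value that is too low floods the analyst with alerts, while a value that is too high hides unusual events inside existing clusters.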

This technique lets a security analyst focus the analysis on the data in specific clusters only. To improve performance, K-Means can exploit the triangle inequality to skip many point-to-center distance computations without affecting the clustering results.
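
The triangle inequality trick can be sketched as follows. If the distance between the current best centroid c and another centroid c' is at least twice the distance from the point to c, then c' cannot be closer than c, so the distance to c' never needs to be computed. The code below is a simplified illustration of the assignment step only, with made-up data and hypothetical function names; full accelerated variants such as Elkan's algorithm also keep distance bounds from one iteration to the next.

```python
from math import dist

def assign_brute_force(points, centroids):
    """Reference assignment: compute every point-to-centroid distance."""
    return [
        min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
        for p in points
    ]

def assign_with_pruning(points, centroids):
    """Assignment that skips centroids ruled out by the triangle inequality."""
    k = len(centroids)
    # Centroid-to-centroid distances are computed once and reused for all points.
    center_dist = [[dist(centroids[a], centroids[b]) for b in range(k)]
                   for a in range(k)]
    labels, skipped = [], 0
    for p in points:
        best, best_d = 0, dist(p, centroids[0])
        for j in range(1, k):
            # d(p, c_j) >= d(c_best, c_j) - d(p, c_best), so when
            # d(c_best, c_j) >= 2 * d(p, c_best) the centroid c_j
            # cannot be strictly closer than the current best.
            if center_dist[best][j] >= 2 * best_d:
                skipped += 1
                continue
            d = dist(p, centroids[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped

# Made-up points and centroids for the demonstration.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids = [(0.5, 0.5), (10.5, 10.5), (0.5, 10.5)]

labels, skipped = assign_with_pruning(points, centroids)
print("labels:", labels)
print("distance computations skipped:", skipped)
print("same as brute force:", labels == assign_brute_force(points, centroids))
```

The saving grows with the number of clusters and with how well separated they are, since more centroids can be ruled out without measuring the distance to them.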

I hope this information has added some value to your knowledge!

THANKS FOR READING …

HOPE YOU LIKE IT !!!
