DBSCAN algorithm in Python

In this tutorial, we will learn how we can implement and use the DBSCAN algorithm in Python.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that was first proposed in 1996. In 2014, it received the 'Test of Time' award at the data mining conference KDD. This tutorial does not explain the algorithm in depth; it focuses on implementing DBSCAN in Python. To follow the implementation, however, we need at least a basic idea of how the algorithm works. Therefore, if you don't know what the DBSCAN algorithm is or how it works, it is advisable to learn about it first.
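
As a quick refresher, the following minimal sketch shows the basic idea on a tiny, made-up set of points (the points and the parameter values are assumptions chosen only for illustration). DBSCAN groups points that have enough neighbours within a given radius and marks isolated points as noise with the label -1:

```python
import numpy as nmp
from sklearn.cluster import DBSCAN

# Two tight groups of three points each, plus one isolated point (toy data).
points = nmp.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],
    [10.0, 10.0],
])

# eps is the neighbourhood radius; min_samples is how many points
# (counting the point itself) must lie within eps to form a core point.
model = DBSCAN(eps=0.5, min_samples=3).fit(points)

# One cluster index per point; -1 marks noise.
print(model.labels_)
# Indices of the points that DBSCAN treats as core points.
print(model.core_sample_indices_)
```

Here the two tight groups become clusters 0 and 1, and the isolated point is labelled -1.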

Implementation of DBSCAN algorithm in Python

In this section, we will implement the DBSCAN algorithm step by step so that it is easy to understand and learn. We will use a dataset throughout the implementation and perform various operations on it, including the DBSCAN clustering itself. Before we start, we should fulfil the prerequisites for implementing the DBSCAN algorithm in a Python program.

Prerequisites for implementation of DBSCAN algorithm:

We have to fulfil the following prerequisites before we proceed with the implementation part of the DBSCAN algorithm in this section:

1. Numpy library: We should make sure that a recent version of the numpy library is installed in our system, as we are going to use numpy functions on the dataset during the implementation. If the numpy library is not present in our system, we can use the following command in the command prompt or terminal to install it with the pip installer:

pip install numpy

When we press the Enter key, pip starts installing the numpy library in our system.

After some time, we will see that the numpy library has been installed successfully (if numpy is already present, pip reports that the requirement is already satisfied).

2. Pandas library: Like the numpy library, the pandas library is also required. If it is not present in our system, we can use the following command in the command prompt or terminal to install it with the pip installer:

pip install pandas

3. matplotlib library: It is also an important library in the implementation of the DBSCAN algorithm, as its functions will help us display the results from the dataset. If the matplotlib library is not present in our system, we can use the following command in the command prompt or terminal to install it with the pip installer:

pip install matplotlib

4. Sklearn library: The Sklearn (scikit-learn) library is one of the major requirements for implementing the DBSCAN algorithm, as we have to import several of its modules in the program, such as cluster, preprocessing and decomposition. Therefore, we should make sure that the Sklearn library is present in our system. If it is not, we can use the following command in the command prompt or terminal to install it with the pip installer (the package is published on pip under the name scikit-learn):

pip install scikit-learn

5. Last but not least, we should also be aware of the DBSCAN algorithm (what it is and how it works), as we have discussed already, so that we can easily understand the implementation of it in Python.

Before we move forward, we should make sure that we have fulfilled all the prerequisites that we have listed down above so that we don't have to face any problems while following the implementation steps.
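
One way to check the prerequisites before moving on is a small helper like the sketch below. The helper name missing_modules and the list REQUIRED_MODULES are hypothetical and not part of any library; the check only uses the standard library function importlib.util.find_spec():

```python
import importlib.util

# Import names (not pip names): scikit-learn is imported as "sklearn".
REQUIRED_MODULES = ["numpy", "pandas", "matplotlib", "sklearn"]


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

If any module is reported as missing, we can install it with the corresponding pip command before continuing.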

Implementation steps for the DBSCAN algorithm:

Now, we will implement the DBSCAN algorithm in Python. As mentioned earlier, we will do this in steps so that the implementation does not become complex and we can understand it easily. We have to follow these steps to implement the DBSCAN algorithm and its logic in a Python program:

Step 1: Importing all the required libraries:

First and foremost, we have to import all the required libraries which we have installed in the prerequisites part so that we can use their functions while implementing the DBSCAN algorithm.

Here, we import all the required libraries and the modules we need from them:

# Importing numpy library as nmp
import numpy as nmp
# Importing pandas library as pds
import pandas as pds
# Importing matplotlib library as pplt
import matplotlib.pyplot as pplt
# Importing DBSCAN from cluster module of Sklearn library
from sklearn.cluster import DBSCAN
# Importing StandardScaler and normalize from preprocessing module of Sklearn library
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import normalize
# Importing PCA from decomposition module of Sklearn
from sklearn.decomposition import PCA

Step 2: Loading the Data:

In this step, we load the dataset on which the DBSCAN algorithm will work. We use the read_csv() function of the pandas library to read the file, drop the CUST_ID column, fill the missing values, and print the first rows of the dataset, as we have done below:

# Loading the data inside an initialized variable
M = pds.read_csv('sampleDataset.csv') # Path of dataset file
# Dropping the CUST_ID column from the dataset with drop() function
M = M.drop('CUST_ID', axis = 1)
# Using ffill() to fill missing values with the previous row's value
# (fillna(method='ffill') is deprecated in recent pandas versions)
M = M.ffill()
# Printing dataset head in output
print(M.head())

Output:

       BALANCE  BALANCE_FREQUENCY  ...  PRC_FULL_PAYMENT  TENURE
0    40.900749           0.818182  ...          0.000000      12
1  3202.467416           0.909091  ...          0.222222      12
2  2495.148862           1.000000  ...          0.000000      12
3  1666.670542           0.636364  ...          0.000000      12
4   817.714335           1.000000  ...          0.000000      12

[5 rows x 17 columns]

The data as given in the output above will be printed when we run the program, and we will work on this data from the dataset file we loaded.
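
Since the file sampleDataset.csv is not included here, the cleaning step can be tried on a small made-up DataFrame. In the sketch below, the values are invented for illustration and only the column names CUST_ID, BALANCE and TENURE are taken from the output above:

```python
import numpy as nmp
import pandas as pds

# A tiny stand-in for the real dataset, with two missing values.
toy = pds.DataFrame({
    "CUST_ID": ["C1", "C2", "C3"],
    "BALANCE": [40.9, nmp.nan, 2495.1],
    "TENURE": [12, 12, nmp.nan],
})

# Drop the identifier column, then forward-fill: each missing value
# is replaced by the last valid value above it in the same column.
cleaned = toy.drop("CUST_ID", axis=1).ffill()
print(cleaned)
```

Note that forward filling cannot fill a missing value in the very first row, because there is no earlier row to copy from.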

Step 3: Preprocessing the data:

Now, we will preprocess the data by using the functions of the preprocessing module of the Sklearn library. We first standardize each column with StandardScaler and then rescale each row to unit length with normalize(), as shown below:

# Initializing a variable with the StandardScaler() class
scalerFD = StandardScaler()
# Standardizing each column to zero mean and unit variance
M_scaled = scalerFD.fit_transform(M)
# Normalizing the scaled data with the normalize() function
# so that each row (sample) has unit length
M_normalized = normalize(M_scaled)
# Converting the resulting numpy array into a pandas DataFrame
M_normalized = pds.DataFrame(M_normalized)
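
To see what the two preprocessing calls do, here is a minimal sketch on a made-up 3 x 2 array (the numbers are assumptions for illustration only). StandardScaler works column by column, while normalize() works row by row:

```python
import numpy as nmp
from sklearn.preprocessing import StandardScaler, normalize

# Toy data: 3 samples (rows) with 2 features (columns).
data = nmp.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

# After scaling, every column has mean 0 and standard deviation 1.
scaled = StandardScaler().fit_transform(data)
print(scaled.mean(axis=0))
print(scaled.std(axis=0))

# After normalizing (L2 norm by default), every row has length 1.
normalized = normalize(scaled)
print(nmp.linalg.norm(normalized, axis=1))
```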

Step 4: Reduce the dimensionality of the data:

In this step, we will reduce the dimensionality of the scaled and normalized data to two components so that it can be plotted easily. We use the PCA class in the following way to transform the data and reduce its dimensionality:

# Initializing a variable with the PCA() function
pcaFD = PCA(n_components = 2) # components of data
# Transforming the normalized data with PCA
M_principal = pcaFD.fit_transform(M_normalized)
# Making dataframes from the transformed data
M_principal = pds.DataFrame(M_principal)
# Naming the two columns of the transformed data
M_principal.columns = ['C1', 'C2']
# Printing the head of the transformed data
print(M_principal.head())

Output:

         C1        C2
0 -0.489949 -0.679976
1 -0.519099  0.544828
2  0.330633  0.268877
3 -0.481656 -0.097610
4 -0.563512 -0.482506

As we can see in the output, PCA has transformed the normalized data into two components, which appear as the two columns C1 and C2. After that, we made a DataFrame from the transformed data using the DataFrame() constructor of the pandas library.
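
Reducing the data to two components discards some information. A minimal sketch of how to check how much variance the two components keep is shown below; the random data is an assumption used only for illustration, while explained_variance_ratio_ is an attribute of a fitted PCA object:

```python
import numpy as nmp
from sklearn.decomposition import PCA

# Made-up data: 200 samples with 5 features, two of them strongly correlated.
rng = nmp.random.default_rng(0)
features = rng.normal(size=(200, 5))
features[:, 1] = 2.0 * features[:, 0] + 0.1 * features[:, 1]

pca = PCA(n_components=2)
reduced = pca.fit_transform(features)

# The reduced data keeps all rows but only two columns.
print(reduced.shape)
# Fraction of the total variance captured by each component, and in total.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

A low total suggests that a two-dimensional plot hides much of the structure in the data, so clusters found on the reduced data should be interpreted with care.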

Step 5: Build a clustering model:

This is the most important step of the implementation, as here we build the clustering model of the data. We do this with the DBSCAN class of the Sklearn library: eps sets the neighbourhood radius and min_samples sets the minimum number of points needed within that radius to form a dense region. We use it as follows:

# Creating clustering model of the data using the DBSCAN function and providing parameters in it
db_default = DBSCAN(eps = 0.0375, min_samples = 3).fit(M_principal)
# Labelling the clusters we have created in the dataset
labeling = db_default.labels_
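
Before plotting, it can help to check how many clusters and noise points the model produced. The helper below is a hypothetical sketch (summarize_labels is not a library function); it relies only on the fact that DBSCAN marks noise points with the label -1:

```python
import numpy as nmp


def summarize_labels(labels):
    """Return (number of clusters, number of noise points) for DBSCAN labels."""
    labels = nmp.asarray(labels)
    # Noise points carry the label -1.
    n_noise = int(nmp.sum(labels == -1))
    # Every distinct label other than -1 is one cluster.
    n_clusters = len(set(labels.tolist()) - {-1})
    return n_clusters, n_noise


# Example with made-up labels: three clusters (0, 1, 2) and two noise points.
print(summarize_labels([0, 0, 1, -1, 2, -1]))  # (3, 2)
```

In the tutorial's program, the same helper could be called as summarize_labels(labeling).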

Step 6: Visualize the clustering model:

# Visualization of clustering model by giving different colours
# (this mapping assumes the model produced only the labels 0, 1, 2 and -1;
# any other label would raise a KeyError below)
colours = {}
# Cluster 0 is shown in green
colours[0] = 'g'
# Cluster 1 is shown in black
colours[1] = 'k'
# Cluster 2 is shown in red
colours[2] = 'r'
# Noise points (label -1) are shown in blue
colours[-1] = 'b'
# Creating a colour vector with one colour per data point
cvec = [colours[label] for label in labeling]
# Creating the figure first so that everything is drawn on the same figure
pplt.figure(figsize =(9, 9))
# Construction of the legend handles, one scatter per colour
# Scattering of green colour
g = pplt.scatter(M_principal['C1'], M_principal['C2'], color ='g')
# Scattering of black colour
k = pplt.scatter(M_principal['C1'], M_principal['C2'], color ='k')
# Scattering of red colour
r = pplt.scatter(M_principal['C1'], M_principal['C2'], color ='r')
# Scattering of blue colour
b = pplt.scatter(M_principal['C1'], M_principal['C2'], color ='b')
# Plotting C1 column on the X-Axis and C2 on the Y-Axis,
# with each data point coloured by its cluster label
pplt.scatter(M_principal['C1'], M_principal['C2'], c = cvec)
# Building the legend with the coloured handles and their labels
pplt.legend((g, k, r, b), ('Label M.0', 'Label M.1', 'Label M.2', 'Label M.-1'))
# Showing Visualization in the output
pplt.show()

Output:

As we can see in the output, we have plotted the graph using the data points of the dataset and visualized the clustering by labelling the data points with different colours.
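
The colour dictionary above only covers the labels 0, 1, 2 and -1. As a hedged alternative, the sketch below maps any number of cluster labels to colours by cycling through a palette. The names PALETTE, NOISE_COLOUR and colours_for_labels are hypothetical, and the colour choices are assumptions:

```python
# Colours for clusters 0, 1, 2, ... (reused in order if there are more clusters).
PALETTE = ['g', 'r', 'b', 'c', 'y', 'm']
# Colour reserved for noise points, which DBSCAN labels as -1.
NOISE_COLOUR = 'k'


def colours_for_labels(labels):
    """Return one matplotlib colour code per cluster label."""
    colours = []
    for label in labels:
        if label == -1:
            colours.append(NOISE_COLOUR)
        else:
            colours.append(PALETTE[label % len(PALETTE)])
    return colours


# Example with made-up labels.
print(colours_for_labels([0, 1, -1, 6]))  # ['g', 'r', 'k', 'g']
```

The returned list can be passed as the c argument of pplt.scatter() in place of cvec. With more than six clusters, colours repeat, so a longer palette may be preferable.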

Step 7: Tuning the parameters:

In this step, we will tune the parameters of the model by changing the values that we previously passed to the DBSCAN function. Here, we keep eps at 0.0375 and raise min_samples from 3 to 50, as follows:

# Tuning the parameters of the model inside the DBSCAN function
dts = DBSCAN(eps = 0.0375, min_samples = 50).fit(M_principal)
# Labelling the clusters of data points
labeling = dts.labels_
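
To get a feel for the effect of min_samples, the following sketch runs DBSCAN twice on synthetic data. The data from make_blobs and the eps value are assumptions for illustration and are unrelated to the tutorial's dataset:

```python
import numpy as nmp
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic data: 300 points around 3 centres.
points, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5,
                       random_state=0)

results = {}
for min_samples in (3, 50):
    labels = DBSCAN(eps=0.5, min_samples=min_samples).fit(points).labels_
    # Distinct labels other than -1 are clusters; -1 marks noise.
    n_clusters = len(set(labels.tolist()) - {-1})
    n_noise = int(nmp.sum(labels == -1))
    results[min_samples] = (n_clusters, n_noise)
    print(f"min_samples={min_samples}: "
          f"{n_clusters} clusters, {n_noise} noise points")
```

With eps fixed, raising min_samples makes it harder for a point to qualify as a core point, so the number of noise points can only stay the same or grow.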

Step 8: Visualization of the changes:

Now, after tuning the parameters of the cluster model we created, we will visualize the changes that will come in the cluster by labelling the data points in the dataset with different colours as we have done before.

# Labelling with different colours
# (this mapping assumes the model produced only the labels 0 to 5 and -1)
colours1 = {}
# labelling with Red colour
colours1[0] = 'r'
# labelling with Green colour
colours1[1] = 'g'
# labelling with Blue colour
colours1[2] = 'b'
# labelling with Cyan colour
colours1[3] = 'c'
# labelling with Yellow colour
colours1[4] = 'y'
# labelling with Magenta colour
colours1[5] = 'm'
# labelling noise points (label -1) with Black colour
colours1[-1] = 'k'
# Labelling the data points with the colour variable we have defined
cvec = [colours1[label] for label in labeling]
# Defining all colours that we will use
colors = ['r', 'g', 'b', 'c', 'y', 'm', 'k' ]
# Creating the figure first so that everything is drawn on the same figure
pplt.figure(figsize =(9, 9))
# Creating one scatter per colour to use as legend handles
r = pplt.scatter(
        M_principal['C1'], M_principal['C2'], marker ='o', color = colors[0])
g = pplt.scatter(
        M_principal['C1'], M_principal['C2'], marker ='o', color = colors[1])
b = pplt.scatter(
        M_principal['C1'], M_principal['C2'], marker ='o', color = colors[2])
c = pplt.scatter(
        M_principal['C1'], M_principal['C2'], marker ='o', color = colors[3])
y = pplt.scatter(
        M_principal['C1'], M_principal['C2'], marker ='o', color = colors[4])
m = pplt.scatter(
        M_principal['C1'], M_principal['C2'], marker ='o', color = colors[5])
k = pplt.scatter(
        M_principal['C1'], M_principal['C2'], marker ='o', color = colors[6])
# Plotting column C1 on the X-axis and column C2 on the Y-axis,
# with each data point coloured by its cluster label
pplt.scatter(M_principal['C1'], M_principal['C2'], c = cvec)
# Constructing a legend with the colours we have defined
pplt.legend((r, g, b, c, y, m, k),
           ('Label M.0', 'Label M.1', 'Label M.2', 'Label M.3', 'Label M.4','Label M.5', 'Label M.-1'), # Using different labels for data points
           scatterpoints = 1, # Number of marker points shown per legend entry
           loc ='upper left', # Location of the legend
           ncol = 3, # Number of legend columns
           fontsize = 10) # Size of the font
# Displaying the visualisation of changes in cluster scattering
pplt.show()

Output:

By comparing this output with the earlier plot, we can clearly observe how the clusters of data points change when we tune the parameters of the DBSCAN function. Observing these changes also helps us understand how the DBSCAN algorithm works and how plotting the labelled data points helps in visualizing the clusters present in a dataset.
